[Lldb-commits] [lldb] [lldb] Fix and speedup the `memory find` command (PR #104193)

Wed Aug 28 08:18:33 PDT 2024

labath wrote:

> > It also reads the memory in bulk (up to 1MB)
> 
> ```
> // Maximum number of bytes read (and buffered). We need to read at least
> // `size` bytes for a successful match.
> const size_t max_read_size = std::max<size_t>(size, 0x10000);
> ```
> 
> It seems the minimal chunk is 64KB and the maximal chunk may be very long (more than 1MB).

Depends on how you look at it, I guess. The actual read size will be the minimum of this number and the "remainder of memory to be searched" (`std::min(mem.size(), high - s)`), so `max_read_size` is really a cap on the read size. It's true that the value can be bigger than 64K (or even 1MB). However that can only happen if the string being searched is larger than that value. This is necessary because the search algorithm matches from the end of the pattern.

I wanted to keep the algorithm simple and keep the entire (potential) match in memory at once, as trying to read in the middle of matching would get messy. I don't think anyone searches for strings this long, but if we do want to (better) support that use case, I think that the best solution would be to switch to an algorithm (e.g. [KMP](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm)) which scans the search space in a strictly linear fashion -- as that will make it easy to read the memory in arbitrary increments.

> 
> Something is wrong with the first memory block in tests
> 
> ```
> Got output:
> data found at location: 0x7f6bdf3eb000
> 0x7f6bdf3eb000: 6e 65 65 64 6c 65 00 00 00 00 00 00 00 00 00 00  needle..........
> 0x7f6bdf3eb010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> data found at location: 0x7f6bdf3eb800
> 0x7f6bdf3eb800: 6e 65 65 64 6c 65 00 00 00 00 00 00 00 00 00 00  needle..........
> 0x7f6bdf3eb810: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> data found at location: 0x7f6bdf3edff9
> 0x7f6bdf3edff9: 6e 65 65 64 6c 65 00 50 e0 3e df 6b 7f 00 00 6d  needle.P.>.k...m
> 0x7f6bdf3ee009: dd 41 df 6b 7f 00 00 00 00 00 00 00 00 00 00 40  .A.k...........@
> no more matches within the range.
> 
> Expecting sub string: "data found at location: 0x7f6bdf3e9000" (was not found)
> ```
> 

Interesting. It does work on my machine. I'll try to debug this.

> Note the capacity=0 is a special case here `llvm::SmallVector<uint8_t, 0> mem;` I'd recommend
> 
> ```
> const size_t min_read_size = 0x10000;
> const size_t max_read_size = std::max<size_t>(size, min_read_size);
> llvm::SmallVector<uint8_t, min_read_size> mem;
> ```

That would allocate the 64k on the stack, which isn't good practice (I doubt it will make things faster, and it can cause us to blow the stack). 64k is  1000x larger than the [preffered small vector size](https://github.com/llvm/llvm-project/blob/16910a21ee0fabab2df291e4e5bc18289bd5762d/llvm/include/llvm/ADT/SmallVector.h#L1150C27-L1150C54). In practice there will only ever be one heap allocation here -- first `resize` call on the vector will allocate a ~64k heap block, and all subsequent resizes will just move the end pointer around.

https://github.com/llvm/llvm-project/pull/104193