[Lldb-commits] [PATCH] D24610: LLDB Arm Watchpoints: Use single hardware watchpoint slot to watch multiple bytes where possible

Thu Sep 22 05:57:56 PDT 2016

labath accepted this revision.

================
Comment at: source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp:513-521
@@ -513,1 +512,11 @@
+
+  // Find out how many bytes we need to watch after 4-byte alignment boundary.
+  uint8_t watch_size = (addr & 0x03) + size;
+
+  // We cannot watch zero or more than 4 bytes after 4-byte alignment boundary.
+  if (size == 0 || watch_size > 4)
+    return LLDB_INVALID_INDEX32;
+
+  // Strip away last two bits of address for byte/half-word/word selection.
+  addr &= ~((lldb::addr_t)3);
 
----------------
omjavaid wrote:
> labath wrote:
> > zturner wrote:
> > > This block of code is a bit confusing to me.  Is this equivalent to:
> > > 
> > > ```
> > > lldb::addr_t start = llvm::alignDown(addr, 4);
> > > lldb::addr_t end = addr + size;
> > > if (start == end || (end-start)>4)
> > >   return LLDB_INVALID_INDEX32;
> > > ```
> > I am not sure this is much clearer, especially, as we will later need a separate varaible for `end-start` anyway.
> > 
> > +1 for `llvm::alignDown` though.
> There is significant performance difference when we choose addr = addr & (~0x03); over llvm::alignDown
> 
> It will eventually effect responsiveness if we keep increasing code size like this. 
> 
> Even if we use -Os then alignDown is squeezed down to 8-9 instructions. I havnt tried clang though.
> 
> Instructions needed for addr = addr & (~0x03):
>   4008bd:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
>   4008c1:	48 83 e0 fc          	and    $0xfffffffffffffffc,%rax
>   4008c5:	48 89 45 e8          	mov    %rax,-0x18(%rbp)
> 
> Call penalty for llvm::alignDown
> 400918:	ba 00 00 00 00       	mov    $0x0,%edx
>   40091d:	48 89 ce             	mov    %rcx,%rsi
>   400920:	48 89 c7             	mov    %rax,%rdi
>   400923:	e8 ae 00 00 00       	callq  4009d6 <_Z9alignDownmmm>
>   400928:	49 89 c4             	mov    %rax,%r12
>   40092b:	48 8b 5d d8          	mov    -0x28(%rbp),%rbx
> 
> Disassembly for llvm::alignDown
> 00000000004009d6 <_Z9alignDownmmm>:
>   4009d6:	55                   	push   %rbp
>   4009d7:	48 89 e5             	mov    %rsp,%rbp
>   4009da:	48 89 7d f8          	mov    %rdi,-0x8(%rbp)
>   4009de:	48 89 75 f0          	mov    %rsi,-0x10(%rbp)
>   4009e2:	48 89 55 e8          	mov    %rdx,-0x18(%rbp)
>   4009e6:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
>   4009ea:	ba 00 00 00 00       	mov    $0x0,%edx
>   4009ef:	48 f7 75 f0          	divq   -0x10(%rbp)
>   4009f3:	48 89 55 e8          	mov    %rdx,-0x18(%rbp)
>   4009f7:	48 8b 45 f8          	mov    -0x8(%rbp),%rax
>   4009fb:	48 2b 45 e8          	sub    -0x18(%rbp),%rax
>   4009ff:	ba 00 00 00 00       	mov    $0x0,%edx
>   400a04:	48 f7 75 f0          	divq   -0x10(%rbp)
>   400a08:	48 0f af 45 f0       	imul   -0x10(%rbp),%rax
>   400a0d:	48 89 c2             	mov    %rax,%rdx
>   400a10:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
>   400a14:	48 01 d0             	add    %rdx,%rax
>   400a17:	5d                   	pop    %rbp
>   400a18:	c3                   	retq   
>   400a19:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
> 
> Number of instructions generated for alignDown with gcc -Os
> 
>   400892:	48 8b 6c 24 08       	mov    0x8(%rsp),%rbp
>   400897:	48 8b 4c 24 10       	mov    0x10(%rsp),%rcx
>   40089c:	31 d2                	xor    %edx,%edx
>   40089e:	be c2 0a 40 00       	mov    $0x400ac2,%esi
>   4008a3:	bf a0 11 60 00       	mov    $0x6011a0,%edi
>   4008a8:	48 89 e8             	mov    %rbp,%rax
>   4008ab:	48 f7 f1             	div    %rcx
>   4008ae:	48 0f af c1          	imul   %rcx,%rax
>   4008b2:	48 89 c3             	mov    %rax,%rbx
I tried this with clang-3.6 -O3. The entire alignDown function call compiled down to `andq $-4, %rdi`. I'll leave this up to you, as I think it is readable enough right now, but I don't think we should be afraid of using utility functions like this. I think LLVM cares a lot more about performance than we, so I believe we can rely on them knowing what they're doing. Also we have much bigger higher-level issues affecting performance, the least of which is the change below, where you re-calculate all watchpoints on every resume.

================
Comment at: source/Plugins/Process/Linux/NativeRegisterContextLinux_arm.cpp:552
@@ -559,1 +551,3 @@
 
+  uint8_t current_watch_size, new_watch_size;
+  // Calculate overall size width to be watched by current hardware watchpoint slot.
----------------
omjavaid wrote:
> labath wrote:
> > Looks much better. Any reason for not using `NextPowerOf2` ? Among other things, it is self-documenting, so you do not need the comment above that.
> so llvm::NextPowerOf2 doesnt serve the intended behaviour.
> 
> llvm::NextPowerOf2 returns 2 for 1, 4 for 2 or 3, and 8 for 4.
> 
> We just want to make sure that our new_size is a power of 2 if its already not which will only be the case if size turns out to be 3.
> 
OK, nevermind then.

================
Comment at: source/Plugins/Process/Linux/NativeThreadLinux.cpp:202
@@ +201,3 @@
+  // Invalidate watchpoint index map to re-sync watchpoint registers.
+  m_watchpoint_index_map.clear();
+
----------------
omjavaid wrote:
> labath wrote:
> > If you add this, then the comment below becomes obsolete.
> > 
> > Seems like a pretty elegant solution to the incremental watchpoint update problem. I am wondering whether we need to do it on every resume though. I think it should be enough to do it when a watchpoint gets deleted (`NTL::RemoveWatchpoint`). Also, we should throw out the implementation of `NativeRegisterContextLinux_arm::ClearHardwareWatchpoint` -- it's no longer necessary, and it's not even correct anymore.
> This can be improved for performance I intentionally didn't do it to minimize changes to the generic component. 
I don't care about the performance too much - the code isn't that hot.

What bothers me is that this leaves the code in an inconsistent state - NativeRegisterContextLinux_arm::ClearHardwareWatchpoint thinks it is doing the watchpoint removal "the old way" whereas what it does actually does not matter, as we will nuke the watchpoint registers anyways. Ideally, I'd like to see this done in the opposite order - first switching the code to use the "nuking" approach to removing watchpoints, and after that adding the slot reuse code (at which point it will not need to touch any generic code at all). If you want to do it the other way then go ahead, but I do expect to see a follow-up change to clean this up.


https://reviews.llvm.org/D24610