On Jan 16, 2012, at 8:58 PM, Chandler Carruth wrote:

> On Mon, Jan 16, 2012 at 8:32 PM, Chris Lattner <clattner@apple.com> wrote:
>
>> On Jan 16, 2012, at 5:24 PM, Jakob Stoklund Olesen wrote:
>>
>>> Author: stoklund
>>> Date: Mon Jan 16 19:24:32 2012
>>> New Revision: 148272
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=148272&view=rev
>>> Log:
>>> Add portable bit mask operations to BitVector.
>>>
>>> BitVector uses the native word size for its internal representation.
>>> That doesn't work well for literal bit masks in source code.
>>>
>>> This patch adds BitVector operations to efficiently apply literal bit
>>> masks specified as arrays of uint32_t. Since each array entry always
>>> holds exactly 32 bits, these portable bit masks can be source code
>>> literals, probably produced by TableGen.
>>
>> Out of curiosity, why not arrays of uint64_t? It will be faster on 64-bit platforms, and shouldn't really be a penalty on 32-bit either.

It's a speed/size tradeoff. The typical use case is bit vectors of physical registers, which on x86 means a bit vector with 160 entries. Using uint64_t would mean a 20% space overhead compared to uint32_t.

The inner loop of applyMask was supposed to unroll and optimize down to 64-bit operations, with an unaligned load as the only regression. LLVM's optimizers disagree, though, and we get:

LBB14_2:                                ## %for.body
                                        ## =>This Inner Loop Header: Depth=1
        movl    (%rsi), %ebx
        orq     (%r8,%r9,8), %rbx
        movl    4(%rsi), %eax
        shlq    $32, %rax
        orq     %rbx, %rax
        movq    %rax, (%r8,%r9,8)
        addq    $8, %rsi
        incq    %r9
        addl    $-2, %edi
        cmpl    $1, %edi
        ja      LBB14_2

I couldn't trick LLVM into combining the two 32-bit loads, presumably because of the missing alignment guarantee.

On 32-bit architectures, the applyMask code is much smaller.
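For reference, here is a self-contained model of the loop in question. It is a simplified sketch, not the actual applyMask from BitVector.h, and the name orMaskInto is made up for illustration:

#include <cstdint>
#include <cstdio>

// Stand-in for BitVector's native word on a 64-bit host.
typedef uint64_t BitWord;

// OR a portable mask (an array of uint32_t) into a native-word bit
// vector. Each BitWord consumes two 32-bit mask words; the inner loop
// below is the one that was expected to unroll into a single 64-bit
// load and or.
static void orMaskInto(BitWord *Bits, unsigned NumBitWords,
                       const uint32_t *Mask) {
  for (unsigned i = 0; i != NumBitWords; ++i) {
    BitWord BW = Bits[i];
    // Combine two 32-bit mask words into one 64-bit BitWord. The
    // optimizer sees two separate 32-bit loads of unknown alignment,
    // which is why it emits movl/shlq/orq instead of a single movq.
    for (unsigned B = 0; B != 64; B += 32)
      BW |= BitWord(*Mask++) << B;
    Bits[i] = BW;
  }
}

int main() {
  // A portable bit mask literal, as TableGen might emit it:
  // bits 0, 33, and 63 set.
  static const uint32_t Mask[] = { 0x00000001, 0x80000002 };
  BitWord Bits[1] = { 0 };
  orMaskInto(Bits, 1, Mask);
  printf("%016llx\n", (unsigned long long)Bits[0]); // 8000000200000001
  return 0;
}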
> I wondered the same thing. I also wondered about an array of indices rather than an array of bits. Specifically, an array of indices would seem easier to read, and if they're all likely to be literals, I would expect the optimizer to make them all equivalent...

That would only work when the literals are directly visible to the optimizer. In this case, the mask pointers will be passed through virtual functions. I am not concerned with readability, since the bit masks will be produced by TableGen.

BitVector already supports this use case well with the existing methods. See for example the targets' getReservedRegs() implementations.
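For illustration, here are the two styles side by side. The register numbers are made up, and in a real target the mask array would come from TableGen rather than being written by hand:

#include "llvm/ADT/BitVector.h"
#include <cstdint>

// Existing index-based style, as in the targets' getReservedRegs()
// implementations: set one bit per reserved register number.
llvm::BitVector reservedByIndex(unsigned NumRegs) {
  llvm::BitVector Reserved(NumRegs);
  Reserved.set(4); // e.g. the stack pointer's register number
  Reserved.set(5); // e.g. the frame pointer's register number
  return Reserved;
}

// Mask-based style using the new operations from this commit: the same
// set expressed as a portable uint32_t mask literal and applied with
// setBitsInMask().
llvm::BitVector reservedByMask(unsigned NumRegs) {
  static const uint32_t Mask[] = { (1u << 4) | (1u << 5) };
  llvm::BitVector Reserved(NumRegs);
  Reserved.setBitsInMask(Mask, 1); // 1 = number of 32-bit mask words
  return Reserved;
}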
/jakob