<div dir="ltr">Hi Tim.<div><br></div><div>Nice to hear from you again. It's been a while.</div><div><br></div><div>Let me start by stating the more general problem we are having. Currently there is no range checking</div>
<div>of any kind done for PC relative offsets. This leads to some pretty hilarious results. One can apparently</div><div>encode gigantic branches in meagre 16-bit instructions. For example:</div><div><br></div><div><div><font face="courier new, monospace">echo "b #4294967295" | build/bin/llvm-mc -triple thumbv7 -show-encoding</font></div>
<div><font face="courier new, monospace"><span class="" style="white-space:pre"> </span>.text</font></div><div><font face="courier new, monospace"><span class="" style="white-space:pre"> </span>b<span class="" style="white-space:pre"> </span>#<font color="#ff0000">4294967295</font> @ encoding: [<font color="#ff0000">0xff,0xe7</font>]</font></div>
</div><div><br></div><div>This happens because Thumb and Thumb2 have multiple encodings for certain instructions. The flavours</div><div>vary, but in brief there are four main variates:</div><div><br></div><div>1. Thumb 16-bit predicable instruction taking short offset</div>
<div>2. Thumb 16-bit non predicable instruction taking medium offset</div><div>3. Thumb2 32-bit predicable instruction taking larger offset</div><div>4. Thumb2 32-bit non-predicable instruction taking really large offset</div>
<div><br></div><div>One good example of such instruction is the branch which exhibits all these variants (encodings T1 through T4).</div><div>in be absence of range checking, MC is free to pick any encoding for any offset size. In the case of branch, it</div>
<div>always picks encoding T1 or T2. There are many instructions which have multiple PC-relative forms. A few of </div><div>the top of my head: branch, branch linked, load literal, store literal, adr, preload instruction. </div>
<div><blockquote type="cite" style="color:rgb(80,0,80);font-family:arial,sans-serif;font-size:13px"><div style="font-size:12px">This seems rather general given that it's just used once. I'd suggest<br>either stripping it back or doing a separate refactoring to<br>
consolidate things.</div></blockquote></div><div>The template function I am introducing is simply building infrastructure for this. At this time it is used only once,</div><div>but it will be used by later patches with different template arguments. Had I fixed definition for multiple instructions</div>
<div>in one patch you would have rightly objected and asked me to split them. However, it is true I should have dropped </div><div>a note about the intended use of the template function.</div><div><blockquote type="cite" style="color:rgb(80,0,80);font-family:arial,sans-serif;font-size:13px">
<div style="font-size:12px">Also, unless I'm mistaken its constraints are appropriate for 2s<br>complement arithmetic while ARM usually uses sign/magnitude for its<br>offsets. At the very least this assumption should be commented.</div>
</blockquote></div><div>Please correct me if I'm wrong, but ARM does indeed use 2's complement arithmetic! There is indeed one slight</div><div>deviation in that INT32_MIN is sometimes regarded as negative zero (maybe that's a relic for a distant dinosaur</div>
<div>that used direct representation?) Either way, all ARM cores that I can think of use 2's complement with this little </div><div>occasional quirk. Am I wrong here? </div><div><br></div><div>In terms of offset encoding, let's look :</div>
<div><br></div><div><font face="courier new, monospace">b<span class="" style="white-space:pre"> </span>#<font color="#ff0000">-4</font> @ encoding: [<font color="#ff0000">0xfe</font>,0xe7]<br></font></div>
<div><font face="courier new, monospace">b<span class="" style="white-space:pre"> </span>#<font color="#ff0000">-2</font> @ encoding: [<font color="#ff0000">0xff</font>,0xe7]<br></font></div><div><font face="courier new, monospace">b<span class="" style="white-space:pre"> </span>#<font color="#ff0000">0</font> @ encoding: [<font color="#ff0000">0x00</font>,0xe0]<br>
</font></div><div><font face="courier new, monospace">b<span class="" style="white-space:pre"> </span>#<font color="#ff0000">2</font> @ encoding: [<font color="#ff0000">0x01</font>,0xe0]<br></font></div>
<div><font face="courier new, monospace">b<span class="" style="white-space:pre"> </span>#<font color="#ff0000">4</font> @ encoding: [<font color="#ff0000">0x02</font>,0xe0]</font><br></div><div><br></div>
<div>Keeping in mind that that branch offsets are scaled by one, what is represented are actually -2, -1, 0, 1 and 2</div><div>in precise 2's complement representation.</div><div><blockquote type="cite" style="color:rgb(80,0,80);font-family:arial,sans-serif;font-size:13px">
<div style="font-size:12px">+ return MO.getImm() >> 2;<br><br>I think the preferred approach is to use an already-shifted MCOperand<br>so that invalid instructions can't exist by design. That removes the<br>need for the decoder, but makes your AsmParser render methods (which<br>
currently seem pretty much identical to addImm) useful again. CodeGen<br>may also need changing if it ever generates a truly immediate ADR<br>(doubtful).</div></blockquote></div><div>I don't disagree that having the operand pre-shifted doesn't make it more difficult to generate invalid MCInsts.</div>
<div>But that doesn't mean that invalid instructions can't exist "by design". It is true that "unaligned" immediate</div><div>operands can't exist by design, but an operand can still be invalid while being aligned - for example by being out</div>
<div>of range. </div><div><br></div><div>The only way to guarantee that, at any point in its lifetime a MCInst is not invalid involves:</div><div><br></div><div>1. creating it in a single place, with a single well tested method</div>
<div>2. emitting it from a single place with a single tested method</div><div>3. keeping the instruction constant throughout the lifetime of the MC component</div><div><br></div><div>Guaranteeing this is indeed a tall order! The point here is that whether keeping an integer shifted or unshifted</div>
<div>really doesn't do much for the correctness of anything. </div><div><br></div><div>Leaving this argument aside, I believe that it is actually better to keep the true number of the field in an MCInst</div><div>rather than its machine representation, but I will grant you this is sort of religious issue. It's a war I am not </div>
<div>prepared to fight :)</div><div><br></div><div>For a more practical reasoning please look at the immediate form of AND. It uses "modified immediate" fields.</div><div>These operands are in fact represented as "normal" numbers and then rendered as "modified" 12 or 8 bit fields</div>
<div>at encode time. There is a good reason for it too - imagine debugging MC and wanting to check the value of</div><div>such a field using a debugger! You would stand to chance. It's no longer an issue of shifting some bits, or masking</div>
<div>some bits out. It's is really complex and makes no immediately apparent sense! Also the logic for "decoding" these</div><div>numbers should be moved to the printer instead of the encoder -- but this is an encoding problem!</div>
<div><br></div><div>In the interest of uniformity and readability, the decision of how to represent immediate fields should have a single</div><div>answer. I believe that that answer is "straight, natural" representation, but as I said, I'm not looking to fight a war</div>
<div>over this. Just adding my two pence to the discussion.</div><div><br></div><div>Regards,</div><div>Mihai</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jun 26, 2013 at 6:47 PM, Jim Grosbach <span dir="ltr"><<a href="mailto:grosbach@apple.com" target="_blank">grosbach@apple.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><br><div><div class="im">
<div>On Jun 26, 2013, at 9:37 AM, Tim Northover <<a href="mailto:t.p.northover@gmail.com" target="_blank">t.p.northover@gmail.com</a>> wrote:</div><br><blockquote type="cite"><div style="font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">
Hi Mihail,<br><br>+ template<unsigned width, unsigned scale, bool positive><br>+ bool isMemoryOffset() const {<br><br>This seems rather general given that it's just used once. I'd suggest<br>either stripping it back or doing a separate refactoring to<br>
consolidate things.<br><br>Also, unless I'm mistaken its constraints are appropriate for 2s<br>complement arithmetic while ARM usually uses sign/magnitude for its<br>offsets. At the very least this assumption should be commented.<br>
<br>+ return MO.getImm() >> 2;<br><br>I think the preferred approach is to use an already-shifted MCOperand<br>so that invalid instructions can't exist by design. That removes the<br>need for the decoder, but makes your AsmParser render methods (which<br>
currently seem pretty much identical to addImm) useful again. CodeGen<br>may also need changing if it ever generates a truly immediate ADR<br>(doubtful).<br><br></div></blockquote><div dir="auto"><br></div></div><div dir="auto">
Yes, exactly right.</div><span class=""><font color="#888888"><div dir="auto"><br></div><div dir="auto">-Jim</div></font></span></div></div><br>_______________________________________________<br>
llvm-commits mailing list<br>
<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>
<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>
<br></blockquote></div><br></div></div>