<div dir="ltr">On 9 January 2013 17:10, Nadav Rotem <span dir="ltr">&lt;<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>&gt;</span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

For example:<br>

        "opt -cost-model -analyze dumper.ll -mtriple=thumbv7 -mcpu=cortex-a15"<br>

<br>

I also run the vectorizer with -debug-only=loop-vectorize because it dumps the costs of all of the instructions with different vectorization factors, and it also detects the different kinds of shuffles that we support.<br>

</blockquote><div><br></div><div style>Hi Nadav,</div><div style><br></div><div style>These are great ways of debugging the cost model!</div><div style><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The LoopVectorizerCostModel only predicts which IR will be generated when vectorizing to a specific vector width. It uses TTI to get the cost of each IR instruction. Chandler recently refactored TTI (thanks!) and now TTI is an analysis group. The BasicTTI attempts to handle all of the target-independent logic. It uses the TargetLowering interface to check if the types are legal and how many times large vectors need to be split. Different targets need to implement the cases that the BasicTTI does not catch. One example is the cost of zext &lt;8 x i8&gt; to &lt;8 x i32&gt;, which is custom lowered on some targets.<br>

</blockquote><div><br></div><div style>I'm also thinking about the individual instruction costs (getArithmeticInstrCost, getShuffleCost, etc.). Filling those in is a simple task and easy to parallelize. I have the A9 manual, which lists the cost of all instructions (including NEON and VFP), so that should give us a head start.</div>
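<div style><br></div><div style>A minimal sketch of what a per-instruction cost table with a generic fallback could look like. The names Op, Ty, CostEntry, kCostTable and lookupCost are hypothetical and are not the LLVM API, and the numbers are placeholders, not values from the A9 manual:</div>

```cpp
// Hypothetical opcode and type tags, for illustration only.
enum class Op { Add, Mul, Div };
enum class Ty { v4i32, v2i64 };

// One row of the table: an (opcode, type) pair and its cost.
struct CostEntry {
  Op op;
  Ty ty;
  unsigned cost;
};

// Placeholder costs, NOT taken from the Cortex-A9 manual.
static const CostEntry kCostTable[] = {
    {Op::Add, Ty::v4i32, 1},
    {Op::Mul, Ty::v4i32, 2},
    {Op::Div, Ty::v4i32, 40},
};

// Return the table cost if the (op, ty) pair is listed. Otherwise return
// the caller-supplied fallback, which stands in for the generic estimate.
unsigned lookupCost(Op op, Ty ty, unsigned fallback) {
  for (const CostEntry &e : kCostTable)
    if (e.op == op && e.ty == ty)
      return e.cost;
  return fallback;
}
```

<div style>Each target-specific hook would consult its own table first and defer to the generic estimate only for the cases it does not list.</div>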

<div style><br></div><div style>I'm guessing the costs you already have for Intel and in BasicTTI are "ideal cycle counts", not taking into account the time it takes for results to become available, pipeline stalls, etc. In the end, when the model is complete, the individual numbers won't matter much as long as they scale equally, but for now, while we're still relying on BasicTTI, we should follow a similar approach.</div>
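<div style><br></div><div style>To illustrate the "scale equally" point, here is a rough sketch, assuming the vectorization decision is a simple comparison of summed costs. InstrCost, vectorIsCheaper and scaleCosts are hypothetical names, not the LoopVectorizerCostModel code. Multiplying every cost by the same factor leaves the decision unchanged:</div>

```cpp
#include <cassert>
#include <vector>

// Hypothetical cost pair for one instruction in the loop body.
struct InstrCost {
  unsigned scalar; // cost of the scalar form
  unsigned vector; // cost of the vector form at the chosen width
};

// True when the vector body is cheaper per scalar iteration, i.e. when
// vectorTotal / vf < scalarTotal. Cross-multiplied to avoid division.
bool vectorIsCheaper(const std::vector<InstrCost> &body, unsigned vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  unsigned long scalarTotal = 0, vectorTotal = 0;
  for (const InstrCost &c : body) {
    scalarTotal += c.scalar;
    vectorTotal += c.vector;
  }
  return vectorTotal < scalarTotal * vf;
}

// Multiply every cost in the body by the same factor k.
std::vector<InstrCost> scaleCosts(const std::vector<InstrCost> &body,
                                  unsigned k) {
  std::vector<InstrCost> out;
  out.reserve(body.size());
  for (const InstrCost &c : body)
    out.push_back({c.scalar * k, c.vector * k});
  return out;
}
```

<div style>Both sides of the comparison pick up the same factor k, so only the ratios between costs matter. Mixing numbers in different units (say, BasicTTI defaults with target-specific cycle counts) breaks that property, which is why the units should match for now.</div>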

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

I am not aware of anything that we can do in regard to context switches. Do you mean the cost of moving GPR to NEON? It's a good point. We need to increase the cost of insert/extract vector. It should be easy to model and we have all of the hooks already.<br>

</blockquote><div><br></div><div style>Yes, and pipeline stalls, and intra-instruction behaviour, and A9 oddities, but that's all blue sky ideas for now. I don't think it'll be a hard engineering problem to know where to put the code, but it won't be easy to get some things right without badly breaking others. Let's be conservative for now... ;)</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">We can use the Subtarget when we implement the hooks. This is an example from the ARMTTI</blockquote>

<div><br></div><div style>Yes, this direct access is very convenient. For now, I'll focus on A9 and later we can add the subtleties of each sub-target.</div><div style><br></div><div style>cheers,</div><div style>--renato </div>

</div></div></div>