[PATCH][AArch64] Prefer ldp x, x to ldr q

Mon Aug 4 06:38:47 PDT 2014

Hi Tim,

[cc. Arnold as this affects SLP vectorizer rather than just the cost model]

The attached patch attempts to fix this in a non-hacky way.

The intent is to add explicit modelling of the costs involved in keeping values live over a call site. The patch causes the SLP vectorizer to scan its generated tree bottom-up, keeping track of all live values. When it encounters a call instruction that is not part of the tree, it calls out to a new TTI hook.

Most architectures will use the default (NoTTI) version of this hook, which just returns zero cost, but AArch64 returns the cost of a spill and fill if a 128-bit vector type is used.
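
As a rough illustration of the shape of such a hook (this is not the code from the attached patch; the names LiveType, DefaultCostModel, AArch64LikeCostModel and costOfKeepingLiveOverCall, and the unit spill/fill costs, are all made up for this sketch), the default version returns zero and an AArch64-like override charges one spill plus one fill per live 128-bit vector. The 128-bit restriction fits AAPCS64, where only the low 64 bits of v8-v15 are preserved across a call.

```cpp
#include <vector>

// Hypothetical stand-in for a value's type: total width in bits and
// whether it is a vector. Not an LLVM class.
struct LiveType {
  unsigned Bits;
  bool IsVector;
};

// Default behaviour: keeping values live over a call is modelled as free.
struct DefaultCostModel {
  virtual ~DefaultCostModel() = default;
  virtual unsigned
  costOfKeepingLiveOverCall(const std::vector<LiveType> &) const {
    return 0;
  }
};

// AArch64-like behaviour: every live 128-bit vector needs a spill before
// the call and a fill after it.
struct AArch64LikeCostModel : DefaultCostModel {
  unsigned SpillCost = 1; // assumed unit cost
  unsigned FillCost = 1;  // assumed unit cost

  unsigned
  costOfKeepingLiveOverCall(const std::vector<LiveType> &Tys) const override {
    unsigned Cost = 0;
    for (const LiveType &T : Tys)
      if (T.IsVector && T.Bits == 128)
        Cost += SpillCost + FillCost;
    return Cost;
  }
};
```

With these assumed unit costs, a live set holding one 128-bit vector, one 64-bit vector and one 64-bit scalar is charged 2 by the AArch64-like model and 0 by the default.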

This algorithm is conservative and may not catch all cases. For example:

A:
  X = load ...
  Goto B
B:
  Call ...
  Goto C
C:
  Store X

Because there are no instructions within the SLP tree in block B, it will not see the call instruction. This is a limitation due to the difficulty of finding the "right path" from block C to block A without any helping information. In practice I don't see this as a large limitation - a conservative heuristic is still better than no heuristic (or a badly-modelled heuristic).
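
To make the limitation concrete, here is a toy model of the scan (again not the patch itself; Inst, Block, countLiveAcrossCalls and scan are hypothetical names, and values are tracked by string rather than by llvm::Value). It walks bottom-up, but only through blocks that contain at least one tree instruction, so a call sitting alone in block B is never visited.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified instruction. Not LLVM's Instruction class.
struct Inst {
  std::string Def;               // value defined ("" if none)
  std::vector<std::string> Uses; // values read
  bool IsCall;
  bool InTree;                   // part of the vectorized tree?
};
using Block = std::vector<Inst>;

// Walk one block bottom-up. For each call outside the tree, count how many
// tree values are live across it. Live is carried between blocks.
unsigned countLiveAcrossCalls(const Block &BB, std::set<std::string> &Live) {
  unsigned Count = 0;
  for (auto It = BB.rbegin(); It != BB.rend(); ++It) {
    if (It->IsCall && !It->InTree)
      Count += Live.size();
    if (It->InTree) {
      Live.erase(It->Def); // a definition ends the live range
      Live.insert(It->Uses.begin(), It->Uses.end());
    }
  }
  return Count;
}

// Visit blocks last to first, skipping any block with no tree instruction.
// That skip is what loses the call in the A/B/C example.
unsigned scan(const std::vector<Block> &Blocks) {
  std::set<std::string> Live;
  unsigned Count = 0;
  for (auto B = Blocks.rbegin(); B != Blocks.rend(); ++B) {
    bool HasTreeInst = false;
    for (const Inst &I : *B)
      HasTreeInst = HasTreeInst || I.InTree;
    if (HasTreeInst)
      Count += countLiveAcrossCalls(*B, Live);
  }
  return Count;
}
```

In this model, putting the load, call and store in one block reports one value live across the call, while splitting them into blocks A, B and C as above reports none.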

What do you think?

Cheers,

James

-----Original Message-----
From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-bounces at cs.uiuc.edu] On Behalf Of James Molloy
Sent: 02 August 2014 18:00
To: 'Tim Northover'
Cc: llvm-commits
Subject: RE: [PATCH][AArch64] Prefer ldp x, x to ldr q

Hi Tim,

I didn't do it that way; I did it a more braindead way because I was feeling braindead at the time. However, you've convinced me that it's a hack rather than a fudge. (hack > fudge > "heuristic" > heuristic ?)

I have a prototype patch that adds a target hook to describe the cost of keeping a set of types live over a call. For most targets this would be zero; for AArch64 (128-bit vectors only) it'd be the cost of a spill and fill. I've then taught the SLP vectorizer to find the set of live values at all CallInsts between the first and last instructions, which was more code than I'd hoped.

Erik's recent SLP change has seemingly broken all my testcases though, so in the meantime I've committed the non-contentious DAGCombine part.

Cheers, and thanks for pushing me in the right direction,

James

-----Original Message-----
From: Tim Northover [mailto:tnorthover at apple.com] 
Sent: 01 August 2014 18:03
To: James Molloy
Cc: Tim Northover; Chad Rosier; llvm-commits
Subject: Re: [PATCH][AArch64] Prefer ldp x, x to ldr q

Hi James,

> I just extended it to work with all 128-bit loads, and that caused some bad behaviour.
> 
> What we're saying is (assume one scalar load has a cost of 1):
> <2 x i32> costs 1
> <4 x i32> costs 4

How did you extend it? I’d expect you to return 2 for <4 x i32>, along some kind of 2*<64-bit cost> algorithm, rather than N*<scalar>, if we’re really pretending to model some ldp effect.
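
For illustration only, the two schemes being compared can be written out as below. The function names are invented here, and the cost of 1 for anything that is not 128 bits wide is an assumption taken from the quoted "<2 x i32> costs 1" line.

```cpp
// N * <scalar>: one unit per element of a 128-bit vector, which is what
// the quoted "<4 x i32> costs 4" implies.
unsigned costPerElement(unsigned NumElts, unsigned EltBits) {
  if (NumElts * EltBits != 128)
    return 1; // assumed: narrower loads cost the same as one scalar load
  return NumElts;
}

// 2 * <64-bit cost>: a 128-bit load is modelled as two 64-bit halves
// (the ldp x, x view), regardless of the element type.
unsigned costPerHalf(unsigned NumElts, unsigned EltBits) {
  if (NumElts * EltBits != 128)
    return 1; // assumed, as above
  return 2;
}
```

Under the per-half scheme <4 x i32> and <2 x i64> both cost 2, whereas the per-element scheme gives 4 and 2 for what is the same load.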

> So I think it only applies to <2 x i64> or <2 x double>. And yes, this whole thing is making me feel very dirty inside - if there's a better way, I don't know of it :(

I think if we really can only make it apply to the 64-bit element case, that’s the strongest evidence yet that the whole approach is wrong. How can it be OK to merge loads to form a <4 x i32> ldr, but not a <2 x i64> one? They’re exactly the same instruction.

Cheers.

Tim.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spillcost.diff
Type: application/octet-stream
Size: 6988 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140804/e1e1b920/attachment.obj>