[PATCH][AArch64] Prefer ldp x, x to ldr q

Mon Jul 28 09:47:01 PDT 2014

Hi Tim,

Thanks for the review! You raise a good point.

The *aim* is to model that there is a cost to keeping a Q register live across a call. In fact, it would be significantly better if I were to teach the vectorizer that creating a vector of that type which is live across a call has a cost. That's quite difficult, however (although probably doable with a new hook).

I thought my solution sufficient - I'm not saying that loading 128-bit values is expensive - just that there is no *benefit* in doing so if you're transforming two 64-bit loads into a 128-bit load. The SLP vectorizer's cost model stores differences versus the equivalent scalar sequence, so returning the cost of the two 64-bit loads added together makes the transformation zero cost and zero benefit - that is, the load shouldn't enter into the cost equation at all.
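To make the arithmetic concrete, here is a toy sketch in Python. It is not LLVM code: the function names, the prefer_pairs flag and the unit costs are all made up for illustration, and it only models the "vector cost minus scalar cost" comparison described above.

```python
# Toy model of an SLP-style cost comparison for a group of 64-bit loads.
# Every name and number here is hypothetical; none of it is LLVM's API.

SCALAR_LOAD_COST = 1  # assumed cost of one 64-bit scalar load


def vector_load_cost(num_elts, prefer_pairs):
    """Cost reported for one vector load of num_elts 64-bit elements."""
    if prefer_pairs and num_elts == 2:
        # Report the 128-bit load as costing exactly as much as the
        # two 64-bit loads it would replace.
        return 2 * SCALAR_LOAD_COST
    # Assumed default: a single vector load costs one unit.
    return 1


def slp_load_delta(num_elts, prefer_pairs):
    """Vector cost minus scalar cost; negative looks profitable."""
    scalar_cost = num_elts * SCALAR_LOAD_COST
    return vector_load_cost(num_elts, prefer_pairs) - scalar_cost
```

Under these assumed numbers, slp_load_delta(2, False) is -1 (the load alone argues for vectorizing), while slp_load_delta(2, True) is 0, so the load contributes nothing either way and the decision rests on the rest of the tree.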

While this is a slight fudge, I don't see it as a hack personally.

> The cost also doesn't seem to match what's really going on. At least on Cyclone, cost(ldr qD) == cost(ldp dD1, dD2) == cost(ldr dD1; ldr dD2)/2 (approximately)

Yes, but we need to factor into the cost model what the backend will do - it will convert "ldr dD1; ldr dD2" into an ldp, so the TTI should return the cost of the ldp, surely?
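As a rough illustration of that point, the sketch below plugs in unit costs that follow the approximate Cyclone relation quoted above. It is a toy in Python, not LLVM code; the cost table and function names are assumptions made for the example, and real latencies will differ.

```python
# Toy unit costs following the approximate relation quoted above:
# cost(ldr q) == cost(ldp d, d) == cost(ldr d; ldr d) / 2.
# The table and helper names are hypothetical.

COST = {
    "ldr_d": 1,  # one 64-bit load
    "ldp_d": 1,  # one paired load of two 64-bit values
    "ldr_q": 1,  # one 128-bit load
}


def scalar_pair_cost(backend_forms_ldp):
    """Cost of loading two adjacent 64-bit values as scalars."""
    if backend_forms_ldp:
        # The two ldr instructions are assumed to be merged into one ldp.
        return COST["ldp_d"]
    return 2 * COST["ldr_d"]


def vectorize_delta(backend_forms_ldp):
    """Cost of ldr q minus the scalar sequence it would replace."""
    return COST["ldr_q"] - scalar_pair_cost(backend_forms_ldp)
```

With these assumed numbers, vectorize_delta(False) is -1 but vectorize_delta(True) is 0: once the pairing is accounted for, the 128-bit load has no advantage over the scalar loads.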

Cheers,

James

-----Original Message-----
From: Tim Northover [mailto:t.p.northover at gmail.com] 
Sent: 28 July 2014 17:13
To: Chad Rosier
Cc: James Molloy; Tim Northover; llvm-commits
Subject: Re: [PATCH][AArch64] Prefer ldp x, x to ldr q

> Also, I believe you used Tim's old address.  Forwarding to Tim's 
> current address.

He'd already done that. I'm just not quite sure it's the obviously right thing to do in all cases.

The DAG one is reasonably convincing on its own (as James says, it's close enough to why hasPairedLoad exists). I'd second a LGTM on that one, in fact.

The TTI one, though, looks iffy. For a start it only covers 64-bit element types, while the question seems like it'd be relevant to fusing vectors regardless of source.

The cost also doesn't seem to match what's really going on. At least on Cyclone, cost(ldr qD) == cost(ldp dD1, dD2) == cost(ldr dD1; ldr dD2)/2 (approximately). So it's not that loading a 128-bit value is particularly expensive (which might skew other comparisons where it's used), but that there's special dispensation for pairs of 64-bit values. I don't know about other cores, but James's initial comments suggest it might be similar for some he knows about.

Cheers.

Tim.