[PATCH] D12154: [x86] invert logic for attribute 'FeatureFastUAMem'

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 19 12:24:56 PDT 2015


silvas added a subscriber: silvas.

================
Comment at: lib/Target/X86/X86.td:489
@@ -486,1 +488,3 @@
+                               FeatureFSGSBase, FeatureSlowUAMem]>;
 
+def : Proc<"geode",           [FeatureSlowUAMem, Feature3DNowA]>;
----------------
RKSimon wrote:
> spatel wrote:
> > RKSimon wrote:
> > > You can drop FeatureSlowUAMem for BD targets - the AMD 15h SOG confirms that unaligned load/store performance should be the same as aligned when the address is aligned, and only +1cy when it isn't. It might be more complex for cache-line crossing, but most targets will suffer there, not just BD.
> > Thanks, Simon. Can we make the same argument for AMD 16h? I was planning to fix these up in the next patch and add test cases, since that would be a functional change (FIXME at line 445).
> Yes, I'm happy for any changes to be made in a follow-up patch.
> 
> Jaguar (16h) is definitely as fast for unaligned loads/stores when the address is aligned, and +1cy when it is unaligned.
> 
> IIRC on Bobcat you could do fast unaligned loads (as long as the SSE unaligned flag was set). I think there was something about stores that you had to be careful with, though. This is probably the same for all AMD 10h/12h families.
To expand a bit on what Simon said, for Jaguar, the LD/ST unit performs a (naturally aligned) 16-byte access to L1D each cycle, so the actual rule is that you can be as unaligned as you want as long as you remain within a 16-byte chunk, with no penalty.

Basically, if your store crosses a 16-byte chunk you are forcing the LD/ST unit to do an extra round trip to L1D, which is where the +1cy Simon was talking about comes from. Just considering the mechanism at play here, the compiler can't hope to do anything differently/better in general for the unaligned case, so just let the hardware take care of it if it happens :)
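To make the chunk-crossing rule concrete, here is a minimal C sketch (the helper name and the example addresses are made up for illustration, and the 16-byte chunk size is simply taken from the description above; this is not anything from the patch itself):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: does a `size`-byte access at `addr` straddle a
       naturally aligned 16-byte chunk?  Per the description above, only an
       access that straddles a chunk boundary needs a second trip to L1D on
       Jaguar, i.e. costs the extra cycle. */
    static int crosses_16b_chunk(uintptr_t addr, unsigned size) {
        return (addr & 15u) + size > 16u;
    }

    int main(void) {
        /* 8-byte access at chunk offset 4: fits in one chunk -> no penalty. */
        printf("%d\n", crosses_16b_chunk(0x1004, 8));  /* prints 0 */
        /* 8-byte access at chunk offset 12: spills into the next chunk -> +1cy. */
        printf("%d\n", crosses_16b_chunk(0x100c, 8));  /* prints 1 */
        return 0;
    }

Anything whose offset within the chunk plus access size stays within 16 bytes takes the fast path regardless of its nominal alignment; only accesses that straddle a chunk boundary pay the extra cycle, which is why the compiler has nothing useful to do about it up front.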


http://reviews.llvm.org/D12154