[PATCH] D12154: [x86] invert logic for attribute 'FeatureFastUAMem'

Zia Ansari via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 19 12:31:00 PDT 2015


zansari added a comment.

Hi Sanjay,

Functionality-wise, your changes LGTM (that is, they do what you're intending them to do).

I do think, however, that the code will need a little tweaking down the road. Some random comments, in no particular order (all relating to Intel processors; I'm not too familiar with others), that will hopefully answer some of your questions:

- There are actually a couple of different attributes that we care about:
  - Fast unaligned "instructions": the movups instruction used to be very slow and was always to be avoided before NHM (SLM on the small-core side). Starting with NHM/SLM, unaligned instructions are just as fast as aligned instructions, provided the access doesn't split a cache line.
  - FastER unaligned memory accesses that split a cache line: on those same parts, the penalty associated with accesses that do split a cache line was also reduced.

Since these two attributes are set/unset on the same H/W, we might be able to get away with just the 1 attribute.
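
To make the distinction concrete, here is a minimal C sketch (the flag names are mine, not LLVM's actual subtarget feature names) of the two properties as separate knobs; since they flip together on the Intel parts above, one attribute can stand in for both in practice:

  #include <stdbool.h>

  /* Hypothetical flags -- not the real TableGen feature definitions -- just
     a way to name the two distinct properties described above. */
  struct UnalignedMemFeatures {
    /* movups & friends are as fast as the aligned forms when the access
       doesn't split a cache line (true on NHM/SLM and later). */
    bool fast_unaligned_insns;
    /* Accesses that do split a cache line are cheaper than before, though
       still not free (also true starting with NHM/SLM). */
    bool reduced_split_penalty;
  };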

This means, however, that the attribute name is slightly mislabeled (not a big deal) and, more importantly, that the statement you made, "...became fast enough that we can happily use them whenever we want.", isn't entirely true. It is true when the unaligned instruction doesn't split a cache line, but there is still a penalty when it does. That penalty applies to all 8B and 16B accesses, and to 32B accesses that split either 16B half on anything below HSW (since 32B accesses are double pumped there), or that split anywhere on HSW and above (where we do full 32B accesses on L0).
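
To spell out the split condition (a sketch in C, assuming 64B cache lines; the helper name is mine), an access splits a line exactly when its first and last bytes land in different lines:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define CACHE_LINE 64u

  /* True if a `size`-byte access starting at `addr` touches two cache lines. */
  static bool splits_cache_line(uintptr_t addr, size_t size)
  {
      return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
  }

  /* e.g. with a 64B-aligned base:
       splits_cache_line(48, 32) == true   (bytes 48..79 straddle offset 64)
       splits_cache_line(64, 32) == false  (bytes 64..95 stay in one line)  */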

We can be more "clumsy" with unaligned memory accesses on modern H/W, but we can't completely ignore splits. For example, take this loop:

  loop:
      vmovups %ymm0, array+64(%rcx)    # 0 mod 32
      vmovups %ymm0, array+128(%rcx)   # 0 mod 32

compared with this one:

  loop:
      vmovups %ymm0, array+48(%rcx)    # 16 mod 32
      vmovups %ymm0, array+112(%rcx)   # 16 mod 32

The second loop, whose 32B stores split cache lines, is around 3x slower on HSW. On anything below HSW the two loops perform equally, due to double pumping of the accesses, but they suffer similar slowdowns if the 16B halves split a cache line (just as with other pure 16B references).

Doing this:

  loop:
      vmovups %xmm0, array+48(%rcx)
      vmovups %xmm0, array+64(%rcx)
      vmovups %xmm0, array+112(%rcx)
      vmovups %xmm0, array+128(%rcx)

...puts performance right in between the two loops above.
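
For reference, here is roughly what the three variants above look like from C with AVX/SSE intrinsics (just a sketch; `array` and its assumed 64B alignment are mine, chosen so the byte offsets match the asm):

  #include <immintrin.h>

  /* float is 4 bytes, so array + 12 is byte offset 48, array + 16 is 64, etc.
     Assumes `array` itself is 64B-aligned, as in the examples above. */
  void store_patterns(float *array, __m256 y, __m128 x)
  {
      /* Variant 1: 0 mod 32 -- unaligned store instruction, no split. */
      _mm256_storeu_ps(array + 16, y);   /* byte offset  64 */
      _mm256_storeu_ps(array + 32, y);   /* byte offset 128 */

      /* Variant 2: 16 mod 32 -- each 32B store splits a cache line on HSW. */
      _mm256_storeu_ps(array + 12, y);   /* byte offset  48 */
      _mm256_storeu_ps(array + 28, y);   /* byte offset 112 */

      /* Variant 3: the same bytes as variant 2, written as 16B stores so
         that no individual store crosses a line. */
      _mm_storeu_ps(array + 12, x);      /* byte offset  48 */
      _mm_storeu_ps(array + 16, x);      /* byte offset  64 */
      _mm_storeu_ps(array + 28, x);      /* byte offset 112 */
      _mm_storeu_ps(array + 32, x);      /* byte offset 128 */
  }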

My initial thoughts on heuristics (a rough sketch of this decision logic follows the list):

- In general, I think that if we know what the alignment is and we know we will split a cache line, we should use 2 instructions to avoid any penalties.
- If we don't know the alignment and want to minimize ld/st counts by using larger instructions, it can be worth the gamble on NHM/SLM+ architectures.
- On H/W before NHM/SLM, we should avoid unaligned instructions.
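
Here is that decision logic as a rough C sketch (a hypothetical helper, not the actual lowering code; `has_fast_unaligned` stands in for whatever the inverted attribute ends up being called):

  #include <stdbool.h>
  #include <stddef.h>

  #define CACHE_LINE 64

  /* offset_mod_line: the access address modulo the cache-line size if known
     at compile time, or -1 if unknown.  Returns true to emit one unaligned
     wide op, false to split it into two narrower ops. */
  static bool use_single_unaligned_op(size_t size, int offset_mod_line,
                                      bool has_fast_unaligned)
  {
      if (!has_fast_unaligned)
          return false;      /* pre-NHM/SLM: avoid unaligned instructions */

      if (offset_mod_line < 0)
          return true;       /* unknown placement: gamble on no split */

      /* Known placement: keep one op only if it provably stays in one line. */
      return offset_mod_line + (int)size <= CACHE_LINE;
  }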

Hope this helps, and sorry for the long response.

Thanks,
Zia.


http://reviews.llvm.org/D12154




