[PATCH] D23909: [X86] Remove DenseMap for storing FMA3 grouping information

Mon Aug 29 16:23:20 PDT 2016

v_klochkov added a comment.

`Hi Craig,

Unfortunately, you were not aware of my bigger plans for the classes X86InstrFMA3Info/X86InstrFMA3Group (yo planned to remove them in your change-set).

I have some parts that I like in your patch, 
for example, the usage of TSFlags instead of KMergeMasked, KZeroMasked bits in X86InstrFMA3Group::Attrubutes,

There are though some serious concerns from my side as your patch would ruin my plans to re-use the classes X86InstrFMA3Info/X86InstrFMA3Group (added previously in https://reviews.llvm.org/rL278431) 
in some new very advanced FMA optimization I am developing right now.
Please, see my concerns in (2) and especially (3) below.

Also, adding 3 bits to TSFlags specifically for FMA3 opcodes probably would require X86 component owner's attention/review.

At the end of this message I added some ideas how to make the current classes X86InstrFMA3{Info/Group} a bit more efficient and better usable in future.

1. Minor. It is good that the DenseMap is replaced with static arrays and there is no initialization overhead now.

  The opcodes in the arrays are sorted right after initialization, and you rely on the order in which the opcodes are created for *.td files. That seems a bit hacky/risky to me, but is Ok as you added is_sorted() checks.

  Having explicit call to std::sort()-like functions to eliminate such assumptions would add the overhead and would kill the main point in having these changes.

2. Major. FMA3 is just a small sub-set of X86 instructions. I think FMA3 instructions do not deserve 3 bits in TSFlags. You used bits #55,56,57. Only 6 bits left for some probably more important features, after that the bit field should be extended and that would affect all opcodes.

  a) 2 bits: {NotFMA3, FMA3_132, FMA3_213, FMA3_231}, I don't have very strong opinion about these 2 bits for FMA3 but think this is too many bits for just FMA3 opcodes.

  b) The 1 bit 'FMA3Intrinsic' is definitely too big gift for FMA3 opcodes. I would rather vote for 'IsIntrinsic' bit, which would be set for all *_Int opcodes such as 'ADDSDrr_Int', etc.

3. MAJOR. I am implementing a new advanced FMA optimization and planned to add more bits to X86InstrFMA3Group::Attributes (being removed in change-set).

  a) I planned to add information about signs (i.e. FMA, FMS, FNMA, FNMS, FMADDSUB, FMSUBADD) Adding new fields to X86InstrFMA3Group::Attributes is quite natural and Ok, but adding 3 more bits to TSFlags is not an option. 	 b) The optimization also needs to operate with MVT and I planned to add an MVT field to X86InstrFMA3Group::Attributes as well. Currently, I don't see how to extract information from Opcode about the number and size of processed elements. (i.e. f32, f64, v4f32, v2f64, ...).

  So, with (a) and (b) that optimization would do requests to X86InstrFMA3Info to get FMA opcodes: unsigned getFMA213Opcode(bool IsEVEX, bool MulSign, bool AddSign, MVT VT, bool KMasked=false, bool KZMasked=false); or alternatively: unsigned getFMA213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false); unsigned getFMS213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false); unsigned getFNMA213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false); unsigned getFNMS213Opcode(bool IsEVEX, MVT VT, bool KMasked=false, bool KZMasked=false);

  for example, bool UseEVEX = HasEVEX && HasVL; unsigned Opc213 = getFMA213Opcode(UseEVEX, true, false, MVT::v4f64); // Opc213 is initialized with (UseEVEX ? VFNMADD213PDZ256r : VFNMADD213PDYr).

Some ideas:
I was thinking about using a different approach to reduce the number
of small static arrays/FMA3Groups (it was one of your major concerns in https://reviews.llvm.org/rL278431)
FMA3Groups could be consisting of the bigger number of fields, i.e.:

  // VEX (AVX2) opcodes.
  uint16_t VEX_Reg132, VEX_Reg213, VEX_Reg231;
  uint16_t VEX_Mem132, VEX_Mem213, VEX_Mem231;

  // EVEX opcodes.
  uint16_t EVEX_Reg132, EVEX_Reg213, EVEX_Reg231;
  uint16_t EVEX_Mem132, EVEX_Mem213, EVEX_Mem231;

  // k-masked opcodes.
  uint16_t EVEX_KReg132, EVEX_KReg213, EVEX_KReg231;
  uint16_t EVEX_KMem132, EVEX_KMem213, EVEX_KMem231;

  // k-zero-masked opcodes.
  uint16_t EVEX_KZReg132, EVEX_KZReg213, EVEX_KZReg231;
  uint16_t EVEX_KZMem132, EVEX_KZMem213, EVEX_KZMem231;

  // EVEX.B: Opcodes with explicit round control and with broadcast.
  uint16_t EVEX_ERound132, EVEX_ERound213, EVEX_ERound231;
  uint16_t EVEX_Broadcast132, EVEX_Broadcast213, EVEX_Broadcast231;

  unsigned Attributes; // MVT + {FMA,FMS,FNMA,FNMS,FMADDSUB,FMSUBADD} + IsIntrisic + ...

Also, FMA3Groups could even include *_Int opcodes as well. 
(i.e. uint16_t VEX_Reg132, VEX_Reg213,..., VEX_Reg132_Int, VEX_Reg213_Int,...;)

Making groups bigger would improve the search of the FMA opcodes mentioned in (3) above
(i.e. getFMA213Opcode() method), because there would be quite small number of FMA groups then, and even linear search would probably be Ok.

Thank you,
Vyacheslav Klochkov
`

https://reviews.llvm.org/D23909