[PATCH] D27682: AMDGPU: Add replacement export intrinsics

Tue Dec 13 06:55:39 PST 2016

mareko added inline comments.

================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:451
+  llvm_float_ty,  // src2
+  llvm_float_ty,  // src2
+  llvm_i1_ty,     // done
----------------
The second comment should read // src3; // src2 is duplicated.

================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:463
+  llvm_v2f16_ty, // src0
+  llvm_v2f16_ty, // src1
+  llvm_i1_ty,    // done
----------------
v2f16 isn't the best choice here. The compressed export can be used with one of these types:
- v2f16
- v2i16
- v2u16

Using i32 would be better, because the last two are packed as i32 anyway. The right choice really depends on the output type of the packing instructions. This is the complete list of instructions we should be using for compressed exports:
- v_cvt_pkrtz_f16_f32
- v_cvt_pknorm_u16_f32
- v_cvt_pknorm_i16_f32
- v_cvt_pk_u16_u32
- v_cvt_pk_i16_i32
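
To illustrate why i32 is a natural container for all three element types, here is a minimal Python sketch of two 16-bit lanes sharing one 32-bit word. The helper names (pack_16x2, unpack_16x2, f16_bits) are hypothetical, and placing the first lane in the low half is an assumption of this sketch, not something the patch specifies. The f16 helper uses Python's default rounding, which differs from the round-toward-zero behavior implied by v_cvt_pkrtz_f16_f32, so only exactly representable values are meaningful here.

```python
import struct


def f16_bits(value):
    """Return the raw 16-bit pattern of `value` encoded as IEEE half precision."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def pack_16x2(lo, hi):
    """Pack two 16-bit lanes into one 32-bit word.

    Assumption of this sketch: `lo` occupies bits 0-15, `hi` bits 16-31.
    """
    if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
        raise ValueError("each lane must fit in 16 bits")
    return (hi << 16) | lo


def unpack_16x2(word):
    """Split a 32-bit word back into its (lo, hi) 16-bit lanes."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


# Integer lanes and half-float lanes end up in the same 32-bit container.
packed_ints = pack_16x2(0x1234, 0xABCD)
packed_halves = pack_16x2(f16_bits(1.0), f16_bits(-2.0))
```

Whether the lanes hold f16, i16 or u16 bit patterns, the packed result is the same kind of 32-bit value, which is the argument for an i32 operand.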

================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:466
+  llvm_i1_ty],   // vm
+  [IntrInaccessibleMemOnly]
+>;
----------------
While IntrInaccessibleMemOnly makes sense for EXP in theory, in practice we might need something more restrictive. The first executed EXP instruction limits parallelism and therefore reduces the ability to hide latencies: the first EXP triggers EXP_ALLOC, and if there is not enough EXP memory, the wave has to wait. So we don't want to move the first EXP across any load or store that's above it.

Depending on the chip and other parameters, EXP_ALLOC is sometimes done at wave launch, in which case the EXP scheduling doesn't matter. These are the only cases where EXP_ALLOC is done at wave launch:
- SI: all vertex shaders (not configurable)
- CIK-VI: all vertex shaders if the number of good CUs is <= 4 (e.g. Kabini, Mullins, Stoney, some Kaveri chips, Carrizo B4); this is configurable via a context register.
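
The two cases above can be written as a small predicate. This is only a sketch in Python: the function name, the generation strings and the parameters are hypothetical, "CIK-VI" is read as covering both CIK and VI, and the context register override is not modeled.

```python
def exp_alloc_at_wave_launch(generation, is_vertex_shader, good_cus):
    """Return True if EXP_ALLOC is done at wave launch under the rules listed above.

    `generation` is one of "SI", "CIK" or "VI" (labels chosen for this sketch).
    `good_cus` is the number of good CUs on the chip.
    """
    if generation not in ("SI", "CIK", "VI"):
        raise ValueError("unknown generation: %r" % (generation,))
    if not is_vertex_shader:
        # Only vertex shaders are listed as allocating at wave launch.
        return False
    if generation == "SI":
        # SI: all vertex shaders, not configurable.
        return True
    # CIK-VI: only when the chip has at most 4 good CUs.
    return good_cus <= 4
```

When the predicate is true, the position of the first EXP should not affect latency hiding, so any scheduling restriction would only matter for the remaining cases.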

https://reviews.llvm.org/D27682