[PATCH] D27682: AMDGPU: Add replacement export intrinsics
Marek Olšák via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Dec 13 06:55:39 PST 2016
mareko added inline comments.
================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:451
+ llvm_float_ty, // src2
+ llvm_float_ty, // src2
+ llvm_i1_ty, // done
----------------
This one should be "// src3".
================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:463
+ llvm_v2f16_ty, // src0
+ llvm_v2f16_ty, // src1
+ llvm_i1_ty, // done
----------------
v2f16 isn't the best choice here. The compressed export can be used with one of these types:
- v2f16
- v2i16
- v2u16
Using i32 would be better, because the last two are packed as i32 anyway. It really depends on the output type of the packing instructions. This is the complete list of instructions we should be using for compressed exports:
- v_cvt_pkrtz_f16_f32
- v_cvt_pknorm_u16_f32
- v_cvt_pknorm_i16_f32
- v_cvt_pk_u16_u32
- v_cvt_pk_i16_i32
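To illustrate why i32 is a natural carrier type for all three cases, here is a minimal sketch of two 16-bit lanes sharing one 32-bit word. The helper names (pack_2x16, unpack_2x16) are hypothetical, and the lane order (first operand in the low half) is an assumption for illustration. It models only the bit layout, not the conversion or rounding behaviour of the instructions listed above.

```python
def pack_2x16(lo: int, hi: int) -> int:
    """Pack two 16-bit lanes into one 32-bit word.

    Assumed layout: `lo` occupies bits 0-15, `hi` occupies bits 16-31.
    Inputs are masked to 16 bits, so signed lanes are stored in
    two's-complement form (e.g. -1 becomes 0xFFFF).
    """
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack_2x16(word: int) -> tuple[int, int]:
    """Split a 32-bit word back into its (lo, hi) unsigned 16-bit lanes."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


# The same 32-bit container holds the raw bits whether the lanes are
# interpreted as f16, i16 or u16; only the interpretation differs.
packed = pack_2x16(0x1234, 0xABCD)
lanes = unpack_2x16(packed)
```

Whatever the lane interpretation, the value that reaches the export is a plain 32-bit pattern, which is the point behind preferring i32 operands.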
================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:466
+ llvm_i1_ty], // vm
+ [IntrInaccessibleMemOnly]
+>;
----------------
While IntrInaccessibleMemOnly makes sense for EXP in theory, in practice we might need something more restrictive. The first executed EXP instruction limits parallelism and therefore reduces the ability to hide latency: the first EXP triggers EXP_ALLOC, and if there is not enough EXP memory, the wave has to wait. So we don't want to move the first EXP across any load or store that's above it.
Depending on the chip and other parameters, EXP_ALLOC is sometimes done at wave launch, in which case the EXP scheduling doesn't matter. These are the only cases where EXP_ALLOC is done at wave launch:
- SI: all vertex shaders (not configurable)
- CIK-VI: all vertex shaders if the number of good CUs is <= 4 (e.g. Kabini, Mullins, Stoney, some Kaveri chips, Carrizo B4). This is configurable via a context register.
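The two cases above can be written as a small predicate. This is only a sketch: the function name, the string labels for generation and shader stage, and the launch_alloc_enabled parameter (standing in for the context register setting) are all hypothetical and do not come from the driver or from LLVM.

```python
def exp_alloc_at_wave_launch(generation: str, stage: str, good_cus: int,
                             launch_alloc_enabled: bool = True) -> bool:
    """Return True if EXP_ALLOC is assumed to happen at wave launch.

    generation: "SI", "CIK" or "VI" (hypothetical labels).
    stage: shader stage, e.g. "vertex" or "pixel" (hypothetical labels).
    good_cus: number of good compute units on the chip.
    launch_alloc_enabled: hypothetical stand-in for the context register
        that configures this behaviour on CIK-VI.
    """
    if stage != "vertex":
        return False  # only vertex shaders are listed
    if generation == "SI":
        return True  # not configurable on SI
    if generation in ("CIK", "VI"):
        return good_cus <= 4 and launch_alloc_enabled
    return False  # anything else is outside the cases listed


# When this returns True, EXP scheduling does not matter; otherwise the
# first EXP should stay below the loads and stores that precede it.
needs_careful_scheduling = not exp_alloc_at_wave_launch("VI", "vertex", 8)
```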
https://reviews.llvm.org/D27682