[llvm] [AMDGPU] Reschedule loads in clauses to improve throughput (RFC) (PR #102595)
Carl Ritson via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 12 05:47:57 PDT 2024
perlfu wrote:
Memory operation clustering enforces ordering in two ways:
1. Reordering while clustering is off by default, so no reordering is allowed. We could change this and enable reordering for loads.
2. The implicit order drives scheduling when other factors are discounted. For example, when a sequence of loads depends on a value computed by a VALU instruction immediately beforehand, the loads are all in the stall state, so the implicit order of the cluster tends to determine the schedule order. See the case of v10 in the second example below.
Below are examples of the output I am seeing, one per configuration.
Default output:
```
s_clause 0x1f
image_load v17, v[56:58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v19, [v56, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v27, [v56, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v28, [v56, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v10, [v56, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v9, [v56, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v29, [v56, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v30, [v13, v57, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v31, [v13, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v32, [v13, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v33, [v13, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v34, [v13, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v35, [v13, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v36, [v13, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v37, [v56, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v38, [v56, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v39, [v56, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v40, [v56, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v41, [v56, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v42, [v56, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v43, [v56, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v44, [v13, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v45, [v13, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v46, [v13, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v47, [v13, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v48, [v13, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v49, [v13, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v50, [v13, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v51, [v56, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v52, [v13, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v53, [v56, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v54, [v13, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
s_waitcnt vmcnt(25)
v_max3_f32 v29, v9, 0, v29
v_lshlrev_b32_e32 v9, 1, v6
s_waitcnt vmcnt(18)
v_max3_f32 v35, v35, 0, v36
s_waitcnt vmcnt(11)
v_max3_f32 v36, v42, 0, v43
s_waitcnt vmcnt(5)
v_max3_f32 v42, v48, 0, v49
s_waitcnt vmcnt(3)
v_max3_f32 v29, v29, v10, v51
s_waitcnt vmcnt(2)
v_max3_f32 v34, v35, v34, v52
s_waitcnt vmcnt(1)
v_max3_f32 v35, v36, v41, v53
s_waitcnt vmcnt(0)
v_max3_f32 v36, v42, v50, v54
v_mad_u32_u24 v10, 0x90, v7, v55
v_max3_f32 v17, v29, v17, v19
v_max3_f32 v29, v34, v30, v31
v_max3_f32 v30, v35, v37, v38
v_max3_f32 v31, v36, v44, v45
v_add_nc_u32_e32 v19, 2, v3
v_max3_f32 v27, v17, v27, v28
v_max3_f32 v28, v29, v32, v33
v_max3_f32 v29, v30, v39, v40
v_add_nc_u32_e32 v17, 3, v3
v_max3_f32 v30, v31, v46, v47
ds_store_2addr_b32 v10, v27, v29 offset1:1
ds_store_2addr_b32 v10, v28, v30 offset0:18 offset1:19
```
Reorder while clustering enabled:
```
s_clause 0x1f
image_load v10, v[55:57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v9, [v55, v56, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v16, [v55, v56, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v17, [v13, v14, v57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v27, v[13:15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v28, [v13, v14, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v29, [v13, v56, v57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v30, [v13, v56, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v31, [v13, v56, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v32, [v13, v56, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v33, [v13, v56, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v34, [v13, v56, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v35, [v13, v56, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v36, [v13, v56, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v37, [v55, v14, v57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v38, [v55, v14, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v39, [v55, v14, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v40, [v13, v14, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v41, [v55, v56, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v42, [v55, v56, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v43, [v55, v56, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v44, [v55, v14, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v45, [v13, v14, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v46, [v13, v14, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v47, [v55, v14, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v48, [v55, v14, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v49, [v55, v56, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v50, [v55, v56, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v51, [v13, v14, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v52, [v13, v14, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v53, [v55, v14, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v54, [v55, v14, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
s_waitcnt vmcnt(29)
v_max3_f32 v16, v16, 0, v9
v_lshlrev_b32_e32 v9, 1, v6
s_waitcnt vmcnt(26)
v_max3_f32 v27, v28, 0, v27
s_waitcnt vmcnt(23)
v_max3_f32 v30, v31, 0, v30
v_lshlrev_b32_e32 v31, 3, v6
s_waitcnt vmcnt(22)
s_delay_alu instid0(VALU_DEP_2)
v_max3_f32 v29, v30, v32, v29
s_waitcnt vmcnt(15)
v_max3_f32 v28, v39, 0, v38
s_waitcnt vmcnt(14)
v_max3_f32 v17, v27, v40, v17
s_waitcnt vmcnt(13)
v_max3_f32 v16, v16, v41, v10
v_mad_u32_u24 v10, 0x90, v7, v31
s_waitcnt vmcnt(10)
v_max3_f32 v27, v28, v44, v37
v_max3_f32 v28, v29, v33, v34
v_max3_f32 v16, v16, v42, v43
s_waitcnt vmcnt(8)
v_max3_f32 v29, v17, v45, v46
v_add_nc_u32_e32 v17, 2, v3
s_waitcnt vmcnt(6)
v_max3_f32 v27, v27, v47, v48
v_max3_f32 v28, v28, v35, v36
s_waitcnt vmcnt(4)
v_max3_f32 v30, v16, v49, v50
s_waitcnt vmcnt(2)
v_max3_f32 v29, v29, v51, v52
v_add_nc_u32_e32 v16, 3, v3
s_waitcnt vmcnt(0)
v_max3_f32 v27, v27, v53, v54
ds_store_2addr_b32 v10, v28, v29 offset1:1
ds_store_2addr_b32 v10, v30, v27 offset0:18 offset1:19
```
This patch (reorder after clustering and register allocation):
```
s_clause 0x1f
image_load v9, [v56, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v29, [v56, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v35, [v13, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v36, [v13, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v42, [v56, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v43, [v56, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v49, [v13, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v48, [v13, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v10, [v56, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v51, [v56, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v52, [v13, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v34, [v13, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v41, [v56, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v53, [v56, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v50, [v13, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v54, [v13, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v19, [v56, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v17, v[56:58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v31, [v13, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v30, [v13, v57, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v38, [v56, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v37, [v56, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v44, [v13, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v45, [v13, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v28, [v56, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v27, [v56, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v33, [v13, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v32, [v13, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v40, [v56, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v39, [v56, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v46, [v13, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
image_load v47, [v13, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
s_waitcnt vmcnt(30)
v_max3_f32 v29, v9, 0, v29
v_lshlrev_b32_e32 v9, 1, v6
s_waitcnt vmcnt(28)
v_max3_f32 v35, v35, 0, v36
s_waitcnt vmcnt(26)
v_max3_f32 v36, v42, 0, v43
s_waitcnt vmcnt(24)
v_max3_f32 v42, v48, 0, v49
s_waitcnt vmcnt(22)
v_max3_f32 v29, v29, v10, v51
v_mad_u32_u24 v10, 0x90, v7, v55
s_waitcnt vmcnt(20)
v_max3_f32 v34, v35, v34, v52
s_waitcnt vmcnt(18)
v_max3_f32 v35, v36, v41, v53
s_waitcnt vmcnt(16)
v_max3_f32 v36, v42, v50, v54
s_waitcnt vmcnt(14)
v_max3_f32 v17, v29, v17, v19
v_add_nc_u32_e32 v19, 2, v3
s_waitcnt vmcnt(12)
v_max3_f32 v29, v34, v30, v31
s_waitcnt vmcnt(10)
v_max3_f32 v30, v35, v37, v38
s_waitcnt vmcnt(8)
v_max3_f32 v31, v36, v44, v45
s_waitcnt vmcnt(6)
v_max3_f32 v27, v17, v27, v28
v_add_nc_u32_e32 v17, 3, v3
s_waitcnt vmcnt(4)
v_max3_f32 v28, v29, v32, v33
s_waitcnt vmcnt(2)
v_max3_f32 v29, v30, v39, v40
s_waitcnt vmcnt(0)
v_max3_f32 v30, v31, v46, v47
ds_store_2addr_b32 v10, v27, v29 offset1:1
ds_store_2addr_b32 v10, v28, v30 offset0:18 offset1:19
```
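To make the difference between the listings easier to see, here is a minimal Python sketch that derives the `s_waitcnt vmcnt(N)` values from the load issue order. It is a simplified model, not part of the patch: it assumes the loads in the clause return in issue order, so `vmcnt(N)` guarantees that all but the last N issued loads have completed. The helper names `required_vmcnt` and `wait_counts` are hypothetical; the register lists are copied from the default and patched listings above.

```python
# Destination registers of the 32 image_load instructions, in issue order.
DEFAULT_ORDER = [
    "v17", "v19", "v27", "v28", "v10", "v9", "v29", "v30",
    "v31", "v32", "v33", "v34", "v35", "v36", "v37", "v38",
    "v39", "v40", "v41", "v42", "v43", "v44", "v45", "v46",
    "v47", "v48", "v49", "v50", "v51", "v52", "v53", "v54",
]
PATCH_ORDER = [
    "v9", "v29", "v35", "v36", "v42", "v43", "v49", "v48",
    "v10", "v51", "v52", "v34", "v41", "v53", "v50", "v54",
    "v19", "v17", "v31", "v30", "v38", "v37", "v44", "v45",
    "v28", "v27", "v33", "v32", "v40", "v39", "v46", "v47",
]

# Loaded operands read by the first five v_max3_f32 instructions.
# Accumulator operands (results of earlier VALU ops) are left out.
CONSUMERS = [
    ("v9", "v29"),
    ("v35", "v36"),
    ("v42", "v43"),
    ("v48", "v49"),
    ("v10", "v51"),
]


def required_vmcnt(load_order, operands):
    """Largest vmcnt that still guarantees every operand has been loaded.

    Assumption: loads complete in issue order, so the wait is decided by
    the operand whose load was issued last.
    """
    position = {reg: idx for idx, reg in enumerate(load_order)}
    latest = max(position[reg] for reg in operands)
    return len(load_order) - 1 - latest


def wait_counts(load_order, consumers):
    """Required vmcnt for each consumer, in program order."""
    return [required_vmcnt(load_order, ops) for ops in consumers]


if __name__ == "__main__":
    print("default:", wait_counts(DEFAULT_ORDER, CONSUMERS))
    print("patch:  ", wait_counts(PATCH_ORDER, CONSUMERS))
```

Under this model the default order gives 25, 18, 11, 5, 3 for the first five consumers, and the patched order gives 30, 28, 26, 24, 22, which matches the `s_waitcnt` values in the listings. In the patched order each consumer waits only on the next two loads in the clause, so the first VALU instruction can issue after two loads return rather than seven.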
https://github.com/llvm/llvm-project/pull/102595