[llvm] [AMDGPU] Reschedule loads in clauses to improve throughput (RFC) (PR #102595)

Carl Ritson via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 12 05:47:57 PDT 2024


perlfu wrote:

Memory operation clustering enforces ordering in two ways:
1. Reordering while clustering is off by default, so no reordering is allowed. We could change this and enable reordering for loads.
2. The implicit order drives scheduling when other factors are discounted. For example, when a sequence of loads depends on a value computed by a VALU instruction immediately beforehand, they are all in the stall state, so the implicit order of the cluster tends to determine the schedule order. See the case of v10 in example 2 below.
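
The tie-breaking effect in point 2 can be illustrated with a toy model. This is only a sketch and not LLVM's scheduler code: `pick_order`, the candidate tuples, and the stall values are all hypothetical, and the model assumes that stall cycles are the only factor compared before falling back to the original order.

```python
def pick_order(candidates):
    """Order candidates by stall cycles, then by original position.

    candidates: list of (original_index, stall_cycles) tuples.
    This is a toy model, not the heuristic chain of a real scheduler.
    """
    ranked = sorted(candidates, key=lambda c: (c[1], c[0]))
    return [index for index, _ in ranked]


# Four loads all waiting on the same VALU result: the stalls are equal,
# so the original (cluster) order decides the result.
tied = [(2, 4), (0, 4), (3, 4), (1, 4)]
print(pick_order(tied))   # [0, 1, 2, 3]

# If one load has no stall, it is picked first; the rest keep their order.
mixed = [(0, 4), (1, 4), (2, 0), (3, 4)]
print(pick_order(mixed))  # [2, 0, 1, 3]
```

When every candidate ties on the stall metric, nothing distinguishes the loads except their position in the cluster, which is the situation described above.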

Below is an example of the output I am seeing.

Default output:
```
  s_clause 0x1f
  image_load v17, v[56:58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v19, [v56, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v27, [v56, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v28, [v56, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v10, [v56, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v9, [v56, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v29, [v56, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v30, [v13, v57, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v31, [v13, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v32, [v13, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v33, [v13, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v34, [v13, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v35, [v13, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v36, [v13, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v37, [v56, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v38, [v56, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v39, [v56, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v40, [v56, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v41, [v56, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v42, [v56, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v43, [v56, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v44, [v13, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v45, [v13, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v46, [v13, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v47, [v13, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v48, [v13, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v49, [v13, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v50, [v13, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v51, [v56, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v52, [v13, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v53, [v56, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v54, [v13, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  s_waitcnt vmcnt(25)
  v_max3_f32 v29, v9, 0, v29
  v_lshlrev_b32_e32 v9, 1, v6
  s_waitcnt vmcnt(18)
  v_max3_f32 v35, v35, 0, v36
  s_waitcnt vmcnt(11)
  v_max3_f32 v36, v42, 0, v43
  s_waitcnt vmcnt(5)
  v_max3_f32 v42, v48, 0, v49
  s_waitcnt vmcnt(3)
  v_max3_f32 v29, v29, v10, v51
  s_waitcnt vmcnt(2)
  v_max3_f32 v34, v35, v34, v52
  s_waitcnt vmcnt(1)
  v_max3_f32 v35, v36, v41, v53
  s_waitcnt vmcnt(0)
  v_max3_f32 v36, v42, v50, v54
  v_mad_u32_u24 v10, 0x90, v7, v55
  v_max3_f32 v17, v29, v17, v19
  v_max3_f32 v29, v34, v30, v31
  v_max3_f32 v30, v35, v37, v38
  v_max3_f32 v31, v36, v44, v45
  v_add_nc_u32_e32 v19, 2, v3
  v_max3_f32 v27, v17, v27, v28
  v_max3_f32 v28, v29, v32, v33
  v_max3_f32 v29, v30, v39, v40
  v_add_nc_u32_e32 v17, 3, v3
  v_max3_f32 v30, v31, v46, v47
  ds_store_2addr_b32 v10, v27, v29 offset1:1
  ds_store_2addr_b32 v10, v28, v30 offset0:18 offset1:19
```

Reorder while clustering enabled:
```
  s_clause 0x1f
  image_load v10, v[55:57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v9, [v55, v56, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v16, [v55, v56, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v17, [v13, v14, v57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v27, v[13:15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v28, [v13, v14, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v29, [v13, v56, v57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v30, [v13, v56, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v31, [v13, v56, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v32, [v13, v56, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v33, [v13, v56, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v34, [v13, v56, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v35, [v13, v56, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v36, [v13, v56, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v37, [v55, v14, v57], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v38, [v55, v14, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v39, [v55, v14, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v40, [v13, v14, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v41, [v55, v56, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v42, [v55, v56, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v43, [v55, v56, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v44, [v55, v14, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v45, [v13, v14, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v46, [v13, v14, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v47, [v55, v14, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v48, [v55, v14, v21], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v49, [v55, v56, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v50, [v55, v56, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v51, [v13, v14, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v52, [v13, v14, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v53, [v55, v14, v19], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v54, [v55, v14, v26], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  s_waitcnt vmcnt(29)
  v_max3_f32 v16, v16, 0, v9
  v_lshlrev_b32_e32 v9, 1, v6
  s_waitcnt vmcnt(26)
  v_max3_f32 v27, v28, 0, v27
  s_waitcnt vmcnt(23)
  v_max3_f32 v30, v31, 0, v30
  v_lshlrev_b32_e32 v31, 3, v6
  s_waitcnt vmcnt(22)
  s_delay_alu instid0(VALU_DEP_2)
  v_max3_f32 v29, v30, v32, v29
  s_waitcnt vmcnt(15)
  v_max3_f32 v28, v39, 0, v38
  s_waitcnt vmcnt(14)
  v_max3_f32 v17, v27, v40, v17
  s_waitcnt vmcnt(13)
  v_max3_f32 v16, v16, v41, v10
  v_mad_u32_u24 v10, 0x90, v7, v31
  s_waitcnt vmcnt(10)
  v_max3_f32 v27, v28, v44, v37
  v_max3_f32 v28, v29, v33, v34
  v_max3_f32 v16, v16, v42, v43
  s_waitcnt vmcnt(8)
  v_max3_f32 v29, v17, v45, v46
  v_add_nc_u32_e32 v17, 2, v3
  s_waitcnt vmcnt(6)
  v_max3_f32 v27, v27, v47, v48
  v_max3_f32 v28, v28, v35, v36
  s_waitcnt vmcnt(4)
  v_max3_f32 v30, v16, v49, v50
  s_waitcnt vmcnt(2)
  v_max3_f32 v29, v29, v51, v52
  v_add_nc_u32_e32 v16, 3, v3
  s_waitcnt vmcnt(0)
  v_max3_f32 v27, v27, v53, v54
  ds_store_2addr_b32 v10, v28, v29 offset1:1
  ds_store_2addr_b32 v10, v30, v27 offset0:18 offset1:19
```

This patch (reorder after clustering+RA):
```
  s_clause 0x1f
  image_load v9, [v56, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v29, [v56, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v35, [v13, v57, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v36, [v13, v57, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v42, [v56, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v43, [v56, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v49, [v13, v26, v15], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v48, [v13, v26, v18], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v10, [v56, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v51, [v56, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v52, [v13, v57, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v34, [v13, v57, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v41, [v56, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v53, [v56, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v50, [v13, v26, v20], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v54, [v13, v26, v14], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v19, [v56, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v17, v[56:58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v31, [v13, v57, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v30, [v13, v57, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v38, [v56, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v37, [v56, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v44, [v13, v26, v58], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v45, [v13, v26, v22], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v28, [v56, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v27, [v56, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v33, [v13, v57, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v32, [v13, v57, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v40, [v56, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v39, [v56, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v46, [v13, v26, v23], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  image_load v47, [v13, v26, v24], s[4:11] dmask:0x1 dim:SQ_RSRC_IMG_2D_MSAA unorm
  s_waitcnt vmcnt(30)
  v_max3_f32 v29, v9, 0, v29
  v_lshlrev_b32_e32 v9, 1, v6
  s_waitcnt vmcnt(28)
  v_max3_f32 v35, v35, 0, v36
  s_waitcnt vmcnt(26)
  v_max3_f32 v36, v42, 0, v43
  s_waitcnt vmcnt(24)
  v_max3_f32 v42, v48, 0, v49
  s_waitcnt vmcnt(22)
  v_max3_f32 v29, v29, v10, v51
  v_mad_u32_u24 v10, 0x90, v7, v55
  s_waitcnt vmcnt(20)
  v_max3_f32 v34, v35, v34, v52
  s_waitcnt vmcnt(18)
  v_max3_f32 v35, v36, v41, v53
  s_waitcnt vmcnt(16)
  v_max3_f32 v36, v42, v50, v54
  s_waitcnt vmcnt(14)
  v_max3_f32 v17, v29, v17, v19
  v_add_nc_u32_e32 v19, 2, v3
  s_waitcnt vmcnt(12)
  v_max3_f32 v29, v34, v30, v31
  s_waitcnt vmcnt(10)
  v_max3_f32 v30, v35, v37, v38
  s_waitcnt vmcnt(8)
  v_max3_f32 v31, v36, v44, v45
  s_waitcnt vmcnt(6)
  v_max3_f32 v27, v17, v27, v28
  v_add_nc_u32_e32 v17, 3, v3
  s_waitcnt vmcnt(4)
  v_max3_f32 v28, v29, v32, v33
  s_waitcnt vmcnt(2)
  v_max3_f32 v29, v30, v39, v40
  s_waitcnt vmcnt(0)
  v_max3_f32 v30, v31, v46, v47
  ds_store_2addr_b32 v10, v27, v29 offset1:1
  ds_store_2addr_b32 v10, v28, v30 offset0:18 offset1:19
```
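
The `s_waitcnt vmcnt` values in the listings above follow from the load order. Here is a minimal sketch of that relationship, assuming the loads in the clause complete in issue order, so that `vmcnt(N)` means "at most N loads still outstanding". The function and variable names are illustrative and not part of LLVM or this patch; registers are written as plain integers (`9` for `v9`).

```python
def required_vmcnt(load_order, uses):
    """Return the vmcnt needed before each consumer.

    load_order: destination registers of the loads, in issue order.
    uses: for each consumer, the loaded registers it reads.
    Assumes loads complete in the order they were issued.
    """
    position = {reg: i for i, reg in enumerate(load_order)}
    total = len(load_order)
    counts = []
    for srcs in uses:
        last = max(position[reg] for reg in srcs)
        # Every load issued after `last` may still be outstanding.
        counts.append(total - 1 - last)
    return counts


# Load destinations from the "Default output" listing.
default_order = [17, 19, 27, 28, 10, 9, 29] + list(range(30, 55))

# Load destinations from the "This patch" listing.
patch_order = [9, 29, 35, 36, 42, 43, 49, 48,
               10, 51, 52, 34, 41, 53, 50, 54,
               19, 17, 31, 30, 38, 37, 44, 45,
               28, 27, 33, 32, 40, 39, 46, 47]

# Loaded operands of the first four v_max3_f32 instructions.
first_uses = [(9, 29), (35, 36), (42, 43), (48, 49)]

print(required_vmcnt(default_order, first_uses))  # [25, 18, 11, 5]
print(required_vmcnt(patch_order, first_uses))    # [30, 28, 26, 24]
```

A higher count means the consumer can start while more loads are still in flight. With the patch, the first four `v_max3_f32` instructions wait on only the first 2, 4, 6, and 8 loads rather than the first 7, 14, 21, and 27.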



https://github.com/llvm/llvm-project/pull/102595

