<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/63731>63731</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            optimizing division-by-constant on AArch32(ARM32)

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          rsaxvc

      </td>

    </tr>

</table>

<pre>

    I found both GCC12 and Clang11 call __aeabi_uldivmod when dividing a uint64_t by a constant on AArch32, and I found it's possible to emulate the AArch64 umulh instruction which enables division by a constant optimization even on AArch32. [I wrote this up here](https://rsaxvc.net/blog/2022/7/21/Optimized_division_of_uint64_t_by_a_constant_on_32-bit_microcontrollers.html) and will put the relevant bits below, there's [a GCC bugzilla suggestion](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484) as well. This optimization applies only to Arm cores with umull instruction, which is many of them but not present on the smallest microcontrollers and some older cores.

C code for dividing uint64_t by a constant

```

uint64_t div3(uint64_t x)

{

    return x/3;

}

```

Clang11( -O3 -mcpu=cortex-m4):

```

div3(unsigned long long):

        push {r7, lr}

        movs    r2, #3

        movs    r3, #0

        bl __aeabi_uldivmod

        pop     {r7, pc}

```

However, Clang16 for AArch64 implements division by 3 using multiplication by the inverse as:

```

div3(unsigned long):                               // @div3(unsigned long)

        mov     x8, #-6148914691236517206

 movk    x8, #43691

        umulh   x8, x0, x8

        lsr     x0, x8, #1

        ret

```

While AArch32 instruction set like Cortex-M4 does not have the umulth instruction, it does have umull(32bit x 32bit multiply with 64-bit result). Using that, we can emulate umulh with the following function:

```

uint64_t umulh( uint64_t a, uint64_t b )

{

const uint32_t a_lo = a;

const uint32_t a_hi = a >> 32;

const uint32_t b_lo = b;

const uint32_t b_hi = b >> 32;

/* FOIL method of multiplication

See https://en.wikipedia.org/wiki/FOIL_method,

but instead of binomials with constants a,b,c,d variables x,y: (ax+b) * (cy + d),

consider it with variables a,b,c,d, constants x,y = 1<<32 

When applicable each is an ARM UMULL instruction, might do better with UMLAL*/

const uint64_t acc0 = (uint64_t)a_lo * b_lo;

const uint64_t acc1 = (uint64_t)a_hi * b_lo;

const uint64_t acc2 = (uint64_t)a_lo * b_hi;

const uint64_t acc3 = (uint64_t)a_hi * b_hi;

/* Break up into 32-bit values */

const uint32_t lo0 = acc0;

const uint32_t hi0 = acc0 >> 32;

const uint32_t lo1 = acc1;

const uint32_t hi1 = acc1 >> 32;

const uint32_t lo2 = acc2;

const uint32_t hi2 = acc2 >> 32;

const uint32_t lo3 = acc3;

const uint32_t hi3 = acc3 >> 32;

/* The first 32 bits aren't used in the result,

no need to store them, so no need to compute them.

In fact, don't even worry about res0, start computing res1*/

uint64_t acc = 0;

const uint32_t res0 = lo0;

acc += hi0;

acc += lo1;

acc += lo2;

const uint32_t res1 = acc;

acc >>= 32;

acc += hi1;

acc += hi2;

acc += lo3;

const uint32_t res2 = (uint32_t)acc;

acc >>= 32;

acc += hi3;

const uint32_t res3 = (uint32_t)acc;

return (((uint64_t)res3) << 32 ) + res2;

}

```

I will refer to this as umulh emulation, and in testing it results in a speedup, between 2x and 37x, depending on various factors, on Cortex-M4.

Firstly, the execution time of umulh() is a constant on Cortex-M4, but __aeabi_uldivmod() execution time depends on the numerator.

If the core and compiler-support library implement __aeabi_uldivmod() using the udiv instruction, the speedup for emulating umulh is relatively small, only around 2-4x on Cortex-M4. But if not, the umulh approach can divide numbers at about 28-37x the rate of __aeabi_uldivmod(). Roughly we can group cores as follows:

* Cores where umulh emulation performs poorly due to lack of umull instruction: old Arm before ARM7TDMI, Cortex M0/0+/1/23. On these we should use __aeabi_uldivmod()

* Cores where umulh emulation performs an order of magnitude faster due to presence of umull instruction but lack of udiv(or __aeabi_uldivmod does not take advantage of udiv): [ARM7TDMI](https://developer.arm.com/documentation/ddi0210/c/Instruction-Cycle-Timings/Multiply-and-multiply-accumulate), ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9.

* Cores where umulh emulation is only a little faster: Cortex-A7, Cortex-A15, 64-bit Arm cores.

If this is of interest, it might be worth putting umulh into the compiler support library and using it to divide uint64_t by constants.

</pre>

<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJycWMlu4zwSfhrmUrAhUY6XQw6283smQAcN9HRjjgYllS1OKFIgKS__0w-KlLzFTvdMEFi2WBtr-VhF4ZzcasQX9rxgz69PovWVsS_WicOueMpNeXx5g41pdQm58RX8Y7lMOQhdwlIJvU1TKIRSsF4LFLlct6qUu9qUsK9QQyl3spR6CwJaqf14tPaQH0FAYbTzQnswGuZzW1QZZ3wZxPbapGd84qAxzslcIXgDWLdKeARfYeQaj6CtW1WB1M7btvDSaNhXsqgAtcgVumCCo9c3ehsva_m3CBy4Q31hyRDY8-IN9tYEXdJB20CFFtnzK-PTyvvGsWzO-IrxVXTUUKNnfJUrs2V8xRPOGV9N6GvK-Op7VIblurdmbTbr3iPr_LgW696ytdHrjA9y6de1LKwpjPbWKIXWDStfK8ZnwU17qRQ0rQ_OsKhwR9vKpXeQozJ7cqcPRpMX2fNCUOggb7d_S6UEuHa7RUfbv7erbVEMt7odGkv76ZkYX7nK7Nd5ux0WW8mylSxZ9pom49F0FAxzsEelhvCTvHblY9E0SqIDo9WRYjm3NRTGooO99FUIo7oMI9kfIykd1EIfwWxoQzXkrQdtPDQWHcYUIh-4WiiFzsOt24K7nKkRjCrRRq1DlryyZB4_l1CYEmFj7Dlj7-frJRcbJ91_-HliKOUuY3x6-n1gfNZxTBbxCwCARd9aTaurjGWLnuL1rvDOzlhxjE9h8D2DQV00LcteC2M9HgY1hYAi-NjG3jQdar4EZfQ2fFxyQvfXtK4CNlnYCcVC2ZNtPUFtdi7sJNQu41n2YD3r1pPr9fwzbNxYYJrwPBnRFF876J9mjzu0RBp9NQ5B7bFC1o3CGrW_hoUMWkchr1vlZaNkETM2P4a8knqH1iEId_bt770aHQpf_8VaAzZKHom49Wd4HqadPwfjdDSdpaPxLOXZ-Dmd8GTcsdRm93FFO8rGs_RaXoTOnuaQhM_pNY1yNursV6O0G0kW_RdR-XclFfbgegXVDj0o-YGwjCn8PoLSoAvlXYldRHoy01e32CB9JA1kAT0Yn2Y8lx4OEJ9dPI8RYcYjAlWw6FrlGZ8N4VeIuq-ED2CDUAh9OmOicwInGbExSpk90W9aHY34stJO1R_kUMWe3gjSdoYX-AwPAW0CScaJYa0MsOwVxAknPlFUMlIAy_5i2V-Q8Ue0eS8tf0zRScvvSOs-KXPnsPr-9g1q9JUpCZ6vCyhS_gsRrs8W1MO9_JANllJ0Bwz9ZnxF4tZRHOPLyE9oT6FHEVTkUptaCtUdGz0uu-DVnPFlwfiyhJ2wMp7_B8aXRypFxqfiwPgip4OKjGd8WhyB8QWUFIJeIYmUdE5IH3WcZV3poCie1Qc1wWkpy5YsW2Yc-uzH7vQrSAqgiGea0DD_8Q6_3n99-3ab3LXcVpTfkKP3aKMdv96_zb8xHrx4E7aYV0WRBAsuDh_GZzF7-DwE_k7Ie970Li9lwu95-dd6K_kFb_a13gveq9RbWBQf1JhJ7Q3Ejgl2QrXo4L6TQm4rE31EznqU_5U80_y-npRJe-r0scQzzZ9I5D31Q5pKnmn-RGLWU2ePJZ5pflf3PwkRpXUeMh57TmFRMz7x0DosQequLY1Y21WWNqARS-r-nDc2oHtN6e4MXKwVpm7a2OXXXZv2pmEjigDTpYl6Qs--N9YeQeSmDcAezijnhfWdEMJriy69TIjL3AsbfpgHJDFQKJPc-CHw8gUtVvK8ePFamfT-64dBIkP7CNxTF0JC6xdRubLjrsJK3qVWJrvR8dmaq6qml1SZ_6dtD9POosv-SE_XLzM-jf8XcEEyAqoH5KWcjBi_CLv4w-b6LY5UFjdoKQ3D4Cdc1wjEtqCDZxooKMVpftJbOHUVjt4KcA1i2TZEmaPfI2rgh8CUTQ4hh7FBHaYMo8PxYloXEtxYR-tGnxuiq0FlRTWnjt1sB3jAog2NlJc03mxO3Qbtn8y_GrRPMoNlrf_cfUfGG7HRWtfPWbqt0Qpv7JVlb2E4C7NV2CnVn1RoB65tGmOpzcutsMdzD_5Ie9t1ZQhtKXe3J2OY9KJ_Q2vfxYUGtngR4GgYFl7uUB3jTBg9qo4gbLhY4IPR4drFsKAmY0NtZ68jShNNYw2d19QZhtEw7D8PU6XvgIdPB9nkEAGPOkezebC1Ifww7baihjT2mltr2qabgoXrekx321jyOZlKgzKN87f5CA3ajbG1g8YYq45QtuGuRInio8-I68E6m9MgHAbwHDcUsPmP98nP1_e3MDcFr8B7wvgqofLlq5TxFc-G8D0kgEMy31WmVSWh_YPN_o_GCw3GUtdFbaTYaulbmsiFo_an21Ic-Au8u62Q0KdNl3LH-NTYz_dSp_nCiw8EUe6E9mKLZ64wt7Hnxckpd65HStyhMg3aobD1sDA1vTNFS3ndocSqLGXCU3Jjwfjq7WzpYHksFA5-ylrqrWN89d7NKQOhy0F9-lEUbZxFYndKUeqfadJ_Sc8xG8ynPTj1L2bDPwuD7O5lBCjpveodT57oRU0uFaXP9KubqE43OZ8BQbogekNNGlp0vpvcYn-bIx3gvoKm9ZclrAP64glD4BZDaIsRJqSnxOgq8_LK5tSZD5_Kl6ycZTPxhC_peDqd8CRJxk_Vy4RnxSzfPItyjOkGkzIRSZlOJpnAzXgqiif5whOeJZNkkkyfMz4e8s1s9pzPiuJ5xFOeCjZKsBZSDZXa1TTKPEnnWnwZZ5MsfVIiR-X6e1X7QkSDvN06NkqUdN6d2bz0Cl-6-zLaV389MciPg7t3pdP5j3d6zp5aq15uLu-kr9q8S0vS0T0GjTX_QWqjVsFOyr1g6n8DAAD__5FD0EI">