[llvm-dev] [cfe-dev] Portable multiplication 64 x 64 -> 128 for int128 reimplementation

Roman Lebedev via llvm-dev llvm-dev at lists.llvm.org
Sun Dec 30 14:00:02 PST 2018


On Mon, Dec 31, 2018 at 12:46 AM Paweł Bylica via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>
> Hi Arthur, Craig,
>
> Thanks for you comments about GCC/Clang intrinsics. I never considered using them, but they might be better alternative to inline assembly.
> Is there a one for regular MUL?
>
> Anyway, I want to go the opposite direction. If I can I relay on compiler's optimizations. If I want to use MULX in Clang I do it like that:
>
> unsigned long mulx(unsigned long x, unsigned long y, unsigned long* hi)
> {
> auto p = (unsigned __int128){x} * y;
> *hi = static_cast<unsigned long>(p >> 64);
> return static_cast<unsigned long>(p);
> }
>
> https://godbolt.org/z/PbgFb9
>
> If compiled with -mbmi2 -mtune=generic it just uses MULX instruction.
>
> mulx(unsigned long, unsigned long, unsigned long*):
> mov rcx, rdx
> mov rdx, rsi
> mulx rdx, rax, rdi
> mov qword ptr [rcx], rdx
> ret
>
> What I want to do it move it further - rewrite the above mulx() helper without using __int128 type in a way that a compiler would recognize that it should use MUL/MULX instruction.
>
> A possible implementation looks like
>
>
> uint64_t mul_full_64_generic(uint64_t x, uint64_t y, uint64_t* hi)
> {
> uint64_t xl = x & 0xffffffff;
> uint64_t xh = x >> 32;
> uint64_t yl = y & 0xffffffff;
> uint64_t yh = y >> 32;
>
> uint64_t t = xl * yl;
> uint64_t l = t & 0xffffffff;
> uint64_t h = t >> 32;
>
> t = xh * yl;
> t += h;
> h = t >> 32;
>
> t = xl * yh + (t & 0xffffffff);
> l |= t << 32;
> *hi = xh * yh + h + (t >> 32);
> return l;
> }
>
> As expected, Clang is not able to match this pattern currently.
>
> If we want to implement this optimization in Clang, there are some questions I have:
> 3. Can this pattern be split into smaller ones? E.g. UMULH.
The transform should be done on LLVM IR.

> 2. What pass this optimization should be added to?
Compare:
https://godbolt.org/z/Blx1tp
basically, you need to match&transform the latter LLVM IR into the former.

Where?
I'm not sure there is a dedicated place for such transforms.
I would guess AggressiveInstCombine, since that pattern is somewhat large.

> 1. Can we prove this pattern is equivalent of MUL 64x64 -> 128?
You can, by feeding both of these IR's into https://github.com/nunoplopes/alive

> Paweł
Roman.

> On Sun, Dec 30, 2018 at 2:34 AM Craig Topper <craig.topper at gmail.com> wrote:
>>
>> _mulx_u64 only exists when the target is x86_64. That's still not very portable. I'm not opposed to removing the bmi2 check, but gcc also has the same check so it doesn't improve portability much.
>>
>> ~Craig
>>
>>
>> On Sat, Dec 29, 2018 at 4:44 PM Arthur O'Dwyer via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hi Pawel,
>>>
>>> There is the _mulx_u64 intrinsic, but it currently requires the hardware flag "-mbmi2".
>>> https://github.com/Quuxplusone/WideIntProofOfConcept/blob/master/wider.h#L89-L99
>>>
>>> On Clang 3.8.1 and earlier, the _addcarry_u64 and _subborrow_u64 intrinsics required the hardware flag `-madx`, even though they didn't use the hardware ADX/ADOX instructions. Modern GCC and Clang permit the use of these intrinsics (to generate ADC) even in the absence of `-madx`.
>>>
>>> I think it would be a very good idea for Clang to support _mulx_u64 (to generate MUL) even in the absence of `-mbmi2`.
>>>
>>> –Arthur
>>>
>>>
>>> On Sat, Dec 29, 2018 at 6:03 PM Paweł Bylica via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> For some maybe dumb reasons I try to write a portable version of int128.
>>>>
>>>> What is very valuable for this implementation is access to MUL instruction on x86 which provides full 64 x 64 -> 128 bit multiplication. An equally useful on ARM would be UMULH instruction.
>>>>
>>>> Well, the way you can access this on clang / GCC is to use __int128 type or use inline assembly. MSVC provides an intrinsic for this instruction. This defeats the idea of portable int128 reimplementation and makes constexpr implementation of multiplication at least inconvenient.
>>>>
>>>> Maybe there is a hope for me in LLVM. Is there any pattern matcher that is producing MUL instruction of bigger type?
>>>> If not, would it be good idea to teach LLVM about it?
>>>>
>>>> Bests,
>>>> Paweł
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


More information about the llvm-dev mailing list