[PATCH] D28637: [PPC] Inline expansion of memcmp

Thu Mar 9 12:25:04 PST 2017

On 3/9/2017 11:06 AM, Zaara Syeda via Phabricator wrote:
> syzaara added inline comments.
>
>
> ================
> Comment at: lib/CodeGen/CodeGenPrepare.cpp:2233
> +    Builder.Insert(NewBr);
> +    return;
> +  }
> ----------------
> efriedma wrote:
>> syzaara wrote:
>>> efriedma wrote:
>>>> Hmm... in isOnlyUsedInZeroEqualityComparison mode, it probably makes sense to perform more loads per block, to reduce the number of branches.
>>> I'm not sure what you mean here. We have the branches for doing an early exit so we don't need to continue loading more bytes when we find a difference.
>> A couple loads hitting the cache are cheaper than a branch in cases where the branch isn't predictable... but maybe not a big deal either way.
> I'm still not sure how this reduces the number of branches. We would still need to compare each of the loads to see which one had the difference.

Branchless 16-byte memcmp isOnlyUsedInZeroEqualityComparison:

_Bool memcmp16(__uint128_t *Src1, __uint128_t *Src2) {
   return *Src1 == *Src2;
}

Resulting code on x86-64:

         movq    (%rdi), %rax
         movq    8(%rdi), %rcx
         xorq    8(%rsi), %rcx
         xorq    (%rsi), %rax
         orq     %rcx, %rax
         sete   %al
         retq

> ================
> Comment at: lib/CodeGen/CodeGenPrepare.cpp:2269
> +  Value *Subtract = Builder.CreateSub(And1, And2);
> +  Value *Res = Builder.CreateSExtOrTrunc(Subtract, Builder.getInt32Ty());
> +
> ----------------
> efriedma wrote:
>> Would `bswap(Src1) > bswap(Src2) ? 1 : -1` be shorter?
> On power, having a load followed by a bswap will get converted to a load byte reversed hardware instruction. But I'm not sure if other targets will have a good way to handle bswap within gprs. It could introduce a longer sequence. Also this way, we are adding more branches with the comparison.

Just worry about which lowering is best one target at a time; we can 
always add a target hook later if we need a different lowering for other 
targets.  And the "? 1 : -1" should get lowered to some branchless sequence.

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project