r215614 - CodeGen: When bitfields fall on natural boundaries, split them up

Wed Aug 13 21:31:54 PDT 2014

Chandler Carruth <chandlerc at google.com> writes:
>     CodeGen: When bitfields fall on natural boundaries, split them up
>
> Please revert. We used to do this, and I worked very hard to specifically
> design the system to work the way it does now. There are copious comments,
> email threads, and discussions where we debated how to handle this and decided
> to do so in the exact way.
>
> If you want to re-open the discussion, cool. But please do so as a discussion,
> and cite specifically why the rationale for using fully-wide loads and stores
> is inapplicable or causes insurmountable problems.

Hm, I hadn't really realized the history on this. I guess you're talking
about r169489, "Rework the bitfield access IR generation". Looks to me
like we really did a lot more than just break bitfields up on natural
boundaries.

Note that my change here doesn't change the way we lay the bitfields out
*at all*, and that all of the tests added in r169489 pass unchanged with
it. I can understand the argument against extra complexity though: I'll
get back to that in a minute.

> On Wed, Aug 13, 2014 at 7:42 PM, Justin Bogner <mail at justinbogner.com> wrote:
>
>     This leads to bad code being generated in most of our 32 bit backends.
>
> Also, cite your source. I tested this on ARM and x86 when I changed bitfields
> to work the way they did before your patch and we generated very good code.

I was looking into some poor code generation on something like this:

    typedef struct
    {
      unsigned int lsb32   : 32;
      unsigned int msb7    : 7;
      unsigned int padding : 25;
    } P;

    void foo(P *p) {
      p->lsb32 = 0x0000;
    }

On armv7, this was giving pretty poor code. At -O0:

        push    {r4, lr}
        sub     sp, sp, #8
        mov     r1, r0
        str     r0, [sp, #4]
        ldrb    r2, [r0, #5]
        mov     r3, r0
        ldrb    r12, [r3, #4]!
        ldrb    lr, [r3, #2]
        ldrb    r4, [r3, #3]
        strb    r4, [r3, #3]
        strb    lr, [r3, #2]
        strb    r2, [r0, #5]
        mov     r2, #0
        strb    r2, [r0, #3]
        strb    r2, [r0, #2]
        strb    r2, [r0, #1]
        strb    r2, [r0]
        strb    r12, [r3]
        str     r1, [sp]                @ 4-byte Spill
        add     sp, sp, #8
        pop     {r4, pc}

and at -O3:

        push    {lr}
        mov     r2, r0
        ldrb    lr, [r0, #5]
        ldrb    r12, [r2, #4]!
        ldrb    r1, [r2, #3]
        ldrb    r3, [r2, #2]
        strb    r1, [r2, #3]
        mov     r1, #0
        strb    r3, [r2, #2]
        strb    lr, [r0, #5]
        strb    r1, [r0, #3]
        strb    r1, [r0, #2]
        strb    r1, [r0, #1]
        strb    r1, [r0]
        strb    r12, [r2]
        pop     {lr}
        bx      lr

Where with the change I've made here, we have -O0:

        sub     sp, sp, #8
        mov     r1, r0
        str     r0, [sp, #4]
        mov     r2, #0
        strb    r2, [r0, #3]
        strb    r2, [r0, #2]
        strb    r2, [r0, #1]
        strb    r2, [r0]
        str     r1, [sp]                @ 4-byte Spill
        add     sp, sp, #8
        bx      lr

and -O3:

        mov     r1, #0
        strb    r1, [r0, #3]
        strb    r1, [r0, #2]
        strb    r1, [r0, #1]
        strb    r1, [r0]
        bx      lr

The code that's being generated is pretty awful, and it just gets worse
if you add more to the bitfield.

I'd briefly looked at i386 as well where I saw some extra instructions
at -O0 and thought things were the same. Looks like the -O3 code is
identical there though, so my claim about most of our 32 bit backends
was a mistake. Sorry about that.

So I guess there are two ways to handle this. The change I've made,
which adds a small amount of complexity to the frontend in order to
avoid generating IR that does a lot of redundant work, or teaching the
arm backend that a read-modify-write that masks out most of the bits
should be optimized to a narrower store. I haven't looked into how hard
that option is.

Do you think the backend approach will be simpler (or perhaps valuable
enough in other situations that it's worth its cost)? If so, I'll revert
this and try that approach out.