<div dir="ltr">Hi,<br><br>I'm working on an optimization to improve the LoopIdiomRecognize pass. For a trivial loop like this:<br><br>```<br>struct S {<br>  int a;<br>  int b;<br>  char c;<br>  // 3 bytes padding<br>};<br><br>unsigned copy_noalias(S* __restrict__ a, S* b, int n) {<br>  for (int i = 0; i < n; i++) {<br>    a[i] = b[i];<br>  }<br>  return sizeof(a[0]);<br>}<br>```<br><br>Clang generates the loop below (some parts of the IR omitted):<br>```<br>%struct.S = type { i32, i32, i8 }<br><br>for.body:                                         ; preds = %for.cond<br>  %2 = load %struct.S*, %struct.S** %b.addr, align 8<br>  %3 = load i32, i32* %i, align 4<br>  %idxprom = sext i32 %3 to i64<br>  %arrayidx = getelementptr inbounds %struct.S, %struct.S* %2, i64 %idxprom<br>  %4 = load %struct.S*, %struct.S** %a.addr, align 8<br>  %5 = load i32, i32* %i, align 4<br>  %idxprom1 = sext i32 %5 to i64<br>  %arrayidx2 = getelementptr inbounds %struct.S, %struct.S* %4, i64 %idxprom1<br>  %6 = bitcast %struct.S* %arrayidx2 to i8*<br>  %7 = bitcast %struct.S* %arrayidx to i8*<br>  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %6, i8* align 4 %7, i64 12, i1 false)<br>  br label %for.inc<br>```<br><br>It can be transformed into a single memcpy:<br><br>```<br>for.body.preheader:                               ; preds = %entry<br>  %b10 = bitcast %struct.S* %b to i8*<br>  %a9 = bitcast %struct.S* %a to i8*<br>  %0 = zext i32 %n to i64<br>  %1 = mul nuw nsw i64 %0, 12<br>  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %a9, i8* align 4 %b10, i64 %1, i1 false)<br>  br label %for.cond.cleanup<br>```<br><br>The problem is that this doesn't work if the copied elements are instances of a class. 
For a<br>class with the same members:<br>```<br>%class.C = type <{ i32, i32, i8, [3 x i8] }><br>```<br><br>Clang emits a memcpy of only nine bytes, omitting the<br>tail padding:<br><br>```<br>call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %6, i8* align 4 %7, i64 9, i1 false)<br>```<br><br>Then, in LLVM, we find that the memcpy does not touch every byte of the array, so<br>we abort the transformation.<br><br>If we could tell that the three untouched bytes are padding, we should still be<br>able to do the optimization, but LLVM doesn't seem to have this information. I<br>tried `DataLayout::getTypeStoreSize()`, and it returned 12 bytes. I also<br>tried `StructLayout`, and it treats the tail padding as a regular class member.<br><br>Is there an API in LLVM to tell whether a class has tail padding? If not, would it<br>be useful to add this feature?<br><br>Thanks,<br>Han<br></div>