[PATCH] Implement function type checker for the undefined behavior sanitizer.

Fri Oct 18 16:09:55 PDT 2013

  > It would be polite to issue a diagnostic if someone explicitly asks for this (not via -fsanitize=undefined) and we don't support it for their target.

  Hmm, we should probably be doing this for other sanitizers too (e.g. MSan and DFSan are only supported on 64-bit Linux).


================
Comment at: lib/CodeGen/CGExpr.cpp:3128
@@ +3127,3 @@
+  if (getLangOpts().CPlusPlus && SanOpts->Function &&
+      !dyn_cast_or_null<FunctionDecl>(TargetDecl)) {
+    if (llvm::Constant* PDSig =
----------------
Richard Smith wrote:
> This is a bit hard to parse. `(!TargetDecl || !isa<FunctionDecl>(TargetDecl))` maybe?
Done.

================
Comment at: lib/CodeGen/CGExpr.cpp:3129
@@ +3128,3 @@
+      !dyn_cast_or_null<FunctionDecl>(TargetDecl)) {
+    if (llvm::Constant* PDSig =
+            CGM.getTargetCodeGenInfo().getUBSanFunctionSignature(CGM)) {
----------------
Richard Smith wrote:
> `*` on the right.
Done.

================
Comment at: lib/CodeGen/CGExpr.cpp:3133-3134
@@ +3132,4 @@
+          CGM.GetAddrOfRTTIDescriptor(QualType(FnType, 0), /*ForEH=*/true);
+      llvm::Type *PDStructTyElems[] = {PDSig->getType(),
+                                       FTRTTIConst->getType()};
+      llvm::StructType *PDStructTy = llvm::StructType::get(
----------------
Richard Smith wrote:
> Richard Smith wrote:
> > We usually format this as
> > 
> >     llvm::Type *PDStructTyElems[] = {
> >       PDSig->getType(),
> >       FTRTTIConst->getType()
> >     };
> > 
> > or similar.
> Some indication of what "PD" stands for here would be useful.
Done.  clang-format prefers something closer to the formatting I originally used; I'll raise this with the clang-format folks.

================
Comment at: lib/CodeGen/CGExpr.cpp:3138-3142
@@ +3137,7 @@
+
+      llvm::Value *CalleePDStruct = Builder.CreateBitCast(
+          Callee, llvm::PointerType::getUnqual(PDStructTy));
+      llvm::Value *CalleeSigPtr =
+          Builder.CreateConstGEP2_32(CalleePDStruct, 0, 0);
+      llvm::Value *CalleeSig = Builder.CreateLoad(CalleeSigPtr);
+      llvm::Value *CalleeSigMatch = Builder.CreateICmpEQ(CalleeSig, PDSig);
----------------
Richard Smith wrote:
> What happens if the callee address is at the end of a page and doesn't have the extra data? Could this load trigger a segfault?
It could in principle.  In practice, Clang always aligns function bodies to 16 bytes, GCC does it at -O>=2 and GCC (4.4 and 4.6) codegens an empty function body at -O0 with >4 bytes.  A quick survey of a number of system libraries (libc, libm, libstdc++) on an Ubuntu machine reveals that the functions are suitably aligned.  I think the circumstances in which a segfault might occur are sufficiently rare as to justify doing a single load.  If it does turn out to be a problem in practice we could consider splitting the load (perhaps conditional on a command line flag).

================
Comment at: lib/CodeGen/CGExpr.cpp:3155-3156
@@ +3154,4 @@
+          Builder.CreateICmpEQ(CalleeRTTI, FTRTTIConst);
+      llvm::Constant *StaticData[] = {EmitCheckSourceLocation(CallLoc),
+                                      EmitCheckTypeDescriptor(CalleeType)};
+      EmitCheck(CalleeRTTIMatch,
----------------
Richard Smith wrote:
> Likewise.
Done.

================
Comment at: lib/CodeGen/CodeGenFunction.cpp:523-524
@@ -521,1 +522,4 @@
 
+  // If we are checking function types, emit a function type signature as
+  // prefix data.
+  if (getLangOpts().CPlusPlus && SanOpts->Function) {
----------------
Richard Smith wrote:
> It'd be nice if we could only do this for address-taken functions. When we take the address of a function, we could emit `linkonce_odr` thunk with the extra data, and use that instead of the original address... except that would break address-of-function comparisons between instrumented and uninstrumented code. Can we provide an option to enable that behavior for the case where we're OK with an ABI change?
I like the idea of doing this as an option, but I would prefer to land this version first.

================
Comment at: lib/CodeGen/CodeGenFunction.cpp:529-530
@@ +528,4 @@
+              CGM.getTargetCodeGenInfo().getUBSanFunctionSignature(CGM)) {
+        llvm::Constant *FTRTTIConst =
+            CGM.GetAddrOfRTTIDescriptor(FD->getType(), /*ForEH=*/true);
+        llvm::Constant *PDStructElems[] = {PDSig, FTRTTIConst};
----------------
Richard Smith wrote:
> Does this include the calling convention attributes?
Not with the Itanium ABI, which does not have a representation for calling conventions, but the Microsoft ABI (for which calling conventions matter more) does, so once RTTI has been implemented for that ABI, this should respect calling conventions there.

================
Comment at: lib/CodeGen/CodeGenFunction.cpp:531
@@ +530,3 @@
+            CGM.GetAddrOfRTTIDescriptor(FD->getType(), /*ForEH=*/true);
+        llvm::Constant *PDStructElems[] = {PDSig, FTRTTIConst};
+        llvm::Constant *PDStructConst =
----------------
Richard Smith wrote:
> Spaces around braces.
Done.

================
Comment at: lib/CodeGen/CGExpr.cpp:3133
@@ +3132,3 @@
+          CGM.GetAddrOfRTTIDescriptor(QualType(FnType, 0), /*ForEH=*/true);
+      llvm::Type *PDStructTyElems[] = {PDSig->getType(),
+                                       FTRTTIConst->getType()};
----------------
Peter Collingbourne wrote:
> Richard Smith wrote:
> > Richard Smith wrote:
> > > We usually format this as
> > > 
> > >     llvm::Type *PDStructTyElems[] = {
> > >       PDSig->getType(),
> > >       FTRTTIConst->getType()
> > >     };
> > > 
> > > or similar.
> > Some indication of what "PD" stands for here would be useful.
> Done.  clang-format prefers something closer to the formatting I originally used; I'll raise this with the clang-format folks.
Globally replaced 'PD' with 'Prefix'.

================
Comment at: lib/CodeGen/TargetInfo.cpp:607-608
@@ +606,4 @@
+                   (0x06 << 8) |  //           .+0x08
+                   ('F' << 16) |
+                   ('T' << 24);
+    return llvm::ConstantInt::get(CGM.Int32Ty, Sig);
----------------
Richard Smith wrote:
> What do 'F' and 'T' demangle as? An invalid instruction encoding would give me more confidence here.
"rex.RX push %rsp", according to objdump.  But I don't think it really matters what this decodes to.  If the instruction pointer finds itself here the situation isn't dissimilar to jumping into the middle of a multibyte instruction.  (Besides, the RTTI pointer could decode to anything.)


http://llvm-reviews.chandlerc.com/D1338