[LLVMdev] [PATCH / PROPOSAL] bitcode encoding that is ~15% smaller for large bitcode files...

Wed Sep 26 13:44:06 PDT 2012

Thanks for the review Duncan!

Yes, on second thought, it would be simpler (less bookkeeping)
if this was toggled at the module level (but I haven't looked into
how that would be done).  I had done it at the function level
to begin with because it affected only the function bodies,
so it would be more obvious what changed.  Anyone have a
stronger feeling (or a reason!) why it would be better to do this at
a per-function vs per-module basis?

I'll look into how to do it at the module level and send out a new
patch later + with your comments addressed.

On Wed, Sep 26, 2012 at 1:31 AM, Duncan Sands <baldrick at free.fr> wrote:

> Hi Jan,
>
>
>  I've been looking into how to make llvm bitcode files smaller.  There is
>> one
>> simple change that appears to shrink linked bitcode files by about 15%.
>>  See
>> this spreadsheet for some rough data:
>>
>> https://docs.google.com/**spreadsheet/ccc?key=**0AjRrJHQc4_**
>> bddEtJdjdIek5fMDdIdFFIZldZXzdW**a0E<https://docs.google.com/spreadsheet/ccc?key=0AjRrJHQc4_bddEtJdjdIek5fMDdIdFFIZldZXzdWa0E>
>>
>
> the improvement is wonderful!
>
> ...
>
>
>  In any case, the patch is attached if there is interest...  If you want
>> to try
>> out the patch, you can toggle between the new and old encoding by using
>> the flag
>> "-enable-old-style-functions" vs no flags, with llvm-as.
>>
>
> I don't know anything much about the bitcode representation, but anyway:
>
>  diff --git a/include/llvm/Bitcode/**BitstreamReader.h
>> b/include/llvm/Bitcode/**BitstreamReader.h
>> index 3daef78..840f57e 100644
>> --- a/include/llvm/Bitcode/**BitstreamReader.h
>> +++ b/include/llvm/Bitcode/**BitstreamReader.h
>> @@ -409,7 +409,7 @@ public:
>>    }
>>
>>    /// EnterSubBlock - Having read the ENTER_SUBBLOCK abbrevid, enter
>> -  /// the block, and return true if the block is valid.
>> +  /// the block, and return true if the block has an error.
>>    bool EnterSubBlock(unsigned BlockID, unsigned *NumWordsP = 0) {
>>      // Save the current block's state on BlockScope.
>>      BlockScope.push_back(Block(**CurCodeSize));
>> diff --git a/include/llvm/Bitcode/**LLVMBitCodes.h
>> b/include/llvm/Bitcode/**LLVMBitCodes.h
>> index c1dc190..6c81739 100644
>> --- a/include/llvm/Bitcode/**LLVMBitCodes.h
>> +++ b/include/llvm/Bitcode/**LLVMBitCodes.h
>> @@ -33,9 +33,9 @@ namespace bitc {
>>      UNUSED_ID1,
>>
>>      CONSTANTS_BLOCK_ID,
>> -    FUNCTION_BLOCK_ID,
>> +    FUNCTION_BLOCK_ID,  // Deprecated in favor of FUNCTION_BLOCK_ID_REL
>>
>
> FUNCTION_BLOCK_ID_REL -> FUNCTION_BLOCK_REL_ID
>
> Is there any point in having this on a per-function basis, why not have it
> be
> per-module?
>
>
>> -    UNUSED_ID2,
>> +    FUNCTION_BLOCK_REL_ID,
>>
>>      VALUE_SYMTAB_BLOCK_ID,
>>      METADATA_BLOCK_ID,
>> @@ -257,8 +257,9 @@ namespace bitc {
>>      SYNCHSCOPE_CROSSTHREAD = 1
>>    };
>>
>> -  // The function body block (FUNCTION_BLOCK_ID) describes function
>> bodies.  It
>> -  // can contain a constant block (CONSTANTS_BLOCK_ID).
>> +  // The function body block (FUNCTION_BLOCK_ID and
>> FUNCTION_BLOCK_REL_ID)
>> +  // describes function bodies.  It can contain a constant block
>> +  // (CONSTANTS_BLOCK_ID).
>>    enum FunctionCodes {
>>      FUNC_CODE_DECLAREBLOCKS    =  1, // DECLAREBLOCKS: [n]
>>
>> diff --git a/lib/Bitcode/Reader/**BitcodeReader.cpp b/lib/Bitcode/Reader/
>> **BitcodeReader.cpp
>> index b69a362..5fd3dfd 100644
>> --- a/lib/Bitcode/Reader/**BitcodeReader.cpp
>> +++ b/lib/Bitcode/Reader/**BitcodeReader.cpp
>>
> ...
>
>> @@ -1495,6 +1497,12 @@ bool BitcodeReader::ParseModule(**bool Resume) {
>>            SeenFirstFunctionBody = true;
>>          }
>>
>> +        // Remember the encoding style for the function.
>> +        if (FunctionsWithBodies.empty())
>> +          return Error("Insufficient function protos");
>>
>
> Why this new error?  Or would it have occurred anyway, later?  Maybe it
> should be an assertion, if it should be impossible.  I think "prototypes"
> would be better than "protos".
>

Yes, it would have occurred later (originally).  I had copied this check
from
the "RememberAndSkipFunctionBody()" function, as a quick way
to test this out.

>
>  +        FuncUseRelativeIDs[**FunctionsWithBodies.back()] =
>> +            (SubBlockID == bitc::FUNCTION_BLOCK_REL_ID);
>>
>
> If the use of relative ids was on a per module basis then this wouldn't be
> needed.
>
>  +
>>          if (RememberAndSkipFunctionBody()**)
>>            return true;
>>          // For streaming bitcode, suspend parsing when we reach the
>> function
>> @@ -1901,8 +1909,15 @@ bool BitcodeReader::**ParseMetadataAttachment() {
>>
>>  /// ParseFunctionBody - Lazily parse the specified function body block.
>>  bool BitcodeReader::**ParseFunctionBody(Function *F) {
>> -  if (Stream.EnterSubBlock(bitc::**FUNCTION_BLOCK_ID))
>> -    return Error("Malformed block record");
>> +  if (FuncUseRelativeIDs[F]) {
>> +    if (Stream.EnterSubBlock(bitc::**FUNCTION_BLOCK_REL_ID))
>> +      return Error("Malformed block record");
>> +    UseRelativeIDs = true;
>> +  } else {
>> +    if (Stream.EnterSubBlock(bitc::**FUNCTION_BLOCK_ID))
>> +      return Error("Malformed block record");
>> +    UseRelativeIDs = false;
>> +  }
>>
>
> Likewise.
>
>
>>    InstructionList.clear();
>>    unsigned ModuleValueListSize = ValueList.size();
>>
> ...
>
>> @@ -2260,8 +2275,10 @@ bool BitcodeReader::**ParseFunctionBody(Function
>> *F) {
>>        }
>>        else {
>>          BasicBlock *FalseDest = getBasicBlock(Record[1]);
>> -        Value *Cond = getFnValueByID(Record[2],
>> Type::getInt1Ty(Context));
>> -        if (FalseDest == 0 || Cond == 0)
>> +        Value *Cond;
>> +        if (getValueConst(Record, 2,
>> +                          NextValueNo, Type::getInt1Ty(Context), Cond) ||
>> +            FalseDest == 0 || Cond == 0)
>>
>
> This seems rather ugly and complicated...  Can't your new method just
> return
> "Cond", and return null in case of an error?  Not sure the name is the best
> either, it doesn't suggest at all to me that it doesn't increment the slot
> number, it sounds like it's for handling constants.
>
>
Yes, I agree it's pretty ugly as it is now.  It was modeled after the
"getValueTypePair" function, which would have signaled an error
when the result was null, so your suggestion should work.

>             return Error("Invalid BR record");
>>          I = BranchInst::Create(TrueDest, FalseDest, Cond);
>>          InstructionList.push_back(I);
>> @@ -2276,9 +2293,10 @@ bool BitcodeReader::**ParseFunctionBody(Function
>> *F) {
>>          Type *OpTy = getTypeByID(Record[1]);
>>          unsigned ValueBitWidth = cast<IntegerType>(OpTy)->**
>> getBitWidth();
>>
>> -        Value *Cond = getFnValueByID(Record[2], OpTy);
>> +        Value *Cond;
>>          BasicBlock *Default = getBasicBlock(Record[3]);
>> -        if (OpTy == 0 || Cond == 0 || Default == 0)
>> +        if (getValueConst(Record, 2, NextValueNo, OpTy, Cond) ||
>> +            OpTy == 0 || Cond == 0 || Default == 0)
>>            return Error("Invalid SWITCH record");
>>
>
> Likewise (more instances follow).
>
> ...
>
>> @@ -2444,7 +2467,8 @@ bool BitcodeReader::**ParseFunctionBody(Function
>> *F) {
>>        InstructionList.push_back(PN);
>>
>>        for (unsigned i = 0, e = Record.size()-1; i != e; i += 2) {
>> -        Value *V = getFnValueByID(Record[1+i], Ty);
>> +        Value *V = 0;
>> +        getValueConst(Record, 1+i, NextValueNo, Ty, V);
>>
>
> You didn't check the return value.
>
>           BasicBlock *BB = getBasicBlock(Record[2+i]);
>>          if (!V || !BB) return Error("Invalid PHI record");
>>          PN->addIncoming(V, BB);
>>
> ...
>
>> diff --git a/lib/Bitcode/Reader/**BitcodeReader.h b/lib/Bitcode/Reader/**
>> BitcodeReader.h
>> index e7c4e94..249d723 100644
>> --- a/lib/Bitcode/Reader/**BitcodeReader.h
>> +++ b/lib/Bitcode/Reader/**BitcodeReader.h
>>
> ...
>
>> @@ -260,14 +274,30 @@ private:
>>      ResVal = getFnValueByID(ValNo, getTypeByID(TypeNo));
>>      return ResVal == 0;
>>    }
>> +  /// getValue - Read a value out of the specified record from slot
>> 'Slot'.
>> +  /// Increment Slot past the number of slots used in the record.
>> +  /// Return true on failure.
>>    bool getValue(SmallVector<uint64_t, 64> &Record, unsigned &Slot,
>> -                Type *Ty, Value *&ResVal) {
>> +                unsigned InstNum, Type *Ty, Value *&ResVal) {
>>      if (Slot == Record.size()) return true;
>>      unsigned ValNo = (unsigned)Record[Slot++];
>> +    // Adjust the ValNo, if it was encoded relative to the InstNum.
>> +    if (UseRelativeIDs)
>> +      ValNo = InstNum - ValNo;
>> +    ResVal = getFnValueByID(ValNo, Ty);
>> +    return ResVal == 0;
>> +  }
>> +  /// getValueConst -- Like getValue, but does not increment the Slot
>> number.
>> +  bool getValueConst(SmallVector<**uint64_t, 64> &Record, unsigned Slot,
>> +                     unsigned InstNum, Type *Ty, Value *&ResVal) {
>> +    if (Slot == Record.size()) return true;
>> +    unsigned ValNo = (unsigned)Record[Slot];
>> +    // Adjust the ValNo, if it was encoded relative to the InstNum.
>> +    if (UseRelativeIDs)
>> +      ValNo = InstNum - ValNo;
>>      ResVal = getFnValueByID(ValNo, Ty);
>>      return ResVal == 0;
>>    }
>>
>
> How about having just one method, eg getValue, and simply decrement the
> slot
> number by hand in those places that need it.  If you really want two,
> implement
> getValue in terms of getValueConst, by calling it then incrementing the
> slot
> number for example.
>
>
Good suggestions, will try it out.

>  -
>>
>>    bool ParseModule(bool Resume);
>>    bool ParseAttributeBlock();
>> diff --git a/lib/Bitcode/Writer/**BitcodeWriter.cpp b/lib/Bitcode/Writer/
>> **BitcodeWriter.cpp
>> index b3f1bb1..e3ebe38 100644
>> --- a/lib/Bitcode/Writer/**BitcodeWriter.cpp
>> +++ b/lib/Bitcode/Writer/**BitcodeWriter.cpp
>>
> ...
>
>> @@ -55,7 +62,7 @@ enum {
>>    CONSTANTS_CE_CAST_Abbrev,
>>    CONSTANTS_NULL_Abbrev,
>>
>> -  // FUNCTION_BLOCK abbrev id's.
>> +  // FUNCTION_BLOCK_REL abbrev id's.
>>
>
> Should this comment have been changed?
>
>     FUNCTION_INST_LOAD_ABBREV = bitc::FIRST_APPLICATION_**ABBREV,
>>    FUNCTION_INST_BINOP_ABBREV,
>>    FUNCTION_INST_BINOP_FLAGS_**ABBREV,
>>
> ...
>
>> @@ -1025,12 +1037,15 @@ static void WriteModuleConstants(const
>> ValueEnumerator &VE,
>>  ///
>>  /// This function adds V's value ID to Vals.  If the value ID is higher
>> than the
>>  /// instruction ID, then it is a forward reference, and it also includes
>> the
>> -/// type ID.
>> +/// type ID.  The value ID that is writtne is encoded as relative to the
>> InstID.
>>
>
> writtne -> written
> is encoded as relative to -> is encoded relative to
>
>   static bool PushValueAndType(const Value *V, unsigned InstID,
>>                               SmallVector<unsigned, 64> &Vals,
>>                               ValueEnumerator &VE) {
>>    unsigned ValID = VE.getValueID(V);
>> -  Vals.push_back(ValID);
>> +  if (EnableOldStyleFunctions)
>> +    Vals.push_back(ValID);
>> +  else
>> +    Vals.push_back(InstID - ValID);
>>    if (ValID >= InstID) {
>>      Vals.push_back(VE.getTypeID(V-**>getType()));
>>      return true;
>>
> ...
>
>> @@ -1164,7 +1191,12 @@ static void WriteInstruction(const Instruction &I,
>> unsigned InstID,
>>        Vals64.push_back(**SwitchRecordHeader);
>>
>>        Vals64.push_back(VE.getTypeID(**SI.getCondition()->getType()))**;
>> -      Vals64.push_back(VE.**getValueID(SI.getCondition()))**;
>> +      // ValueID for non-BB / literal operands are relative to the
>> InstID.
>> +      // (e.g., the condition value).
>> +      if (EnableOldStyleFunctions)
>> +        Vals64.push_back(VE.**getValueID(SI.getCondition()))**;
>> +      else
>> +        Vals64.push_back(InstID - VE.getValueID(SI.getCondition(**)));
>>
>
> Shouldn't you use a "push" method like PushValue for this too?
>

These were different because they used uint64_t with a "Vals64" vector.
I could make a PushValue64.

>
>         Vals64.push_back(VE.**getValueID(SI.getDefaultDest()**));
>>        Vals64.push_back(SI.**getNumCases());
>>        for (SwitchInst::CaseIt i = SI.case_begin(), e = SI.case_end();
>>
> ...
>
>> @@ -1358,8 +1392,13 @@ static void WriteInstruction(const Instruction &I,
>> unsigned InstID,
>>      PushValueAndType(CI.**getCalledValue(), InstID, Vals, VE);  //
>> Callee
>>
>>      // Emit value #'s for the fixed parameters.
>> -    for (unsigned i = 0, e = FTy->getNumParams(); i != e; ++i)
>> -      Vals.push_back(VE.getValueID(**CI.getArgOperand(i)));  // fixed
>> param.
>> +    for (unsigned i = 0, e = FTy->getNumParams(); i != e; ++i) {
>> +      if (FTy->getParamType(i)->**isLabelTy()) {
>>
>
> Is this actually possible?
>
>
I'm not sure =)

I had originally added it because I noticed that the code in BitcodeReader
for FUNC_CODE_INST_CALL had the check:

 if (FTy->getParamType(i)->isLabelTy())
          Args.push_back(getBasicBlock(Record[OpNum]));

I'll take this modification out, since it's orthogonal, and I'm not sure
what the original motivation was.

 +        Vals.push_back(VE.getValueID(**CI.getArgOperand(i)));
>> +      } else {
>> +        PushValue(CI.getArgOperand(i), InstID, Vals, VE);  // fixed
>> param.
>> +      }
>> +    }
>>
>>      // Emit type/value pairs for varargs params.
>>      if (FTy->isVarArg()) {
>>
> ...
>
>> @@ -1514,8 +1553,8 @@ static void WriteFunction(const Function &F,
>> ValueEnumerator &VE,
>>  // Emit blockinfo, which defines the standard abbreviations etc.
>>  static void WriteBlockInfo(const ValueEnumerator &VE, BitstreamWriter
>> &Stream) {
>>    // We only want to emit block info records for blocks that have
>> multiple
>> -  // instances: CONSTANTS_BLOCK, FUNCTION_BLOCK and VALUE_SYMTAB_BLOCK.
>>  Other
>> -  // blocks can defined their abbrevs inline.
>> +  // instances: CONSTANTS_BLOCK, FUNCTION_BLOCK_REL and
>> VALUE_SYMTAB_BLOCK.
>> +  // Other blocks can defined their abbrevs inline.
>>
>
> Should this comment really have changed?
>
>     Stream.EnterBlockInfoBlock(2);
>>
>>    { // 8-bit fixed-width VST_ENTRY/VST_BBENTRY strings.
>> @@ -1603,68 +1642,68 @@ static void WriteBlockInfo(const ValueEnumerator
>> &VE, BitstreamWriter &Stream) {
>>
>>    // FIXME: This should only use space for first class types!
>>
>> -  { // INST_LOAD abbrev for FUNCTION_BLOCK.
>> +  { // INST_LOAD abbrev for FUNCTION_BLOCK_REL.
>>
>
> Likewise (occurs many more times).
>
> Ciao, Duncan.
> ______________________________**_________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/**mailman/listinfo/llvmdev<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>

Thanks!
- Jan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120926/09faec2e/attachment.html>