[lldb-dev] UnicodeDecodeError for serialize SBValue description

Mon Mar 28 13:04:49 PDT 2016

> On Mar 28, 2016, at 11:38 AM, Jeffrey Tan <jeffrey.fudan at gmail.com> wrote:
> 
> Thanks Greg for the detailed explanation, very helpful. 
> 1. Just to confirm, the weird string displayed is because 'data_' points to some random memory?

Yes.

> So what gdb displays is also some random memory content not something that more meaningful than us? I thought we(lldb) did not display std::string content well but gdb does it correct. 

The "size_" variable is zero, so anything GDB displays is sheer luck: it is just whatever happens to be in the memory that "data_" points to. You can't rely on the contents of "data_" since it is clearly bogus. What you really want to see is just the string that the "std::string" holds:

(std::string) my_string = "Hello"

Or for a std::string that contains 0, 1, and 2 as characters:

(std::string) my_string = "\x00\x01\x02"
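
To make the role of "size_" concrete, here is a minimal Python sketch of a summary that reads only the first "size_" bytes of the buffer. This is not LLDB's actual std::string formatter; the function name summarize_string and its two arguments are hypothetical stand-ins for the memory behind "data_" and the value of "size_".

```python
def summarize_string(data, size):
    """Render the first `size` bytes of `data` as a quoted summary.

    Hypothetical helper: `data` stands in for the memory that data_ points
    to and `size` for the value of size_.
    """
    pieces = []
    for byte in bytearray(data[:size]):
        ch = chr(byte)
        if ch in ('"', '\\'):
            # Escape the characters that would break the quoted form.
            pieces.append('\\' + ch)
        elif 0x20 <= byte < 0x7f:
            # Printable ASCII is shown as-is.
            pieces.append(ch)
        else:
            # Everything else is shown as a hex escape.
            pieces.append('\\x%02x' % byte)
    return '"' + ''.join(pieces) + '"'


print(summarize_string(b"Hello", 5))                # "Hello"
print(summarize_string(b"\x00\x01\x02", 3))         # "\x00\x01\x02"
print(summarize_string(b"\xc9\xc3UH\x89\xe5", 0))   # ""
```

With a size of zero the summary is empty no matter what the buffer holds, which is why the bytes shown for "data_" in your dump carry no meaning.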

> 2. I guess the std::string formatter did not kick in because our company may link some special stl implementation. Let me share our binary for you to confirm.

You can get some help from Enrico to see why things are not displaying correctly. My guess is this C++ standard library is different from the ones that we added support for.

> 3. I dumped the content of the object we try to json.dumps() against, here is the content:
> response: {'id':      57, 'result': {'result': [{'name': 'data_', 'value': {'type': 'object', 'description': '(char *) "\xc9\xc3UH\\x89\xe5H\x89}\xf8H\\x8bE\xf8]\xc3\x90UH\\x89\xe5H\x83\xec\x10H\\x89}\xf8H\\x8bE\xf8H\\x89\xc7\xe8~\\xb4\xca\xff\\x90\xc9\xc3UH\\x89\xe5SH\\x83\xec\x18H\\x89}\xe8H\x89u\xe0H\x8bE\xe0H\x89\xc7\xe8\\x9e\xff\xff\xffH\\x8b\\x18H\\x8bE\xe8H\x89\xc7\xe8O\\xb4\xca\xffH\\x89\xc6\xbf\\b"', 'objectId': 'RemoteObjectManager.118'}}, {'name': 'size_', 'value': {'type': 'object', 'description': '(std::size_t) 0'}}, {'name': 'capacity_', 'value': {'type': 'object', 'description': '(std::size_t) 1441151880758558720'}}]}}
> So seems that the problem is json.dumps() is trying to treat the raw byte array as utf8 which failed. 
> So we need to figure out how to escape the raw byte array into string so that we can json.dumps() it. The key question is how do we know the correct encoding of the byte array.

It doesn't really matter. Just know that any of the strings returned by the following calls will need to be escaped:

const char *SBValue::GetName();
const char *SBValue::GetTypeName();
const char *SBValue::GetDisplayTypeName();
const char *SBValue::GetValue();
const char *SBValue::GetSummary();
const char *SBValue::GetObjectDescription();
const char *SBValue::GetLocation();

> Is my understanding correct that only the formatter has the knowledge to decode the byte array correctly?

We dump the values as C strings, so you won't get a raw byte array out. Those strings may contain UTF-8 bytes, or other characters that JSON treats as special, so any C string you get from the calls above just needs to be escaped as needed.
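
As a rough sketch of that escaping on the Python side (assuming Python 3.5 or later, where the "backslashreplace" error handler works for decoding), the helper below turns whatever one of the calls above returned into text that json.dumps() will accept. The name json_safe is hypothetical and is not part of the LLDB API. Bytes that are not valid UTF-8 are kept as visible \xNN text instead of raising UnicodeDecodeError.

```python
import json


def json_safe(raw):
    """Return text that json.dumps() can always serialize.

    `raw` stands in for the result of a call such as SBValue.GetSummary();
    it may be None, text, or a byte string that is not valid UTF-8.
    """
    if raw is None:
        return None
    if isinstance(raw, bytes):
        # Invalid bytes become literal "\xNN" text rather than an exception.
        return raw.decode("utf-8", errors="backslashreplace")
    return raw


# A shortened version of the bogus "data_" description from the dump.
description = b'(char *) "\xc9\xc3UH\x89\xe5"'
payload = json.dumps({"name": "data_", "description": json_safe(description)})
print(payload)
```

json.dumps() then takes care of the JSON-level escaping (quotes, backslashes, control characters), so the only job left for the caller is to make sure every value is text.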

> If we fail to find a type formatter(which is this case) and get a raw field with byte array, we have no knowledge of the encoding so either we have to guess one default encoding and try it or just display the raw byte array content instead of decoding it?  

Again, this is all C strings. I don't think anything else matters.

Our JSON.cpp has the following:

int
JSONParser::GetEscapedChar(bool &was_escaped)
{
    was_escaped = false;
    const char ch = GetChar();
    if (ch == '\\')
    {
        was_escaped = true;
        const char ch2 = GetChar();
        switch (ch2)
        {
            case '"':
            case '\\':
            case '/':
            default:
                break;

            case 'b': return '\b';
            case 'f': return '\f';
            case 'n': return '\n';
            case 'r': return '\r';
            case 't': return '\t';
            case 'u':
                {
                    const int hi_byte = DecodeHexU8();
                    const int lo_byte = DecodeHexU8();
                    if (hi_byte >=0 && lo_byte >= 0)
                        return hi_byte << 8 | lo_byte;
                    return -1;
                }
                break;
        }
        return ch2;
    }
    return ch;
}

You can see how it is used in JSONParser::GetToken(), in the '"' case.
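
For readers who find Python easier to follow, the same control flow can be sketched as below. This is an illustrative port only, not LLDB code: get_escaped_char and unescape are hypothetical names, and the sketch assumes the input never ends in the middle of an escape sequence.

```python
import string

SIMPLE_ESCAPES = {'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}


def get_escaped_char(text, pos):
    """Decode one character of a JSON string body starting at `pos`.

    Returns (code_point, next_pos, was_escaped). code_point is -1 when a
    four-digit hex escape is malformed, mirroring the C++ above.
    """
    ch = text[pos]
    if ch != '\\':
        return ord(ch), pos + 1, False
    ch2 = text[pos + 1]
    if ch2 in SIMPLE_ESCAPES:
        return ord(SIMPLE_ESCAPES[ch2]), pos + 2, True
    if ch2 == 'u':
        # Four hex digits: the high byte followed by the low byte.
        digits = text[pos + 2:pos + 6]
        if len(digits) == 4 and all(d in string.hexdigits for d in digits):
            return int(digits, 16), pos + 6, True
        return -1, pos + 6, True
    # A quote, backslash, slash, or anything else stands for itself.
    return ord(ch2), pos + 2, True


def unescape(body):
    """Decode a whole string body; malformed escapes become U+FFFD."""
    out, pos = [], 0
    while pos < len(body):
        code, pos, _ = get_escaped_char(body, pos)
        out.append(chr(code) if code >= 0 else '\ufffd')
    return ''.join(out)


print(unescape(r'a\tb \u0041 \"quoted\"'))
```

Whatever escaping you apply before handing the strings to JSON just has to be something a reader like this can undo.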