[lldb-dev] UnicodeDecodeError for serialize SBValue description

Mon Mar 28 16:10:24 PDT 2016

Thanks, I will try this escape mechanism for the returned C string.
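
A minimal sketch of what that escaping could look like on the Python side,
assuming Python 3 and assuming the description arrives as raw bytes.
`to_json_safe` is a hypothetical helper name, not part of the LLDB API. It
decodes with the "backslashreplace" error handler so that bytes which are not
valid UTF-8 become literal \xNN text instead of raising UnicodeDecodeError:

```python
import json


def to_json_safe(raw):
    """Return a value that json.dumps() can serialize (hypothetical helper)."""
    if raw is None:
        return None
    if isinstance(raw, bytes):
        # Bytes that are not valid UTF-8 turn into literal "\xNN" text
        # rather than raising UnicodeDecodeError.
        return raw.decode("utf-8", errors="backslashreplace")
    return raw


# Made-up sample resembling the bogus 'data_' contents from this thread.
description = b'(char *) "\xc9\xc3UH\x89\xe5H"'
payload = json.dumps({"description": to_json_safe(description)})
```

The result is lossy in the sense that the reader sees \xNN text rather than
the original bytes, which seems acceptable for a debugger display string.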

On Mon, Mar 28, 2016 at 1:04 PM, Greg Clayton <gclayton at apple.com> wrote:

>
> > On Mar 28, 2016, at 11:38 AM, Jeffrey Tan <jeffrey.fudan at gmail.com>
> wrote:
> >
> > Thanks Greg for the detailed explanation, very helpful.
> > 1. Just to confirm, the weird string displayed is because 'data_' points
> to some random memory?
>
> Yes.
>
> > So what gdb displays is also just random memory content, not something
> more meaningful than what we show? I thought we (lldb) did not display
> std::string content well but gdb displayed it correctly.
>
> The "size_" variable is zero, so anything that GDB displays is
> sheer luck: it depends on whatever memory "data_" happens to point to. You
> can't rely on any contents of "data_" since it is clearly bogus. What you
> really want to see is just the string that "std::string" points to:
>
> (std::string) my_string = "Hello"
>
> Or for a std::string that contains 0, 1, and 2 as characters:
>
> (std::string) my_string = "\x00\x01\x02"
>
>
>
> > 2. I guess the std::string formatter did not kick in because our company
> may link some special stl implementation. Let me share our binary for you
> to confirm.
>
> You can get some help from Enrico to see why things are not displaying
> correctly. My guess is this C++ standard library is different from the ones
> that we added support for.
>
> > 3. I dumped the content of the object we try to json.dumps() against,
> here is the content:
> > response: {'id': 57, 'result': {'result': [{'name': 'data_',
> > 'value': {'type': 'object', 'description': '(char *)
> > "\xc9\xc3UH\\x89\xe5H\x89}\xf8H\\x8bE\xf8]\xc3\x90UH\\x89\xe5H\x83\xec\x10H\\x89}\xf8H\\x8bE\xf8H\\x89\xc7\xe8~\\xb4\xca\xff\\x90\xc9\xc3UH\\x89\xe5SH\\x83\xec\x18H\\x89}\xe8H\x89u\xe0H\x8bE\xe0H\x89\xc7\xe8\\x9e\xff\xff\xffH\\x8b\\x18H\\x8bE\xe8H\x89\xc7\xe8O\\xb4\xca\xffH\\x89\xc6\xbf\\b"',
> > 'objectId': 'RemoteObjectManager.118'}},
> > {'name': 'size_', 'value': {'type': 'object', 'description':
> > '(std::size_t) 0'}}, {'name': 'capacity_', 'value': {'type': 'object',
> > 'description': '(std::size_t) 1441151880758558720'}}]}}
> > So it seems the problem is that json.dumps() tries to treat the raw
> byte array as UTF-8, which fails.
> > So we need to figure out how to escape the raw byte array into a string so
> that we can json.dumps() it. The key question is how we know the correct
> encoding of the byte array.
>
> It doesn't really matter. Just know that any of the strings from:
>
> const char *SBValue::GetName();
> const char *SBValue::GetTypeName ();
> const char *SBValue::GetDisplayTypeName();
> const char *SBValue::GetValue();
> const char *SBValue::GetSummary();
> const char *SBValue::GetObjectDescription();
> const char *SBValue::GetLocation ();
>
> Will need to be escaped.
>
> > Is my understanding correct that only the formatter has the knowledge to
> decode the byte array correctly?
>
> We dump the values as strings. You won't get bytes out. You might get UTF8
> bytes or other things that JSON might interpret as special characters, so
> any C strings that you get from the above calls will just need to be
> escaped if needed.
>
> > If we fail to find a type formatter (which is the case here) and get a raw
> field with a byte array, we have no knowledge of the encoding, so either we
> have to guess a default encoding and try it, or just display the raw byte
> array content instead of decoding it?
>
> Again, this is all C strings. I don't think anything else matters.
>
> Our JSON.cpp has the following:
>
> int
> JSONParser::GetEscapedChar(bool &was_escaped)
> {
>     was_escaped = false;
>     const char ch = GetChar();
>     if (ch == '\\')
>     {
>         was_escaped = true;
>         const char ch2 = GetChar();
>         switch (ch2)
>         {
>             case '"':
>             case '\\':
>             case '/':
>             default:
>                 break;
>
>             case 'b': return '\b';
>             case 'f': return '\f';
>             case 'n': return '\n';
>             case 'r': return '\r';
>             case 't': return '\t';
>             case 'u':
>                 {
>                     const int hi_byte = DecodeHexU8();
>                     const int lo_byte = DecodeHexU8();
>                     if (hi_byte >=0 && lo_byte >= 0)
>                         return hi_byte << 8 | lo_byte;
>                     return -1;
>                 }
>                 break;
>         }
>         return ch2;
>     }
>     return ch;
> }
>
> You can see how it is used when the JSON parser is parsing in
> JSONParser::GetToken() in the '"' case.
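
GetEscapedChar() covers the decoding direction. The encoding direction, which
is what a client needs before emitting a string as JSON, can be sketched as
below. This is an illustration in Python, not code from LLDB;
`escape_json_string` and `_SHORT_ESCAPES` are hypothetical names, and the
sketch assumes the input is already text rather than raw bytes:

```python
import json

# Inverse of the short escapes that GetEscapedChar() decodes.
_SHORT_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_json_string(text):
    """Escape text for use inside a double-quoted JSON string."""
    out = []
    for ch in text:
        if ch in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Remaining control characters use the \uXXXX form.
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


# A summary like the "\x00\x01\x02" example earlier in the thread.
summary = '"\x00\x01\x02"'
encoded = '"' + escape_json_string(summary) + '"'
```

In practice json.dumps() already performs this escaping for text input, so a
hand-written version like this is mainly useful for seeing which characters
need attention.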