[lldb-dev] UnicodeDecodeError for serialize SBValue description

Wed Apr 20 15:08:40 PDT 2016

Hi Enrico,

Instead of evaluating c_str() in the inferior, I decided to decode the
information from the fbstring_core fields. I got it working, but I have two
questions about it.

type summary add -F data_formatter.folly_string_formatter -x "std::fbstring_core<char>"

Here is the output:

fr v -T small
(std::string) small = "small"

fr v -T small.store_
(std::fbstring_core<char>) small.store_ = None

fr v -T small.store_.ml_
(std::fbstring_core<char>::MediumLarge) small.store_.ml_ = None

Questions:
1. I only added a formatter for std::fbstring_core<char>, so why does it
also work for std::string?
2. Why do small.store_ and small.store_.ml_ now show a summary of None? I
would not expect the data formatter to apply to them.

Btw: here is the implementation of fbstring_core
https://github.com/facebook/folly/blob/master/folly/FBString.h
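
For reference, here is a minimal sketch of that field decoding in plain Python, working on the 24 raw bytes of the fbstring_core union. The constants (23-byte small capacity, category in the top two bits of the last byte, 64-bit little-endian layout) are assumptions based on my reading of FBString.h and on the gdb dump quoted further down, where a 3-character small string ends in the byte 0x14. decode_fbstring_core is a hypothetical helper, not part of lldb or folly.

```python
import struct

# Assumptions (not confirmed by lldb or folly documentation): 64-bit
# little-endian target, 24-byte union, category stored in the top two
# bits of the last byte.
MAX_SMALL_SIZE = 23
CATEGORY_MASK = 0xC0


def decode_fbstring_core(raw):
    """Decode the 24 raw bytes of an fbstring_core union.

    Returns ('small', size, data_bytes) for the in-place representation, or
    ('medium_large', size, data_pointer) when the characters live elsewhere.
    """
    raw = bytearray(raw)
    if len(raw) != MAX_SMALL_SIZE + 1:
        raise ValueError("expected 24 bytes, got %d" % len(raw))
    last = raw[MAX_SMALL_SIZE]
    if last & CATEGORY_MASK == 0:
        # Small: the last byte holds the spare capacity, 23 - size.
        size = MAX_SMALL_SIZE - last
        if size < 0:
            raise ValueError("corrupt small string, last byte 0x%02x" % last)
        return ("small", size, bytes(raw[:size]))
    # Medium/large: data_ pointer, size_, capacity_ as three 64-bit words.
    data_ptr, size, _capacity = struct.unpack("<QQQ", bytes(raw))
    return ("medium_large", size, data_ptr)
```

In a summary function the 24 bytes would come from the store_ value, and in the medium/large case the characters would still have to be read from process memory at the returned pointer.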

Thanks
Jeffrey

On Wed, Apr 13, 2016 at 11:08 AM, Enrico Granata <egranata at apple.com> wrote:

> In theory what you're doing looks like it should be supported. I am not
> sure why your example is failing the way it is.
>
> Is your variable a global maybe?
>
> Also, using the variable name is the wrong thing to do. If you have a
> class with a std::string member, the name is going to return the wrong
> thing. You would want to at least use the expression path - and even then
> there are some cases where we can't cons up a proper expression path.
>
> Sent from my iPhone
>
> On Apr 13, 2016, at 11:02 AM, Jeffrey Tan <jeffrey.fudan at gmail.com> wrote:
>
> I did a quick test calling SBFrame.EvaluateExpression('string.c_str()')
> for the summary. The result shows that valobj.GetFrame() returns None. Does
> this mean this is not supported?
>
> def DoTest(valobj,internal_dict):
>     print "valobj: %s" % valobj
>     print "valobj.GetFrame(): %s" % valobj.GetFrame()
>     summaryValue = valobj.GetFrame().EvaluateExpression(valobj.name + '.c_str()')
>     print "summaryValue: %s" % summaryValue
>     return 'Summary from c_str(): %s ' % summaryValue.GetSummary()
>
> type summary add -F DoTest -x "std::fbstring_core<char>"
>
> Output:
> valobj.GetFrame(): No value
> summaryValue: No value
> valobj: (std::string) $6 = {
>   store_ = Summary from c_str(): None
> }
>
> Jeffrey
>
> On Wed, Apr 13, 2016 at 10:11 AM, Jeffrey Tan <jeffrey.fudan at gmail.com>
> wrote:
>
>> One quick question: do we support getting the type summary string from an
>> inferior method call? After reading our own fbstring_core code, I found that
>> I would need to mirror much of what the fbstring_core c_str() method does in
>> Python. Could we just use ${var.c_str()} as the type summary? I suspect one
>> concern is side effects (the inferior method may throw an exception or
>> cause other problems), but I do not see why this can't be done. Allowing
>> it would keep a single source of truth for the data formatter (the source
>> code) instead of risking copies that drift out of sync. For example, if the
>> std::string author changes its implementation, the Python data formatter
>> associated with it has to be modified at the same time, which is a
>> maintenance nightmare.
>>
>> Jeffrey
>>
>> On Thu, Apr 7, 2016 at 10:33 AM, Enrico Granata <egranata at apple.com>
>> wrote:
>>
>>>
>>> On Apr 6, 2016, at 7:31 PM, Jeffrey Tan <jeffrey.fudan at gmail.com> wrote:
>>>
>>> Thanks Enrico. This is very detailed! I will take a look.
>>> Btw: originally, I was hoping that a data formatter could be added without
>>> changing the source code. For example, given an XML/JSON file describing
>>> the memory layout/structure of the data structure, lldb could parse it
>>> and deduce the formatting. This is the approach used by the data
>>> visualizers in the VS debugger:
>>> https://msdn.microsoft.com/en-us/library/jj620914.aspx
>>> This would make adding data formatters more extensible/flexible. Is there
>>> any reason we did not take this approach?
>>>
>>>
>>> The way I understand the Natvis system, it allows one to provide a bunch
>>> of expressions that describe how the debugger would go about retrieving the
>>> interesting data bits.
>>> This has the bonus of being really easy, since you’re writing code in
>>> the same language/context of the types you’re formatting.
>>> On the other hand it has a few drawbacks, in terms of performance as
>>> well as safety (imagine trying to run code on an object when said object is
>>> in an incoherent state).
>>> The LLDB approach, on the other hand, is that you should try to not run
>>> code when providing these data formatters. In order to do that, we vend an
>>> API that can do things such as retrieve child values, read memory, cast
>>> values, …, all without code execution.
>>> Once you have this kind of API that is not expressed in your source
>>> language, you might just as well describe it in a scripting language. Hence
>>> were born the Python data formatters.
>>> In order for us to gain even more performance for native system types
>>> that we know we’re gonna run into all the time, we then switched a bunch of
>>> the “mission critical” formatters from Python to C++.
>>> The Python extension points are still available, as Jim pointed out, and
>>> you are more than welcome to use those instead of modifying the debugger
>>> core.
>>>
>>> Jeffrey
>>>
>>> On Wed, Apr 6, 2016 at 11:49 AM, Enrico Granata <egranata at apple.com>
>>> wrote:
>>>
>>>>
>>>> On Apr 5, 2016, at 2:42 PM, Jeffrey Tan <jeffrey.fudan at gmail.com>
>>>> wrote:
>>>>
>>>> Hi Enrico,
>>>>
>>>> Any suggestion/example of how to add a data formatter for our own STL
>>>> string? From the output below I can see we are using our own
>>>> "fbstring_core", which I assume I need to write a type summary for:
>>>>
>>>> frame variable corpus -T
>>>> (const string &const) corpus = error: summary string parsing error: {
>>>>   (std::fbstring_core<char>) store_ = {
>>>>     (std::fbstring_core<char>::(anonymous union))  = {
>>>>       (char [24]) small_ = "www"
>>>>       (std::fbstring_core<char>::MediumLarge) ml_ = {
>>>>         (char *) data_ = 0x0000000000777777
>>>> "H\x89U\xa8H\x89M\xa0L\x89E\x98H\x8bE\xa8H\x89��_U��D\x88e�H\x8bE\xa0H\x89��]U��H\x89�H\x8dE�H\x89�H\x89���
>>>> ��L\x8dm�H\x8bE\x98H\x89��IU��\x88]�L\x8be\xb0L\x89��
>>>>         (std::size_t) size_ = 0
>>>>         (std::size_t) capacity_ = 1441151880758558720
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>>
>>>> Admittedly, this is going to be a little vague since I haven’t really
>>>> seen your code and I am only working off of one sample
>>>>
>>>> There’s going to be two parts to getting this to work:
>>>>
>>>> *Part 1 - Formatting fbstring_core*
>>>>
>>>> At a glance, an fbstring_core<char> can be backed by two
>>>> representations. A “small” representation (a char array), and a
>>>> “medium/large” representation (a char* + a size).
>>>> I assume that the way you tell one from the other is
>>>>
>>>> if (size == 0) small
>>>> else medium-large
>>>>
>>>> If my assumption is not correct, you’ll need to discover what the
>>>> correct discriminator logic is - the class has to know, and so do you :-)
>>>>
>>>> Armed with that knowledge, look in lldb
>>>> source/Plugins/Language/CPlusPlus/Formatters/LibCxx.cpp
>>>> There’s a bunch of code that deals with formatting llvm’s libc++
>>>> std::string - which follows a very similar logic to your class
>>>>
>>>> ExtractLibcxxStringInfo() is the function that handles discovering
>>>> which layout the string uses - where the data lives - and how much data
>>>> there is
>>>>
>>>> Once you have told yourself how much data there is (the size) and where
>>>> it lives (array or pointer), LibcxxStringSummaryProvider() has the
>>>> easy task - it sets up a StringPrinter, tells it how much data to print,
>>>> where to get it from, and then delegates the StringPrinter to do the grunt
>>>> work
>>>> StringPrinter is a nifty little tool - it can handle generating
>>>> summaries for different kinds of strings (UTF8? UTF16? we got it - is a \0
>>>> a terminator? what quote character would you like? …) - you point it at
>>>> some data, set up a few options, and it will generate a printable
>>>> representation for you - if your string type is doing anything out of the
>>>> ordinary, let’s talk - I am definitely open to extending StringPrinter to
>>>> handle even more magic
>>>>
>>>> *Part 2 - Teaching std::string that it can be backed by an
>>>> fbstring_core*
>>>>
>>>> At the end of part 1, you’ll probably end up with a
>>>> FBStringCoreSummaryProvider() - now you need to teach LLDB about it.
>>>> The obvious thing you could do would be to go in
>>>> CPlusPlusLanguage::GetFormatters(), add a LoadFBStringFormatter(g_category)
>>>> to it - and then imitate - say - LoadLibCxxFormatters()
>>>>
>>>>     AddCXXSummary(cpp_category_sp, lldb_private::formatters::
>>>> FBStringCoreSummaryProvider, "fbstringcore summary provider",
>>>> ConstString("std::fbstring_core<.+>"), stl_summary_flags, true);
>>>>
>>>> That will work - but what you would see is:
>>>>
>>>> (const string &const) corpus = error: summary string parsing error: {
>>>>   (std::fbstring_core<char>) store_ = "www"
>>>>
>>>>
>>>> You wanna do
>>>>
>>>> (lldb) log enable lldb formatters
>>>> (lldb) frame variable -T corpus
>>>>
>>>> It will list one or more typenames - the most specific one is the one
>>>> you like (e.g. for libc++ we get std::__1::string - this is how we tell
>>>> ourselves this is the std::string from libc++)
>>>> Once you find that typename, you’ll make a new formatter -
>>>> FBStringSummaryProvider() - and register that formatter with that very
>>>> specific typename
>>>>
>>>> All that FBStringSummaryProvider() has to do is get the “store_” member
>>>> (ValueObject::GetChildMemberWithName() is your friend) - and pass it down
>>>> to FBStringCoreSummaryProvider()
>>>>
>>>>
>>>> I understand this may seem a little convoluted and arcane at first -
>>>> but feel free to ask more questions, and I’ll try to help out!
>>>>
>>>> Thanks.
>>>> Jeffrey
>>>>
>>>> On Mon, Mar 28, 2016 at 11:38 AM, Enrico Granata <egranata at apple.com>
>>>> wrote:
>>>>
>>>>> This is kind of orthogonal to your problem, but the reason why you are
>>>>> not seeing the kind of simplified printing Greg is suggesting, is because
>>>>> your std::string doesn’t look like any of the kinds we recognize
>>>>>
>>>>> Specifically, LLDB data formatters work by matching against type
>>>>> names, and once they recognize a typename, then they try to inspect the
>>>>> variable in order to grab a summary
>>>>> In your example, your std::string exposes a layout that we are not
>>>>> handling - hence we bail out of the formatter and we fall back to the raw
>>>>> view
>>>>>
>>>>> If you want pretty printing to work, you’ll need to write a data
>>>>> formatter
>>>>>
>>>>> There are a few avenues. The obvious easy one is to extend the
>>>>> existing std::string formatter to recognize your type’s internal layout.
>>>>> If one were signing up for more infrastructure work, they could decide
>>>>> to try and detect shared library loads and load formatters that match with
>>>>> whatever libraries are being loaded.
>>>>>
>>>>> On Mar 28, 2016, at 9:47 AM, Greg Clayton via lldb-dev <
>>>>> lldb-dev at lists.llvm.org> wrote:
>>>>>
>>>>> So you need to be prepared to escape any text that can have special
>>>>> characters. A "std::string" or any container can contain special
>>>>> characters. If you are encoding stuff into JSON, you will either need to
>>>>> escape any special characters, or hex encode the string into ASCII hex
>>>>> bytes.
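
A sketch of the hex-encoding option in Python, assuming the summary is available as raw bytes. The helper names summary_to_json and summary_from_json are hypothetical.

```python
import binascii
import json


def summary_to_json(name, raw):
    """Serialize a variable summary with its bytes hex-encoded as ASCII."""
    hex_text = binascii.hexlify(bytes(raw)).decode("ascii")
    return json.dumps({"name": name, "summary_hex": hex_text})


def summary_from_json(text):
    """Recover the variable name and the original bytes."""
    obj = json.loads(text)
    return obj["name"], binascii.unhexlify(obj["summary_hex"])
```

The encoded form contains only hex digits, so it survives JSON regardless of what the bytes were, at the cost of not being human-readable on the wire.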
>>>>>
>>>>> In debuggers we often get bogus data because variables are not
>>>>> initialized, but the compiler tells us that a variable is valid in address
>>>>> range [0x1000-0x2000), but it actually is [0x1200-0x2000). If we read a
>>>>> variable in this case, a std::string might contain bogus data and the bytes
>>>>> might not make sense. So you always have to be prepared for bad data.
>>>>>
>>>>> If we look at:
>>>>>
>>>>>  store_ = {
>>>>>     = {
>>>>>      small_ = "www"
>>>>>      ml_ = (data_ =
>>>>>
>>>>> "��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b",
>>>>> size_ = 0, capacity_ = 1441151880758558720)
>>>>>    }
>>>>>  }
>>>>> }
>>>>>
>>>>> We can see the "size_" is zero, and capacity_ is 1441151880758558720
>>>>> (which is 0x1400000000000000). "data_" seems to be some random pointer.
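
The hex conversion can be checked with a few lines of Python. The second half is an interpretation rather than something stated above: it assumes a little-endian target, where the most significant byte of capacity_ is stored last and therefore shares storage with the final byte of the 24-byte union.

```python
import struct

capacity = 1441151880758558720
print(hex(capacity))

# On a little-endian target the most significant byte is stored last,
# so it overlaps the final byte of the union (small_[23]).
last_byte = bytearray(struct.pack("<Q", capacity))[-1]
print(last_byte)
```

That byte is 0x14, the same value as the "\024" that ends small_ in the gdb dump further down the thread.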
>>>>>
>>>>> On MacOSX, we have a special formatting code that displays std::string
>>>>> in CPlusPlusLanguage.cpp that gets installed in the LoadLibCxxFormatters()
>>>>> or LoadLibStdcppFormatters() functions with code like:
>>>>>
>>>>>    lldb::TypeSummaryImplSP std_string_summary_sp(new
>>>>> CXXFunctionSummaryFormat(stl_summary_flags,
>>>>> lldb_private::formatters::LibcxxStringSummaryProvider, "std::string summary
>>>>> provider"));
>>>>>    cpp_category_sp->GetTypeSummariesContainer()->Add(ConstString("std::__1::string"),
>>>>> std_string_summary_sp);
>>>>>
>>>>> Special flags are set on std::string to say "don't show children of
>>>>> this and just show a summary". So for the following code:
>>>>>
>>>>> std::string h ("hello");
>>>>>
>>>>> You should just see:
>>>>>
>>>>> (lldb) fr var h
>>>>> (std::__1::string) h = "hello"
>>>>>
>>>>> If you take a look at the normal value in the raw we see:
>>>>>
>>>>> (lldb) fr var --raw h
>>>>> (std::__1::string) h = {
>>>>>  __r_ = {
>>>>>    std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char,
>>>>> std::__1::char_traits<char>, std::__1::allocator<char> >::__rep,
>>>>> std::__1::allocator<char>, 2> = {
>>>>>      __first_ = {
>>>>>         = {
>>>>>          __l = {
>>>>>            __cap_ = 122511465736202
>>>>>            __size_ = 0
>>>>>            __data_ = 0x0000000000000000
>>>>>          }
>>>>>          __s = {
>>>>>             = {
>>>>>              __size_ = '\n'
>>>>>              __lx = '\n'
>>>>>            }
>>>>>            __data_ = {
>>>>>              [0] = 'h'
>>>>>              [1] = 'e'
>>>>>              [2] = 'l'
>>>>>              [3] = 'l'
>>>>>              [4] = 'o'
>>>>>              [5] = '\0'
>>>>>              [6] = '\0'
>>>>>              [7] = '\0'
>>>>>              [8] = '\0'
>>>>>              [9] = '\0'
>>>>>              [10] = '\0'
>>>>>              [11] = '\0'
>>>>>              [12] = '\0'
>>>>>              [13] = '\0'
>>>>>              [14] = '\0'
>>>>>              [15] = '\0'
>>>>>              [16] = '\0'
>>>>>              [17] = '\0'
>>>>>              [18] = '\0'
>>>>>              [19] = '\0'
>>>>>              [20] = '\0'
>>>>>              [21] = '\0'
>>>>>              [22] = '\0'
>>>>>            }
>>>>>          }
>>>>>          __r = {
>>>>>            __words = {
>>>>>              [0] = 122511465736202
>>>>>              [1] = 0
>>>>>              [2] = 0
>>>>>            }
>>>>>          }
>>>>>        }
>>>>>      }
>>>>>    }
>>>>>  }
>>>>> }
>>>>>
>>>>> So the main question is why are our "std::string" formatters not
>>>>> kicking in for you. That comes down to a typename match, or the format of
>>>>> the string isn't what the formatter is expecting.
>>>>>
>>>>> But again, since your std::string can contain anything, you will need
>>>>> to escape any and all text that is encoded into JSON to ensure it doesn't
>>>>> contain anything JSON can't deal with.
>>>>>
>>>>> On Mar 27, 2016, at 9:20 PM, Jeffrey Tan via lldb-dev <
>>>>> lldb-dev at lists.llvm.org> wrote:
>>>>>
>>>>> Thanks Siva. All the DW_TAG_member related errors seem to have gone away
>>>>> after patching with your fix. The current problem is handling the decoding.
>>>>>
>>>>> Here is the correct decoding from gdb, which might be useful:
>>>>> (gdb) p corpus
>>>>> $3 = (const std::string &) @0x7fd133cfb888: {
>>>>>  static npos = 18446744073709551615, store_ = {
>>>>>    static kIsLittleEndian = <optimized out>,
>>>>>    static kIsBigEndian = <optimized out>, {
>>>>>      small_ = "www", '\000' <repeats 20 times>, "\024", ml_ = {
>>>>>        data_ = 0x777777 <std::_Any_data::_M_access<void
>>>>> folly::fibers::Baton::waitFiber<folly::fibers::FirstArgOf<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1},
>>>>> void>::type::value_type
>>>>> folly::fibers::await<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}>(folly::fibers::FirstArgOf&&)::{lambda()#1}>(folly::fibers::FiberManager&,
>>>>> folly::fibers::FirstArgOf<folly::fibers::FirstArgOf<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1},
>>>>> void>::type::value_type
>>>>> folly::fibers::await<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::{lambda(folly::fibers::Promise<facebook::servicerouter::RequestDispatcherBase<facebook::servicerouter::ThriftDispatcher>::prepareForSelection(facebook::servicerouter::DispatchContext&)::SelectionResult>)#1}>(folly::fibers::FirstArgOf&&)::{lambda()#1},
>>>>> void>::type::value_type)::{lambda(folly::fibers::Fiber&)#1}*>() const+25>
>>>>> "\311\303UH\211\345H\211}\370H\213E\370]ÐUH\211\345H\203\354\020H\211}\370H\213E\370H\211\307\350~\264\312\377\220\311\303UH\211\345SH\203\354\030H\211}\350H\211u\340H\213E\340H\211\307\350\236\377\377\377H\213\030H\213E\350H\211\307\350O\264\312\377H\211ƿ\b",
>>>>> size_ = 0,
>>>>>        capacity_ = 1441151880758558720}}}}
>>>>>
>>>>> Utf-16 does not seem to decode it, while 'latin-1' does:
>>>>>
>>>>> '\xc9'.decode('utf-16')
>>>>>
>>>>> Traceback (most recent call last):
>>>>>  File "<stdin>", line 1, in <module>
>>>>>  File
>>>>> "/mnt/gvfs/third-party2/python/55c1fd79d91c77c95932db31a4769919611c12bb/2.7.8/centos6-native/da39a3e/lib/python2.7/encodings/utf_16.py",
>>>>> line 16, in decode
>>>>>    return codecs.utf_16_decode(input, errors, True)
>>>>> UnicodeDecodeError: 'utf16' codec can't decode byte 0xc9 in position
>>>>> 0: truncated data
>>>>>
>>>>> '\xc9'.decode('latin-1')
>>>>>
>>>>> u'\xc9'
>>>>>
>>>>> Instead of guessing what kind of decoding I should use, I would use
>>>>> 'ensure_ascii=False' to prevent the crash for now.
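
One way to avoid guessing, sketched in Python: latin-1 maps every byte value to the code point with the same number, so decoding never fails and the original bytes can be recovered afterwards. The sketch uses explicit bytes objects (the session above is Python 2, where a plain string literal is already a byte string). to_text is a hypothetical helper.

```python
import json


def to_text(raw):
    """Decode arbitrary bytes without raising; byte N becomes code point N."""
    return bytes(raw).decode("latin-1")


raw = b"\xc9\xc3UH\x89\xe5"
payload = json.dumps({"summary": to_text(raw)})  # default ensure_ascii=True
restored = json.loads(payload)["summary"].encode("latin-1")
```

The decoded text is not necessarily what the string means, only a lossless stand-in for its bytes that JSON can carry.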
>>>>>
>>>>> I tried to reproduce this crash, but it seems that the crash might be
>>>>> related to some internal STL implementation we are using. I will see if I
>>>>> can narrow it down to a small repro later.
>>>>>
>>>>> Thanks
>>>>> Jeffrey
>>>>>
>>>>> On Sun, Mar 27, 2016 at 2:49 PM, Siva Chandra <sivachandra at gmail.com>
>>>>> wrote:
>>>>> On Sat, Mar 26, 2016 at 11:58 PM, Jeffrey Tan <jeffrey.fudan at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Btw: after patching with Siva's fix http://reviews.llvm.org/D18008, the
>>>>> first field 'small_' is fixed, however the second field 'ml_' still emits
>>>>> garbage:
>>>>>
>>>>> (lldb) fr v corpus
>>>>> (const string &const) corpus = error: summary string parsing error: {
>>>>>  store_ = {
>>>>>     = {
>>>>>      small_ = "www"
>>>>>      ml_ = (data_ =
>>>>>
>>>>> "��UH\x89�H�}�H\x8bE�]ÐUH\x89�H��H\x89}�H\x8bE�H\x89��~\xb4��\x90��UH\x89�SH\x83�H\x89}�H�u�H�E�H���\x9e���H\x8b\x18H\x8bE�H���O\xb4��H\x89ƿ\b",
>>>>> size_ = 0, capacity_ = 1441151880758558720)
>>>>>    }
>>>>>  }
>>>>> }
>>>>>
>>>>>
>>>>> Do you still see the DW_TAG_member related error?
>>>>>
>>>>> A wild (and really wild at that) guess: Is it utf16 data that is being
>>>>> decoded as utf8?
>>>>>
>>>>> As David Blaikie mentioned on the other thread, it would really help
>>>>> if you provided us with a minimal example to repro this, or at least
>>>>> repro instructions.
>>>>>
>>>>> _______________________________________________
>>>>> lldb-dev mailing list
>>>>> lldb-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> *- Enrico*
>>>>> 📩 egranata@.com ☎️ 27683
>>>>>
>>>>>
>