[LLVMdev] Regular Expression lib support

Sun Aug 23 21:11:28 PDT 2009

On Sun, Aug 23, 2009 at 8:28 PM, Chris Lattner<clattner at apple.com> wrote:
>
> On Aug 23, 2009, at 5:50 PM, OvermindDL1 wrote:
>
>> On Sun, Aug 23, 2009 at 6:32 PM, Daniel Dunbar<daniel at zuster.org> wrote:
>>>
>>> This is too heavy, and we don't need the extra features, and regexec
>>> is well tested and much more standard. Unless there is an overwhelming
>>
>> 'regexec' I had never heard of, figured it was a library, turns out it
>> is a function call on *nix systems, yea, that is very much not usable
>> in any way shape or form, and is certainly not a standard if it does
>> not work on one of the major LLVM platforms (and it is still not a
>> standard in any pure form since it is not part of the C/C++ standard
>> headers).  If that is option #2, then option #2 is very unusable.
>>
>> And yes, if you must know, I program on Windows, which is why I am
>> pushing to use something that actually works everywhere instead of
>> just someone's favorite OS (I prefer BSD honestly, but Windows is what
>> the desktop world is sadly stuck on, so that is what I have to program
>> for).
>
> I think you're seriously confused about the proposal.  To put it bluntly,
> there is no way we'll use boosts regex support, sorry.
>
> The proposal is to use the unix standard regexec library interface.  The
> LLVM tree would include an imported BSD-licenced implementation from one of
> many sources.  We'd then have configury logic detect when the host OS
> already supports the regexec interfaces, and if so, don't build our imported
> copy.
>
> We'd have a simple layer on top of it to make the interface to the regex
> library less horrible than what regexec provides.
>
> Again, forget boost regex. :)

What about std::regex?  It is even now still a lot more standard than
regexec (which does not exist on my platform).  Boost just happens to
have a fully standards-conforming implementation of it (as does
Dinkumware too, I think...).

In comparison (since you mentioned horrible interfaces), how about
something simple: take a float, followed by a pipe, followed by a
comma-separated list of integers, and parse it into a myPair, where
  typedef std::pair<float,vector<int> > myPair;
Also, I might be wrong on some of the regex syntax; it has been a
*long* time since I last touched regex inside a program (ever since I
started using boost::spirit I only use it for grep and its kin), but it
should get the idea across.  I use this regex for each of the things
below:
  std::string regexStr("(-?\\d+(?:\\.\\d*)?)\\|(-?\\d+)(?:,(-?\\d+))*");
// I hope this is correct...
  std::string testStr("3.14|1,2,3,42,128");

For note, the "(-?\\d+(?:\\.\\d*)?)\\|(-?\\d+)(?:,(-?\\d+))*" regex does
not check that each matched integer fits inside an int, so it does not
account for overflow (I could set a max repeat on the digits as well,
but then the limit is either too short to cover every valid int, or
long enough that it could still overflow).
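
Since the regex itself cannot do the range check, it has to happen when
the matched digits are converted.  A minimal sketch of one way to do
that with strtol follows; the helper name to_int_checked is made up for
illustration and is not part of any of the libraries discussed here.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts an already-matched digit string to int.
// Returns false if the text is not entirely a number or does not fit in int.
bool to_int_checked(const std::string &digits, int &out)
{
  errno = 0;
  char *end = 0;
  const long value = std::strtol(digits.c_str(), &end, 10);

  // Reject empty input and trailing garbage.
  if (end == digits.c_str() || *end != '\0')
    return false;

  // strtol reports long overflow through ERANGE; the explicit bounds
  // cover platforms where long is wider than int.
  if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
    return false;

  out = static_cast<int>(value);
  return true;
}
```

Any of the parsers below could call something like this instead of a
bare atoi if overflow matters.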

regexec/or_whatever_in_regex.h_that_is_*nix_only (*nix standard, not
anywhere else):  // dynamic regex, so it is slow in comparison to
other alternatives
  I have no clue how to use this one; to be honest, I do not know the
syntax.  Could anyone else show me how for this example?  I doubt it
would be any shorter or faster to write (and especially to execute)
than the spirit example, for note.
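
For reference, here is a rough sketch of how the same parse might look
with the POSIX regcomp/regexec interface from <regex.h>.  The names
parse_with_regexec and FloatAndInts are invented for this example, and
the pattern is an assumption of mine: POSIX extended syntax has no \d
or (?:...), so it uses [0-9] and plain groups.  regexec reports only
one match per group, so the integer list is captured as a whole and
split by hand.

```cpp
#include <regex.h>

#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<float, std::vector<int> > FloatAndInts;  // hypothetical alias

bool parse_with_regexec(const std::string &text, FloatAndInts &out)
{
  // Group 1: the float.  Group 3: the whole comma-separated integer list.
  const char *pattern =
      "^(-?[0-9]+(\\.[0-9]*)?)\\|(-?[0-9]+(,-?[0-9]+)*)$";

  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return false;

  regmatch_t m[5];  // whole match plus four groups
  const bool matched = regexec(&re, text.c_str(), 5, m, 0) == 0;
  regfree(&re);
  if (!matched)
    return false;

  const std::string number = text.substr(m[1].rm_so, m[1].rm_eo - m[1].rm_so);
  out.first = static_cast<float>(std::atof(number.c_str()));

  // Walk the list: strtol stops at each comma, which is then skipped.
  out.second.clear();
  const char *p = text.c_str() + m[3].rm_so;
  const char *end = text.c_str() + m[3].rm_eo;
  while (p < end)
  {
    char *next = 0;
    out.second.push_back(static_cast<int>(std::strtol(p, &next, 10)));
    p = next;
    if (p < end && *p == ',')
      ++p;
  }
  return true;
}
```

This only builds where <regex.h> is available, which is exactly the
portability question under discussion.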

std/tr1/boost::regex (the C++ standard):  // dynamic regex, so it is
slow in comparison to other alternatives
// match_extra and captures() are Boost.Regex extensions (build with
// BOOST_REGEX_MATCH_EXTRA defined); std/tr1 regex keeps only the last
// repetition of a group.
bool parse_test(const std::string &testStr, myPair &ret)
{
  smatch m;
  regex e(regexStr);
  bool successful = regex_match(testStr.begin(),testStr.end(),m,e,match_extra);
  if(successful)
  {
    vector<int> &i_list = ret.second;
    ret.first = atof(m[1].str().c_str());
    // Groups 2 and 3 hold the integers; collect every repetition of each.
    for(int g = 2; g <= 3; ++g)
    {
      for(size_t i = 0; i < m.captures(g).size(); ++i)
      {
        i_list.push_back(atoi(m.captures(g)[i].str().c_str()));
      }
    }
  }
  return successful;
}
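
For comparison, a sketch of the same parse using only std::regex from
<regex>, assuming a compiler that ships it (it was standardized in
C++11).  Because std::regex keeps only the last repetition of a
capture group, this version validates the overall shape with one
pattern and then walks the integer list with std::sregex_iterator.
The names parse_with_std_regex and FloatAndInts, and both patterns,
are assumptions made for this example.

```cpp
#include <cstdlib>
#include <regex>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<float, std::vector<int> > FloatAndInts;  // hypothetical alias

bool parse_with_std_regex(const std::string &text, FloatAndInts &out)
{
  // Group 1: the float.  Group 2: the whole comma-separated integer list.
  static const std::regex shape(
      "(-?\\d+(?:\\.\\d*)?)\\|(-?\\d+(?:,-?\\d+)*)");
  static const std::regex one_int("-?\\d+");

  std::smatch m;
  if (!std::regex_match(text, m, shape))
    return false;

  out.first = static_cast<float>(std::atof(m[1].str().c_str()));

  // Copy the list so the iterators below refer to a string that outlives them.
  const std::string list = m[2].str();
  out.second.clear();
  for (std::sregex_iterator it(list.begin(), list.end(), one_int), end;
       it != end; ++it)
  {
    out.second.push_back(std::atoi(it->str().c_str()));
  }
  return true;
}
```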

boost::xpressive(dynamic): // this uses the same back-end as
std/tr1/boost::regex, except you can freely mix and match dynamic and
static xpressive regexes, so refer to std/tr1/boost::regex

boost::xpressive(static):  // This will be faster than all of the
above methods as the regex parse tree is created at compile-time
instead of run-time
// Needs <boost/xpressive/xpressive.hpp> and
// <boost/xpressive/regex_actions.hpp>; as<T>() is xpressive's lazy cast.
bool parse_test(const std::string &testStr, myPair &ret)
{
  sregex e = (s1=(!as_xpr('-')>>+_d>>!('.'>>*_d)))[ref(ret.first)=as<float>(s1)]
    >> '|' >> (s2=(!as_xpr('-')>>+_d))[push_back(ref(ret.second),as<int>(s2))]
    >> *(',' >> (s2=(!as_xpr('-')>>+_d))[push_back(ref(ret.second),as<int>(s2))]);
  bool successful = regex_match(testStr.begin(),testStr.end(),e);
  return successful;
}

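spirit::qi:  // This will be the fastest, quite literally no other
parser has got close to the execution speed that this one can parse at
for anything like this, and it is easy to write!
// Needs <boost/fusion/include/std_pair.hpp> so the std::pair can be
// filled directly; parse() takes its first iterator by reference.
bool parse_test(const std::string &testStr, myPair &ret)
{
  std::string::const_iterator first = testStr.begin();
  return parse(first,testStr.end(),
    float_ >> '|' >> int_%',',
    ret);
}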

And as stated, the spirit version will execute faster, is very easy to
write, is very readable, and has adapters so it can stuff its results
into any STL container (or anything that supports insert or push_back)
or any generic struct/class (thanks to fusion), and it is simple to
create new things to do just about anything you want.
Even something as simple as
parse(first,last,int_,myInt); executes faster than atoi.  And of
course, you cannot beat the license.

Daniel just responded, so let me add based on his post.  I knew regex
was slow, but if it is as slow as he is saying, that is even more
incentive not to use it, on top of how unusable it is on some
platforms.  If you want speed, I dare you to show me anything that
beats Spirit2.1, even ignoring its ease of use and the fact it works
everywhere.