[LLVMdev] Regular Expression lib support

Daniel Berlin dberlin at dberlin.org
Sun Aug 23 21:01:44 PDT 2009


On Sun, Aug 23, 2009 at 5:56 PM, Daniel Dunbar<daniel at zuster.org> wrote:
> We would like to have access to some kind of regular expression
> library inside LLVM. For example, we need this to extend the FileCheck
> test case checking tool to support regular expressions.
>
> There are three obvious options:
>  1. Roll our own library. Multiple unnamed individuals may even
> already have implementations lying around! :)
>

>  2. Use POSIX regcomp facilities. This implies importing some
> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
> we would try to use the platform version if available (and non-buggy).

Don't do it.
They are ridiculous slow, and posix made some really dumb choices in regexps.

>
>  3. Import a more heavy weight library such as PCRE, and use it universally.
PCRE is actually quite slow too.
Google released (as part of V8, I believe) an almost-always
linear-time regular expression library (written by Russ Cox and Rob
Pike).

It's quite small code wise (unlike boost), quite fast (and supports
perl style regexps).  I'll see if i can get a pointer to it.

>
> My personal preference is #2, and I may work on this in the near-term,
> but I wanted to check for conflicting opinions first.
>
> My main reasons for preferring #2 is to use a standard interface and
> language, and it should be relatively easy to implement and maintain
> (I'm assuming there is some nice BSD implementation out their we can
> snarf, I haven't actually validated this assumption).
Warning: All the builtin platform ones are actually fairly slow, and
some are quite buggy, even between different versions of the same
platform (IE different glibc versions, etc) Google went through this
before originally settling on something that was at least consistently
buggy on all platforms (PCRE) and then moving to writing one that was
much better (what was released in V8, though i think that one may be
different than the exact one used internally).  PCRE could easily blow
the stack in some cases, sadly.

FWIW: Most people are much more familiar with perl  tyle regexps, than
they are familiar with posix style.

That said, if you really have the urge to have a BSD'd implementation
of regexec, at least choose something like http://laurikari.net/tre/
which is a linear time in size of string.

I've never performance timed it myself, but Russ Cox says it is "efficient" ;)

The Google impl was, at least if i remember right, made up of 3
engines: 1 that was a special case matcher that handled plain old
anchored strings super fast, 1 that dealt with regular expressions
that did not require backtracking, and 1 that dealt with regular
expressions that did.

If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the
other libraries can get on fairly simple regexps because they don't
use linear time matchers.




More information about the llvm-dev mailing list