[LLVMdev] Regular Expression lib support

Sun Aug 23 21:28:18 PDT 2009

On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:
>>  2. Use POSIX regcomp facilities. This implies importing some
>> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
>> we would try to use the platform version if available (and non- 
>> buggy).
>
> Don't do it.
> They are ridiculous slow, and posix made some really dumb choices in  
> regexps.

We want to use this from FileCheck, which we build at -O0 today.   
Also, each regex will be matched once.  Most testcases use fixed  
strings (in fact 100% of them do today!).  This really is not very  
performance sensitive.

Regex engines like this are inherently more powerful but slower than  
fixed-purpose matching logic.  I don't see a reason not to use a  
(slow!) simple regexec version.

I would also prefer not to have all the crazy features.  Just  
supporting simple matching stuff is perfectly acceptable.  We don't  
need unicode character classes, negative assertions, etc.  We don't  
need the full power of perl regex's.

>> My personal preference is #2, and I may work on this in the near- 
>> term,
>> but I wanted to check for conflicting opinions first.
>>
>> My main reasons for preferring #2 is to use a standard interface and
>> language, and it should be relatively easy to implement and maintain
>> (I'm assuming there is some nice BSD implementation out their we can
>> snarf, I haven't actually validated this assumption).
> Warning: All the builtin platform ones are actually fairly slow, and
> some are quite buggy, even between different versions of the same
> platform (IE different glibc versions, etc)

I am more concerned about bugginess, but I doubt that affects simple  
regexes.

> That said, if you really have the urge to have a BSD'd implementation
> of regexec, at least choose something like http://laurikari.net/tre/
> which is a linear time in size of string.

Nice, we need something for windows.

> The Google impl was, at least if i remember right, made up of 3
> engines: 1 that was a special case matcher that handled plain old
> anchored strings super fast, 1 that dealt with regular expressions
> that did not require backtracking, and 1 that dealt with regular
> expressions that did.

FWIW, I plan to have different syntax in filecheck for fixed string  
matching vs regex matching.  This should eliminate most of the common  
reasons for needing "leaning toothpicks".

> If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the
> other libraries can get on fairly simple regexps because they don't
> use linear time matchers.

Yep, I'm very familiar with that paper.  Thanks Danny,

-Chris