[LLVMdev] Regular Expression lib support

Török Edwin edwintorok at gmail.com
Sun Aug 23 23:59:09 PDT 2009


On 2009-08-24 07:28, Chris Lattner wrote:
> On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:
>   
>>>  2. Use POSIX regcomp facilities. This implies importing some
>>> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
>>> we would try to use the platform version if available (and
>>> non-buggy).
>>>       
>> Don't do it.
>> They are ridiculous slow, and posix made some really dumb choices in  
>> regexps.
>>     
>
> We want to use this from FileCheck, which we build at -O0 today.   
> Also, each regex will be matched once.  Most testcases use fixed  
> strings (in fact 100% of them do today!).  This really is not very  
> performance sensitive.
>
> Regex engines like this are inherently more powerful but slower than  
> fixed-purpose matching logic.  I don't see a reason not to use a  
> (slow!) simple regexec version.
>   

I agree with Daniel Berlin: the system's regcomp/regexec shouldn't be used.

"Slow" can mean taking 30 minutes, or hanging indefinitely, on some
platforms with a buggy regcomp/regexec.

Some examples:
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=497
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=598
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=635
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=658
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=679
http://bugs.opensolaris.org/bugdatabase/printableBug.do?bug_id=4346175

That is why ClamAV uses the OpenBSD implementation of regcomp/regexec,
regardless of whether the system provides its own.
The code is fairly small (~100k), BSD licensed, easy to make portable
(memmove->memcpy, pull in a strlcpy impl., etc.), and its execution time
doesn't explode.
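
For reference, the interface being discussed is small. Below is a minimal
sketch of a single match through the POSIX regcomp/regexec calls, which a
bundled implementation would expose in the same form. The helper name
matches_once and its return convention are made up for illustration; they
are not taken from ClamAV, OpenBSD or LLVM.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library mentioned in this thread).
 * Returns 1 if `text` matches `pattern`, 0 if it does not,
 * and -1 if the pattern fails to compile or matching reports an error.
 */
int matches_once(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED selects POSIX extended syntax;
       REG_NOSUB says we only want a yes/no answer, no submatch offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB the nmatch/pmatch arguments are ignored. */
    rc = regexec(&re, text, 0, NULL, 0);

    /* Always release the compiled pattern, whatever the outcome. */
    regfree(&re);

    if (rc == 0)
        return 1;
    if (rc == REG_NOMATCH)
        return 0;
    return -1;
}
```

Since the caller only sees regcomp, regexec and regfree, the same code can
be linked against either the platform's implementation or a bundled one.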

I wasn't aware of Google's regexp library at that time (perhaps it
didn't exist yet), but if it provides linear execution time that sounds
good too.

If LLVM is going to have an integrated regex library, I suggest using it
regardless of whether the platform has one.
The LLVM integrated regex library will provide consistent behaviour and
execution time; the system one will not.

Best regards,
--Edwin


