[LLVMdev] Regular Expression lib support

Török Edwin edwintorok at gmail.com
Tue Aug 25 13:56:12 PDT 2009


On 2009-08-25 21:18, Daniel Dunbar wrote:
> Woot! Thanks a bunch Edwin!
>
> Some comments on the patch:
> --
> I'm not sure if it makes sense to import the man pages, if we only
> expose Regex.h.
>   

I'd like to keep re_format.7, it describes the format of the regex as
accepted by this implementation.
I'll remove regex.3 since its not exposed.

>   
>> diff --git a/include/llvm/Support/Regex.h b/include/llvm/Support/Regex.h
>> new file mode 100644
>> index 0000000..314bff4
>> --- /dev/null
>> +++ b/include/llvm/Support/Regex.h
>> @@ -0,0 +1,49 @@
>> +//===-- Regex.h - Regular Expression matcher implementation -------------===//
>>     
>
> Include C++ marker please (*- C++ -*-===//)
>
>   
>> +//
>> +//                     The LLVM Compiler Infrastructure
>> +//
>> +// This file is distributed under the University of Illinois Open Source
>> +// License. See LICENSE.TXT for details.
>> +//
>> +//===----------------------------------------------------------------------===//
>> +//
>> +// This file implements a POSIX regular expression matcher.
>> +//
>> +//===----------------------------------------------------------------------===//
>> +#include "llvm/Support/DataTypes.h"
>> +#include "llvm/Support/regex.h"
>>     
>
> I would like to get rid of this include. What if match_sub just took a
> SmallVector<std::pair<size_t, size_t>> to use for the matchs (or some
> equivalent). I think that the regex n_subs field has the number of matches, so
> match_sub can do the right thing.
>   

Ok.

>   
>> +namespace llvm {
>> +  class Regex {
>> +  public:
>> +    enum {
>> +      // Compile with recognition of all special characters turned off.
>> +      // All characters are thus considered ordinary,
>> +      // so the RE is a literal string.
>> +      NOSPEC=1,
>> +      // Compile for matching that ignores upper/lower case distinctions.
>> +      ICASE=2,
>> +      // Compile for matching that need only report success or failure,
>> +      // not what was matched.
>> +      NOSUB=4,
>> +      // Compile for newline-sensitive matching. With this flag '[^' bracket
>> +      // expressions and '.' never match newline. A ^ anchor matches the
>> +      // null string after any newline in the string in addition to its normal
>> +      // function, and the $ anchor matches the null string before any
>> +      // newline in the string in addition to its normal function.
>> +      NEWLINE=8
>> +    };
>>     
>
> Can you make these doxyments? Also, I'd prefer more LLVM style and less Unix
> style enum names, IgnoreCase for example.
>
> Does NOSPEC actually ever make sense to use? Wouldn't clients just use string
> compare?
>   

Indeed, it is worthless for LLVM.
There's also REG_PEND (non-POSIX, but this impl. supports it) that
allows matching patterns
with embedded NULs. Would that be needed/useful for LLVM?

>   
>> +
>> +    Regex(const char *regex, unsigned Flags=NOSUB);
>> +    ~Regex();
>>     
>
> I think this should have an "bool isValid(std::string &Error)" or similar
> method, which would return the regex error if one was found during construction.
>   

Ok.

>   
>> +    // notbol: The match-beginning-of-line operator always fails to match,
>> +    //    unless regex was compiled with NEWLINE flag.
>> +    // noteol:  The match-end-of-line operator always fails to match.
>> +    //    unless regex was compiled with NEWLINE flag.
>> +
>> +    bool matches(const char *string, bool notbol=false, bool noteol=false);
>> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
>> +                   unsigned nmatch, bool notbol=false, bool noteol=false);
>>     
>
> These should be doxymented, and I don't understand the flags. Does this mean
> that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then
> notbol/noteol need to be false? Do we care about use cases that care about this?
> Could we just always have them as false?
>   

They are meant for cases where you call matches() on portions of a
larger string.
When you call it at offset 0, you want notbol=false, but when you call
it at offset > 0 you want notbol=true.
Ditto for noteol, when the portion of string passed to matches() doesn't
contain the real end of the string.

If we are always passing the entire string we intend to match to
matches() then we can remove those bools.

This is the manpage:
REG_NEWLINE
"Compile for newline-sensitive matching. By default, newline is a
completely ordinary character with no special
meaning in either REs or strings. With this flag,. [^ bracket
expressions and . never match newline,
a ^ achor matches the null string after any newline in the string in
addition to its normal function,
and the $ anchor matches the null string before any newline in the
string in addition to its normal function.

REG_NOTBOL
The first character of the stringis not the beginning of a line, so the
^ anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE .
REG_NOTEOL
The NUL terminating the string does not end a line, so the $ anchor
should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE .

> As mentioned before I think we can change the matches to be returned in a small
> vector or similar and avoid the dependency on llvm_regmatch_t.
>
>   

Yes, that sounds good.

> What about this:
>
>
> /// match - Match the regex against the given \arg String.
> ///
> /// \param Matches - If given, on a successful match this will be filled in with
> /// references to the matched group expressions (inside \arg String).
> /// \param AllowBeginOfLine - ... again I think we should kill this if
> possible ...
> /// \param AllowEndOfLine - ... again I think we should kill this if
> possible ...
> /// \return - True on successful match.
> bool match(const char *String, SmallVectorImpl<StringRef> *Matches,
>              bool AllowBeginOfLine = true, bool AllowEndOfLine = true);
>
>   
>> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
>> +                   unsigned nmatch, bool notbol=false, bool noteol=false);
>>     
>
>
>   
>> +  private:
>> +    llvm_regex_t preg;
>> +  };
>> +}
>>     
>
>   
>> diff --git a/include/llvm/Support/regex.h b/include/llvm/Support/regex.h
>>     
>
> This won't work on Windows as stands (Regex.h == regex.h), but I think we should
> just keep regex.h private which solves the problem (still nice to rename it to
> avoid confusion, llvm_regex.h would be consistent).
>   

Ok, I'll try to make it private.

> Otherwise this looks great to me and compiled file on Darwin. I too am of the
> opinion that we should avoid all configure magicness and just always use this, I
> don't see substantial value in trying to use the operating system one. If we
> did, it would be an obvious extension to Regex.cpp, and nothing else.
>   

I don't see any value in using the system provided one either.


Best regards,
--Edwin




More information about the llvm-dev mailing list