[LLVMdev] Regular Expression lib support

Daniel Dunbar daniel at zuster.org
Tue Aug 25 11:18:46 PDT 2009


Woot! Thanks a bunch Edwin!

Some comments on the patch:
--
I'm not sure if it makes sense to import the man pages, if we only
expose Regex.h.

> diff --git a/include/llvm/Support/Regex.h b/include/llvm/Support/Regex.h
> new file mode 100644
> index 0000000..314bff4
> --- /dev/null
> +++ b/include/llvm/Support/Regex.h
> @@ -0,0 +1,49 @@
> +//===-- Regex.h - Regular Expression matcher implementation -------------===//

Include C++ marker please (*- C++ -*-===//)

> +//
> +//                     The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
> +//===----------------------------------------------------------------------===//
> +//
> +// This file implements a POSIX regular expression matcher.
> +//
> +//===----------------------------------------------------------------------===//
> +#include "llvm/Support/DataTypes.h"
> +#include "llvm/Support/regex.h"

I would like to get rid of this include. What if match_sub just took a
SmallVector<std::pair<size_t, size_t>> to use for the matchs (or some
equivalent). I think that the regex n_subs field has the number of matches, so
match_sub can do the right thing.

> +namespace llvm {
> +  class Regex {
> +  public:
> +    enum {
> +      // Compile with recognition of all special characters turned off.
> +      // All characters are thus considered ordinary,
> +      // so the RE is a literal string.
> +      NOSPEC=1,
> +      // Compile for matching that ignores upper/lower case distinctions.
> +      ICASE=2,
> +      // Compile for matching that need only report success or failure,
> +      // not what was matched.
> +      NOSUB=4,
> +      // Compile for newline-sensitive matching. With this flag '[^' bracket
> +      // expressions and '.' never match newline. A ^ anchor matches the
> +      // null string after any newline in the string in addition to its normal
> +      // function, and the $ anchor matches the null string before any
> +      // newline in the string in addition to its normal function.
> +      NEWLINE=8
> +    };

Can you make these doxyments? Also, I'd prefer more LLVM style and less Unix
style enum names, IgnoreCase for example.

Does NOSPEC actually ever make sense to use? Wouldn't clients just use string
compare?

> +
> +    Regex(const char *regex, unsigned Flags=NOSUB);
> +    ~Regex();

I think this should have an "bool isValid(std::string &Error)" or similar
method, which would return the regex error if one was found during construction.

> +    // notbol: The match-beginning-of-line operator always fails to match,
> +    //    unless regex was compiled with NEWLINE flag.
> +    // noteol:  The match-end-of-line operator always fails to match.
> +    //    unless regex was compiled with NEWLINE flag.
> +
> +    bool matches(const char *string, bool notbol=false, bool noteol=false);
> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
> +                   unsigned nmatch, bool notbol=false, bool noteol=false);

These should be doxymented, and I don't understand the flags. Does this mean
that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then
notbol/noteol need to be false? Do we care about use cases that care about this?
Could we just always have them as false?

As mentioned before I think we can change the matches to be returned in a small
vector or similar and avoid the dependency on llvm_regmatch_t.

What about this:


/// match - Match the regex against the given \arg String.
///
/// \param Matches - If given, on a successful match this will be filled in with
/// references to the matched group expressions (inside \arg String).
/// \param AllowBeginOfLine - ... again I think we should kill this if
possible ...
/// \param AllowEndOfLine - ... again I think we should kill this if
possible ...
/// \return - True on successful match.
bool match(const char *String, SmallVectorImpl<StringRef> *Matches,
             bool AllowBeginOfLine = true, bool AllowEndOfLine = true);

> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
> +                   unsigned nmatch, bool notbol=false, bool noteol=false);


> +  private:
> +    llvm_regex_t preg;
> +  };
> +}

> diff --git a/include/llvm/Support/regex.h b/include/llvm/Support/regex.h

This won't work on Windows as stands (Regex.h == regex.h), but I think we should
just keep regex.h private which solves the problem (still nice to rename it to
avoid confusion, llvm_regex.h would be consistent).

Otherwise this looks great to me and compiled file on Darwin. I too am of the
opinion that we should avoid all configure magicness and just always use this, I
don't see substantial value in trying to use the operating system one. If we
did, it would be an obvious extension to Regex.cpp, and nothing else.
--

Thanks again for working on this!!!

 - Daniel

2009/8/25 Török Edwin <edwintorok at gmail.com>:
> On 2009-08-24 20:14, Chris Lattner wrote:
>> On Aug 23, 2009, at 11:59 PM, Török Edwin wrote:
>>> If LLVM is going to have an integrated regex library I suggest using it
>>> regardless if the platform has one.
>>> The LLVM integrated regex library will provide consistent behaviour and
>>> execution time, the system one will not.
>>
>> Hi Edwin,
>>
>> Can you propose the openbsd implementation as a patch to lib/support?
>>
>> -Chris
>
> Attached is a proposed patch.
> The Regex class is a fairly simple wrapper around the
> llvm_regcomp/llvm_regexec functions,
> maybe the interface could be improved, right now I just focused on
> getting it to work.
> Also the OpenBSD implementation has some extensions (see docs/regex.7),
> if they are useful those can be added to the Regex class also.
>
> A summary of changes from OpenBSD version:
>  - rename functions and types to avoid clashes with system one:
> regcomp->llvm_regcomp, ..., regex_t -> llvm_regex_t, regoff_t ->
> llvm_regoff_t, BAD -> REGEX_BAD ...
>  - include strlcpy.c (llvm_strlcpy), a configure check could be added to
> use system's one, but the function is so simple I just unconditionally
> always use it
>  - memcpy -> memmove (parameters may overlap, BSD memcpy allows that but
> POSIX doesn't)
>  - COPYRIGHT -> COPYRIGHT.regex, engine.c -> engine.inc, regex.h ->
> llvm/Support/regex.h
>  - char* -> const char* to avoid compiler warnings
>  - #define DUPMAX
>
> There is also a unittests/Support/RegexTest.cpp that does some very
> simple tests.
>
> Let me know how it works on your platform.
>
> Best regards,
> --Edwin
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>




More information about the llvm-dev mailing list