[cfe-dev] [libcxx] Multiline regular expression matching

Howard Hinnant hhinnant at apple.com
Sat Mar 26 13:07:04 PDT 2011


On Mar 23, 2011, at 9:42 PM, Jonathan Sauer wrote:

> Hello,
> 
> N3242 is silent on the issue of multiline regular expression matching, i.e. whether ^ and $ match only the beginning
> and end of the string, respectively, or whether they also match occurrences of \n or \r inside the string. It is only
> possible to turn matching the former off (via match_not_bol and match_not_eol, respectively). ECMAScript, on
> whose regular expressions the C++0x regex library is based (among others), provides an additional flag in the
> RegExp constructor to turn on multiline matching (see also <http://www.regular-expressions.info/anchors.html>
> for more information on multiline matching).
> 
> I looked into previous standard committee documents about regular expressions, but was unable to find anything
> regarding this issue.
> 
> I then tried my hand at a workaround using a non-captured disjunction in the following test program, using
> libc++ trunk:
> 
> // /opt/bin/clang -std=c++0x -stdlib=libc++ -lc++ clang.cpp
> #include <cstdio>
> #include <regex>
> #include <string>
> 
> static const std::regex	INCLUDE_REGEXP("(?:^|[\\n\\r])#include\\s*<([^>]+)>");
> 
> static const std::string	s = 
> 	"attribute vec3	vertexUV0;\n"
> 	"#include <shaders/include/Lighting.glsl>\n"
> 	"#include <shaders/include/ProjectTextureOnCube.glsl>\n"
> 	"uniform mat4	mvp;\n";
> 
> 
> int main(int, char**)
> {
> 	std::sregex_iterator		it(s.begin(), s.end(), INCLUDE_REGEXP);
> 	std::sregex_iterator const	end;
> 	if (it == end)
> 	{
> 		std::printf("Not found\n");
> 	}
> 	else
> 	{
> 		while (it != end)
> 		{
> 			std::printf("Found '%s [%s]'\n", it->str().c_str(), it->str(1).c_str());
> 			++it;
> 		}
> 	}
> }
> 
> 
> This resulted in the output
> 
> 	"Not found".
> 
> Exchanging the disjunction's alternatives ("(?:^|[\\n\\r])" => "(?:[\\n\\r]|^)"), resulted in a (seemingly)
> endless stream of
> 	Found ' []'
> 	Found ' []'
> 	Found ' []'
> 	Found ' []'
> 	...
> 
> Removing the disjunction results in two matches (as expected):
> 	Found '#include <shaders/include/Lighting.glsl> [shaders/include/Lighting.glsl]'
> 	Found '#include <shaders/include/ProjectTextureOnCube.glsl> [shaders/include/ProjectTextureOnCube.glsl]'
> 
> 
> From my reading of the ECMAScript standard (<http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf>),
> the above regular expressions are (at least syntactically) valid. So I have the following questions:
> 
> - Is libc++'s current behaviour a bug?

I believe it is a libc++ bug.  I've committed a fix in revision 128350.

> - Is there another, simpler way to perform multiline matching using std::regex?

Your way looks as good as any to me.
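
For reference, the disjunction workaround can be wrapped in a small helper. This is only a minimal sketch, assuming a standard library in which the alternation bug described above is fixed. The function name find_includes is hypothetical and not part of libc++ or of the original test program.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): collects the path of every
// "#include <...>" directive that begins a line of `text`.
std::vector<std::string> find_includes(const std::string& text)
{
    // (?:^|[\n\r]) emulates multiline matching of ^: the directive must sit
    // at the very start of the input or directly after a line break.
    static const std::regex include_re("(?:^|[\\n\\r])#include\\s*<([^>]+)>");

    std::vector<std::string> paths;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), include_re); it != end; ++it)
    {
        // Sub-match 1 is the text between the angle brackets.
        paths.push_back(it->str(1));
    }
    return paths;
}
```

Note that when a match is found via the [\n\r] alternative, the whole match (it->str()) includes the leading line break, so callers should rely on the capture group rather than on the full match.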

Thanks for bringing this to our attention.

-Howard



