[cfe-dev] [libcxx] Multiline regular expression matching

Jonathan Sauer jonathan.sauer at gmx.de
Wed Mar 23 13:42:15 PDT 2011


Hello,

N3242 is silent on the issue of multiline regular expression matching, i.e. whether ^ and $ match only the
beginning and end of the string, respectively, or whether they also match at occurrences of \n or \r inside the
string. It is only possible to turn the former off (via match_not_bol and match_not_eol, respectively).
ECMAScript, on whose regular expressions the C++0x regex library is based (among others), provides an
additional flag in the RegExp constructor to turn on multiline matching (see also
<http://www.regular-expressions.info/anchors.html> for more information on multiline matching).

I looked into previous standards committee documents about regular expressions, but was unable to find anything
regarding this issue.

I then tried my hand at a workaround using a non-captured disjunction in the following test program, using
libc++ trunk:

// /opt/bin/clang -std=c++0x -stdlib=libc++ -lc++ clang.cpp
#include <cstdio>	// std::printf
#include <regex>
#include <string>	// std::string

static const std::regex	INCLUDE_REGEXP("(?:^|[\\n\\r])#include\\s*<([^>]+)>");

static const std::string	s = 
	"attribute vec3	vertexUV0;\n"
	"#include <shaders/include/Lighting.glsl>\n"
	"#include <shaders/include/ProjectTextureOnCube.glsl>\n"
	"uniform mat4	mvp;\n";


int main(int, char**)
{
	std::sregex_iterator		it(s.begin(), s.end(), INCLUDE_REGEXP);
	std::sregex_iterator const	end;
	if (it == end)
	{
		std::printf("Not found\n");
	}
	else
	{
		while (it != end)
		{
			std::printf("Found '%s [%s]'\n", it->str().c_str(), it->str(1).c_str());
			++it;
		}
	}
}


This resulted in the output

	"Not found".

Exchanging the disjunction's alternatives ("(?:^|[\\n\\r])" => "(?:[\\n\\r]|^)") resulted in a (seemingly)
endless stream of
	Found ' []'
	Found ' []'
	Found ' []'
	Found ' []'
	...

Removing the disjunction results in two matches (as expected):
	Found '#include <shaders/include/Lighting.glsl> [shaders/include/Lighting.glsl]'
	Found '#include <shaders/include/ProjectTextureOnCube.glsl> [shaders/include/ProjectTextureOnCube.glsl]'


From my reading of the ECMAScript standard (<http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf>),
the above regular expressions are (at least syntactically) valid. So I have the following questions:

- Is libc++'s current behaviour a bug?
- Is there another, simpler way to perform multiline matching using std::regex?


Thanks in advance,
Jonathan




