Commit 94da919

avar authored and gitster committed
grep: add support for PCRE v2
Add support for v2 of the PCRE API. This is a new major version of PCRE that came out in early 2015[1].

The regular expression syntax is the same, but while the API is similar, pretty much every function is either renamed or takes different arguments. Thus using it via entirely new functions makes sense, as opposed to trying to e.g. have one compile_pcre_pattern() that would call either PCRE v1 or v2 functions.

Git can now be compiled with either USE_LIBPCRE1=YesPlease or USE_LIBPCRE2=YesPlease, with USE_LIBPCRE=YesPlease currently being a synonym for the former. Providing both is a compile-time error.

With earlier patches to enable JIT for PCRE v1 the performance of the release versions of both libraries is almost exactly the same, with PCRE v2 being around 1% slower.

However, after I reported this to the pcre-dev mailing list[2] I got a lot of help with the API use from Zoltán Herczeg, who subsequently optimized some of the JIT functionality in v2 of the library.

Running the p7820-grep-engines.sh performance test against the latest Subversion trunk of both, with both them and git compiled as -O3, and the test run against linux.git, gives the following results. Just the /perl/ tests shown:

    $ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre2/inst/lib || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~5 HEAD~ HEAD p7820-grep-engines.sh
    [...]
    Test                                           HEAD~5            HEAD~                    HEAD
    -----------------------------------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                     0.31(1.10+0.48)   0.21(0.35+0.56) -32.3%   0.21(0.34+0.55) -32.3%
    7820.7: perl grep '^how to'                    0.56(2.70+0.40)   0.24(0.64+0.52) -57.1%   0.20(0.28+0.60) -64.3%
    7820.11: perl grep '[how] to'                  0.56(2.66+0.38)   0.29(0.95+0.45) -48.2%   0.23(0.45+0.54) -58.9%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'      1.02(5.77+0.42)   0.31(1.02+0.54) -69.6%   0.23(0.50+0.54) -77.5%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'         0.38(1.57+0.42)   0.27(0.85+0.46) -28.9%   0.21(0.33+0.57) -44.7%

See commit ("perf: add a comparison test of grep regex engines", 2017-04-19) for details on the machine the above test run was executed on.

Here HEAD~5 is git with PCRE v1 without JIT, HEAD~ is PCRE v1 with JIT, and HEAD is PCRE v2 (also with JIT). See previous commits of mine mentioning p7820-grep-engines.sh for more details on the test setup.

For ease of readability, a different run just of HEAD~ (PCRE v1 with JIT) vs. HEAD (PCRE v2), again with just the /perl/ tests shown:

    [...]
    Test                                           HEAD~             HEAD
    ----------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                     0.21(0.42+0.52)   0.21(0.31+0.58) +0.0%
    7820.7: perl grep '^how to'                    0.25(0.65+0.50)   0.20(0.31+0.57) -20.0%
    7820.11: perl grep '[how] to'                  0.30(0.90+0.50)   0.23(0.46+0.53) -23.3%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'      0.30(1.19+0.38)   0.23(0.51+0.51) -23.3%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'         0.27(0.84+0.48)   0.21(0.34+0.57) -22.2%

I.e. the two are either neck and neck, or PCRE v2 pulls ahead, and when it does it's around 20% faster.

A brief note on thread safety: As noted in pcre2api(3) & pcre2jit(3) the compiled pattern can be shared between threads, but not some of the JIT context. However, the grep threading support does all pattern & JIT compilation in separate threads, so this code doesn't need to concern itself with thread safety.

See commit 63e7e9d ("git-grep: Learn PCRE", 2011-05-09) for the initial addition of PCRE v1. This change follows some of the same patterns it did (and which were discussed on list at the time), e.g. mocking up types with typedef instead of ifdef-ing them out when USE_LIBPCRE2 isn't defined. This adds some trivial memory use to the program, but makes the code look nicer.

1. https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html
2. https://lists.exim.org/lurker/thread/20170419.172322.833ee099.en.html

Signed-off-by: Ævar Arnfjörð Bjarmason <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
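The "mock up types with typedef" pattern mentioned in the commit message can be sketched roughly as below. This is only an illustration under stated assumptions: `struct pattern_slot`, its members, and the two helper functions are hypothetical names invented for this example, and the commit's actual declarations live in a header that is not part of the diff shown on this page.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_LIBPCRE2
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#else
/*
 * Stand-in types: without the library, the struct below still
 * compiles unchanged, at the cost of a few unused pointer members.
 */
typedef int pcre2_code;
typedef int pcre2_match_data;
#endif

/* Hypothetical, heavily simplified stand-in for a pattern struct. */
struct pattern_slot {
	pcre2_code *compiled;
	pcre2_match_data *mdata;
};

/* Reset the slot so that "is a PCRE v2 pattern present?" is a NULL check. */
static void slot_init(struct pattern_slot *s)
{
	assert(s != NULL);
	s->compiled = NULL;
	s->mdata = NULL;
}

/* Non-zero once a compiled PCRE v2 pattern has been stored in the slot. */
static int slot_uses_pcre2(const struct pattern_slot *s)
{
	return s->compiled != NULL;
}
```

The alternative would be to wrap each member and every use of it in `#ifdef USE_LIBPCRE2`, which saves a few bytes per pattern but scatters conditionals through the code.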
1 parent fb95e2e commit 94da919

5 files changed, +256 -21 lines changed

Makefile

Lines changed: 28 additions & 8 deletions
@@ -29,6 +29,11 @@ all::
 # Perl-compatible regular expressions instead of standard or extended
 # POSIX regular expressions.
 #
+# Currently USE_LIBPCRE is a synonym for USE_LIBPCRE1, define
+# USE_LIBPCRE2 instead if you'd like to use version 2 of the PCRE
+# library. The USE_LIBPCRE flag will likely be changed to mean v2 by
+# default in future releases.
+#
 # When using USE_LIBPCRE1, define NO_LIBPCRE1_JIT if the PCRE v1
 # library is compiled without --enable-jit. We will auto-detect
 # whether the version of the PCRE v1 library in use has JIT support at
@@ -37,8 +42,10 @@ all::
 # you have link-time errors about a missing `pcre_jit_exec` define
 # this, or recompile PCRE v1 with --enable-jit.
 #
-# Define LIBPCREDIR=/foo/bar if your libpcre header and library files are in
-# /foo/bar/include and /foo/bar/lib directories.
+# Define LIBPCREDIR=/foo/bar if your PCRE header and library files are
+# in /foo/bar/include and /foo/bar/lib directories. Which version of
+# PCRE this points to is determined by the USE_LIBPCRE1 and USE_LIBPCRE2
+# variables.
 #
 # Define HAVE_ALLOCA_H if you have working alloca(3) defined in that header.
 #
@@ -1095,19 +1102,31 @@ ifdef NO_LIBGEN_H
 	COMPAT_OBJS += compat/basename.o
 endif
 
-ifdef USE_LIBPCRE
-	BASIC_CFLAGS += -DUSE_LIBPCRE1
-	ifdef LIBPCREDIR
-		BASIC_CFLAGS += -I$(LIBPCREDIR)/include
-		EXTLIBS += -L$(LIBPCREDIR)/$(lib) $(CC_LD_DYNPATH)$(LIBPCREDIR)/$(lib)
+USE_LIBPCRE1 ?= $(USE_LIBPCRE)
+
+ifneq (,$(USE_LIBPCRE1))
+	ifdef USE_LIBPCRE2
+$(error Only set USE_LIBPCRE1 (or its alias USE_LIBPCRE) or USE_LIBPCRE2, not both!)
 	endif
+
+	BASIC_CFLAGS += -DUSE_LIBPCRE1
 	EXTLIBS += -lpcre
 
 ifdef NO_LIBPCRE1_JIT
 	BASIC_CFLAGS += -DNO_LIBPCRE1_JIT
 endif
 endif
 
+ifdef USE_LIBPCRE2
+	BASIC_CFLAGS += -DUSE_LIBPCRE2
+	EXTLIBS += -lpcre2-8
+endif
+
+ifdef LIBPCREDIR
+	BASIC_CFLAGS += -I$(LIBPCREDIR)/include
+	EXTLIBS += -L$(LIBPCREDIR)/$(lib) $(CC_LD_DYNPATH)$(LIBPCREDIR)/$(lib)
+endif
+
 ifdef HAVE_ALLOCA_H
 	BASIC_CFLAGS += -DHAVE_ALLOCA_H
 endif
@@ -2252,7 +2271,8 @@ GIT-BUILD-OPTIONS: FORCE
 	@echo TAR=\''$(subst ','\'',$(subst ','\'',$(TAR)))'\' >>$@+
 	@echo NO_CURL=\''$(subst ','\'',$(subst ','\'',$(NO_CURL)))'\' >>$@+
 	@echo NO_EXPAT=\''$(subst ','\'',$(subst ','\'',$(NO_EXPAT)))'\' >>$@+
-	@echo USE_LIBPCRE1=\''$(subst ','\'',$(subst ','\'',$(USE_LIBPCRE)))'\' >>$@+
+	@echo USE_LIBPCRE1=\''$(subst ','\'',$(subst ','\'',$(USE_LIBPCRE1)))'\' >>$@+
+	@echo USE_LIBPCRE2=\''$(subst ','\'',$(subst ','\'',$(USE_LIBPCRE2)))'\' >>$@+
 	@echo NO_LIBPCRE1_JIT=\''$(subst ','\'',$(subst ','\'',$(NO_LIBPCRE1_JIT)))'\' >>$@+
 	@echo NO_PERL=\''$(subst ','\'',$(subst ','\'',$(NO_PERL)))'\' >>$@+
 	@echo NO_PTHREADS=\''$(subst ','\'',$(subst ','\'',$(NO_PTHREADS)))'\' >>$@+

configure.ac

Lines changed: 65 additions & 12 deletions
@@ -255,21 +255,61 @@ GIT_PARSE_WITH([openssl]))
 # Perl-compatible regular expressions instead of standard or extended
 # POSIX regular expressions.
 #
-# Define LIBPCREDIR=/foo/bar if your libpcre header and library files are in
+# Currently USE_LIBPCRE is a synonym for USE_LIBPCRE1, define
+# USE_LIBPCRE2 instead if you'd like to use version 2 of the PCRE
+# library. The USE_LIBPCRE flag will likely be changed to mean v2 by
+# default in future releases.
+#
+# Define LIBPCREDIR=/foo/bar if your PCRE header and library files are in
 # /foo/bar/include and /foo/bar/lib directories.
 #
 AC_ARG_WITH(libpcre,
-AS_HELP_STRING([--with-libpcre],[support Perl-compatible regexes (default is NO)])
+AS_HELP_STRING([--with-libpcre],[synonym for --with-libpcre1]),
+    if test "$withval" = "no"; then
+        USE_LIBPCRE1=
+    elif test "$withval" = "yes"; then
+        USE_LIBPCRE1=YesPlease
+    else
+        USE_LIBPCRE1=YesPlease
+        LIBPCREDIR=$withval
+        AC_MSG_NOTICE([Setting LIBPCREDIR to $LIBPCREDIR])
+        dnl USE_LIBPCRE1 can still be modified below, so don't substitute
+        dnl it yet.
+        GIT_CONF_SUBST([LIBPCREDIR])
+    fi)
+
+AC_ARG_WITH(libpcre1,
+AS_HELP_STRING([--with-libpcre1],[support Perl-compatible regexes via libpcre1 (default is NO)])
+AS_HELP_STRING([], [ARG can be also prefix for libpcre library and headers]),
+    if test "$withval" = "no"; then
+        USE_LIBPCRE1=
+    elif test "$withval" = "yes"; then
+        USE_LIBPCRE1=YesPlease
+    else
+        USE_LIBPCRE1=YesPlease
+        LIBPCREDIR=$withval
+        AC_MSG_NOTICE([Setting LIBPCREDIR to $LIBPCREDIR])
+        dnl USE_LIBPCRE1 can still be modified below, so don't substitute
+        dnl it yet.
+        GIT_CONF_SUBST([LIBPCREDIR])
+    fi)
+
+AC_ARG_WITH(libpcre2,
+AS_HELP_STRING([--with-libpcre2],[support Perl-compatible regexes via libpcre2 (default is NO)])
 AS_HELP_STRING([], [ARG can be also prefix for libpcre library and headers]),
+    if test -n "$USE_LIBPCRE1"; then
+        AC_MSG_ERROR([Only supply one of --with-libpcre1 or --with-libpcre2!])
+    fi
+
     if test "$withval" = "no"; then
-        USE_LIBPCRE=
+        USE_LIBPCRE2=
     elif test "$withval" = "yes"; then
-        USE_LIBPCRE=YesPlease
+        USE_LIBPCRE2=YesPlease
     else
-        USE_LIBPCRE=YesPlease
+        USE_LIBPCRE2=YesPlease
         LIBPCREDIR=$withval
         AC_MSG_NOTICE([Setting LIBPCREDIR to $LIBPCREDIR])
-        dnl USE_LIBPCRE can still be modified below, so don't substitute
+        dnl USE_LIBPCRE2 can still be modified below, so don't substitute
         dnl it yet.
         GIT_CONF_SUBST([LIBPCREDIR])
     fi)
@@ -501,13 +541,11 @@ GIT_CONF_SUBST([NEEDS_SSL_WITH_CRYPTO])
 GIT_CONF_SUBST([NO_OPENSSL])
 
 #
-# Define USE_LIBPCRE if you have and want to use libpcre. Various
-# commands such as log and grep offer runtime options to use
-# Perl-compatible regular expressions instead of standard or extended
-# POSIX regular expressions.
+# Handle the USE_LIBPCRE1 and USE_LIBPCRE2 options potentially set
+# above.
 #
 
-if test -n "$USE_LIBPCRE"; then
+if test -n "$USE_LIBPCRE1"; then
 
 GIT_STASH_FLAGS($LIBPCREDIR)
 
@@ -517,7 +555,22 @@ AC_CHECK_LIB([pcre], [pcre_version],
 
 GIT_UNSTASH_FLAGS($LIBPCREDIR)
 
-GIT_CONF_SUBST([USE_LIBPCRE])
+GIT_CONF_SUBST([USE_LIBPCRE1])
+
+fi
+
+
+if test -n "$USE_LIBPCRE2"; then
+
+GIT_STASH_FLAGS($LIBPCREDIR)
+
+AC_CHECK_LIB([pcre2-8], [pcre2_config_8],
+[USE_LIBPCRE2=YesPlease],
+[USE_LIBPCRE2=])
+
+GIT_UNSTASH_FLAGS($LIBPCREDIR)
+
+GIT_CONF_SUBST([USE_LIBPCRE2])
 
 fi

grep.c

Lines changed: 145 additions & 0 deletions
@@ -179,22 +179,37 @@ static void grep_set_pattern_type_option(enum grep_pattern_type pattern_type, st
 	case GREP_PATTERN_TYPE_BRE:
 		opt->fixed = 0;
 		opt->pcre1 = 0;
+		opt->pcre2 = 0;
 		break;
 
 	case GREP_PATTERN_TYPE_ERE:
 		opt->fixed = 0;
 		opt->pcre1 = 0;
+		opt->pcre2 = 0;
 		opt->regflags |= REG_EXTENDED;
 		break;
 
 	case GREP_PATTERN_TYPE_FIXED:
 		opt->fixed = 1;
 		opt->pcre1 = 0;
+		opt->pcre2 = 0;
 		break;
 
 	case GREP_PATTERN_TYPE_PCRE:
 		opt->fixed = 0;
+#ifdef USE_LIBPCRE2
+		opt->pcre1 = 0;
+		opt->pcre2 = 1;
+#else
+		/*
+		 * It's important that pcre1 always be assigned to
+		 * even when there's no USE_LIBPCRE* defined. We still
+		 * call the PCRE stub function, it just dies with
+		 * "cannot use Perl-compatible regexes[...]".
+		 */
 		opt->pcre1 = 1;
+		opt->pcre2 = 0;
+#endif
 		break;
 	}
 }
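The compile-time choice made in the GREP_PATTERN_TYPE_PCRE case above can be isolated in a small sketch. `struct engine_flags` and `select_pcre_engine()` are hypothetical names used only for this illustration; the point is that exactly one of the two flags ends up set, and that the v1 flag is also the fallback when no PCRE library was configured at all, so that the v1 stub can report the error.

```c
#include <assert.h>

/* Hypothetical two-field subset of the option struct touched by the hunk. */
struct engine_flags {
	int pcre1;
	int pcre2;
};

/* Pick the PCRE engine at compile time; exactly one flag is set on return. */
static void select_pcre_engine(struct engine_flags *f)
{
	assert(f != 0);
#ifdef USE_LIBPCRE2
	f->pcre1 = 0;
	f->pcre2 = 1;
#else
	/*
	 * Taken for USE_LIBPCRE1 builds and for builds with no PCRE
	 * library at all; in the latter case the v1 stub is what fails.
	 */
	f->pcre1 = 1;
	f->pcre2 = 0;
#endif
}
```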
@@ -446,6 +461,127 @@ static void free_pcre1_regexp(struct grep_pat *p)
 }
 #endif /* !USE_LIBPCRE1 */
 
+#ifdef USE_LIBPCRE2
+static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt)
+{
+	int error;
+	PCRE2_UCHAR errbuf[256];
+	PCRE2_SIZE erroffset;
+	int options = PCRE2_MULTILINE;
+	const uint8_t *character_tables = NULL;
+	int jitret;
+
+	assert(opt->pcre2);
+
+	p->pcre2_compile_context = NULL;
+
+	if (opt->ignore_case) {
+		if (has_non_ascii(p->pattern)) {
+			character_tables = pcre2_maketables(NULL);
+			p->pcre2_compile_context = pcre2_compile_context_create(NULL);
+			pcre2_set_character_tables(p->pcre2_compile_context, character_tables);
+		}
+		options |= PCRE2_CASELESS;
+	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE2_UTF;
+
+	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
+					 p->patternlen, options, &error, &erroffset,
+					 p->pcre2_compile_context);
+
+	if (p->pcre2_pattern) {
+		p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, NULL);
+		if (!p->pcre2_match_data)
+			die("Couldn't allocate PCRE2 match data");
+	} else {
+		pcre2_get_error_message(error, errbuf, sizeof(errbuf));
+		compile_regexp_failed(p, (const char *)&errbuf);
+	}
+
+	pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
+	if (p->pcre2_jit_on == 1) {
+		jitret = pcre2_jit_compile(p->pcre2_pattern, PCRE2_JIT_COMPLETE);
+		if (jitret)
+			die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", p->pattern, jitret);
+		p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL);
+		if (!p->pcre2_jit_stack)
+			die("Couldn't allocate PCRE2 JIT stack");
+		p->pcre2_match_context = pcre2_match_context_create(NULL);
+		if (!p->pcre2_match_context)
+			die("Couldn't allocate PCRE2 match context");
+		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
+	} else if (p->pcre2_jit_on != 0) {
+		die("BUG: The pcre2_jit_on variable should be 0 or 1, not %d",
+		    p->pcre2_jit_on);
+	}
+}
+
+static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
+		regmatch_t *match, int eflags)
+{
+	int ret, flags = 0;
+	PCRE2_SIZE *ovector;
+	PCRE2_UCHAR errbuf[256];
+
+	if (eflags & REG_NOTBOL)
+		flags |= PCRE2_NOTBOL;
+
+	if (p->pcre2_jit_on)
+		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
+				      eol - line, 0, flags, p->pcre2_match_data,
+				      NULL);
+	else
+		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
+				  eol - line, 0, flags, p->pcre2_match_data,
+				  NULL);
+
+	if (ret < 0 && ret != PCRE2_ERROR_NOMATCH) {
+		pcre2_get_error_message(ret, errbuf, sizeof(errbuf));
+		die("%s failed with error code %d: %s",
+		    (p->pcre2_jit_on ? "pcre2_jit_match" : "pcre2_match"), ret,
+		    errbuf);
+	}
+	if (ret > 0) {
+		ovector = pcre2_get_ovector_pointer(p->pcre2_match_data);
+		ret = 0;
+		match->rm_so = (int)ovector[0];
+		match->rm_eo = (int)ovector[1];
+	}
+
+	return ret;
+}
+
+static void free_pcre2_pattern(struct grep_pat *p)
+{
+	pcre2_compile_context_free(p->pcre2_compile_context);
+	pcre2_code_free(p->pcre2_pattern);
+	pcre2_match_data_free(p->pcre2_match_data);
+	pcre2_jit_stack_free(p->pcre2_jit_stack);
+	pcre2_match_context_free(p->pcre2_match_context);
+}
+#else /* !USE_LIBPCRE2 */
+static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt)
+{
+	/*
+	 * Unreachable until USE_LIBPCRE2 becomes synonymous with
+	 * USE_LIBPCRE. See the sibling comment in
+	 * grep_set_pattern_type_option().
+	 */
+	die("cannot use Perl-compatible regexes when not compiled with USE_LIBPCRE");
+}
+
+static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
+		regmatch_t *match, int eflags)
+{
+	return 1;
+}
+
+static void free_pcre2_pattern(struct grep_pat *p)
+{
+}
+#endif /* !USE_LIBPCRE2 */
+
 static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	struct strbuf sb = STRBUF_INIT;
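The return-value convention used by pcre2match() above (positive PCRE2 result becomes 0 plus filled-in offsets, "no match" stays non-zero so the caller can write `hit = !pcre2match(...)`) can be shown without linking against the library. In this sketch `struct span`, `SKETCH_NOMATCH`, and `to_posix_result()` are hypothetical stand-ins for `regmatch_t`, `PCRE2_ERROR_NOMATCH` (which is -1 in PCRE2), and the tail of pcre2match(); the error path that calls die() is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for regmatch_t: start and end byte offsets. */
struct span {
	long so;
	long eo;
};

/* Assumption for this sketch: same value as PCRE2_ERROR_NOMATCH. */
#define SKETCH_NOMATCH (-1)

/*
 * Translate a PCRE2-style result code into the POSIX-style convention:
 * 0 means "matched" (and *match is filled from the first ovector pair),
 * anything else means "did not match".
 */
static int to_posix_result(int rc, const size_t *ovector, struct span *match)
{
	if (rc > 0) {
		match->so = (long)ovector[0];
		match->eo = (long)ovector[1];
		return 0;
	}
	return rc;
}
```

With an ovector of {3, 7} and a result code of 1, the function returns 0 and reports the match as bytes 3 to 7; with `SKETCH_NOMATCH` the span is left untouched.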
@@ -511,6 +647,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		return;
 	}
 
+	if (opt->pcre2) {
+		compile_pcre2_pattern(p, opt);
+		return;
+	}
+
 	if (opt->pcre1) {
 		compile_pcre1_regexp(p, opt);
 		return;
@@ -870,6 +1011,8 @@ void free_grep_patterns(struct grep_opt *opt)
 				kwsfree(p->kws);
 			else if (p->pcre1_regexp)
 				free_pcre1_regexp(p);
+			else if (p->pcre2_pattern)
+				free_pcre2_pattern(p);
 			else
 				regfree(&p->regexp);
 			free(p->pattern);
@@ -950,6 +1093,8 @@ static int patmatch(struct grep_pat *p, char *line, char *eol,
 		hit = !fixmatch(p, line, eol, match);
 	else if (p->pcre1_regexp)
 		hit = !pcre1match(p, line, eol, match, eflags);
+	else if (p->pcre2_pattern)
+		hit = !pcre2match(p, line, eol, match, eflags);
 	else
 		hit = !regexec_buf(&p->regexp, line, eol - line, 1, match,
 				   eflags);
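Both free_grep_patterns() and patmatch() pick the engine by checking which per-pattern pointer is populated, with POSIX regexes as the fallback. A minimal sketch of that dispatch order follows; `struct pat_sketch`, its `fixed_matcher` member, and `pick_engine()` are hypothetical names, and the condition guarding the fixed-string branch is an assumption, since that line falls outside the hunks shown.

```c
#include <assert.h>
#include <stddef.h>

enum engine { ENGINE_FIXED, ENGINE_PCRE1, ENGINE_PCRE2, ENGINE_POSIX };

/* Hypothetical subset of the pattern struct: one pointer per engine. */
struct pat_sketch {
	void *fixed_matcher;	/* assumed guard for the fixed-string branch */
	void *pcre1_regexp;
	void *pcre2_pattern;
};

/* Same if/else-if order as the hunks: fixed, PCRE v1, PCRE v2, then POSIX. */
static enum engine pick_engine(const struct pat_sketch *p)
{
	if (p->fixed_matcher)
		return ENGINE_FIXED;
	else if (p->pcre1_regexp)
		return ENGINE_PCRE1;
	else if (p->pcre2_pattern)
		return ENGINE_PCRE2;
	else
		return ENGINE_POSIX;
}
```

Because only one engine compiles any given pattern, at most one of these pointers should be non-NULL in practice, so the order of the v1 and v2 checks should not change which branch is taken.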
