PCRE2 Different Behavior Depending On Optimization Level #147

Closed
pegvin opened this issue Sep 24, 2022 · 9 comments

@pegvin

pegvin commented Sep 24, 2022

I'm using PCRE2 for syntax highlighting in my project, aru. Everything works fine, except that PCRE2 appears to behave differently when certain optimizations are enabled.

This is the video demonstrating the problem:

2022-09-23.19-35-59.mp4

Flags For Debug Build:

-std=c99 -Wall -Wextra -O0 -g -pedantic
-Wno-unknown-pragma -Wno-unused-function
-fsanitize=address -fsanitize=undefined
-DIS_DEBUG -DLOG_USE_COLOR -DPCRE2_STATIC -DHAVE_CONFIG_H
-DPCRE2_CODE_UNIT_WIDTH=8 -DSUPPORT_UNICODE -DSUPPORT_UTF8
-lm -lasan -lubsan

Flags For Release Build:

-std=c99 -Wall -Os
-Wno-unknown-pragma
-DLOG_USE_COLOR -DPCRE2_STATIC -DHAVE_CONFIG_H
-DPCRE2_CODE_UNIT_WIDTH=8 -DSUPPORT_UNICODE -DSUPPORT_UTF8
-lm -lasan -lubsan

I'm not building PCRE2 with the CMake setup or any other build system provided in the repository; instead, I compile these specific files directly:

config.h pcre2.h pcre2_auto_possess.c pcre2_chartables.c
pcre2_compile.c pcre2_config.c pcre2_context.c pcre2_convert.c
pcre2_dfa_match.c pcre2_error.c pcre2_extuni.c pcre2_find_bracket.c
pcre2_fuzzsupport.c pcre2_internal.h pcre2_intmodedep.h pcre2_jit_compile.c
pcre2_jit_match.c pcre2_jit_misc.c pcre2_maketables.c pcre2_match.c
pcre2_match_data.c pcre2_newline.c pcre2_ord2utf.c pcre2_pattern_info.c
pcre2posix.c pcre2posix.h pcre2_script_run.c pcre2_serialize.c pcre2_string_utils.c
pcre2_study.c pcre2_substitute.c pcre2_substring.c pcre2_tables.c
pcre2_ucd.c pcre2_ucp.h pcre2_ucptables.c pcre2_valid_utf.c pcre2_xclass.c
@zherczeg
Collaborator

Please be more specific. PCRE2 has a test program called pcre2test; you can show the problematic pattern/input pairs there.

@pegvin
Author

pegvin commented Sep 24, 2022

@zherczeg I've updated the issue. I'll try pcre2test.

@pegvin
Author

pegvin commented Sep 24, 2022

pcre2test works fine, but for some reason the same compiled library behaves strangely with my code. I'm only using ASCII.

@zherczeg
Collaborator

I am sorry but we don't really know about the internals of your system, and the description is too generic (a difference between -O0 and -Os). We need some pattern/input pair, and some compile / match flags to work with.

@pegvin
Author

pegvin commented Sep 24, 2022

These are the patterns I'm using:

filePattern = \.([ch](pp|xx)?|C|cc|c\+\+|cu|H|hh|ii?)$  # File Extensions
pattern1 = //.* # Comment //
pattern2 = (^#define*)|(^#include*)|(^#if*)|(^#ifndef*)|(^#ifdef*)|(^#endif*)|(^#elif*)|(^#else*)|(^#elseif*)|(^#warning*)|(^#error*) # Pre-Processor Directives
pattern3 = (auto|bool|char|const|double|enum|extern|float|inline|int|long|restrict|short|signed|sizeof|static|struct|typedef|union|unsigned|void) # Keywords
pattern4 = ([[:lower:]][[:lower:]_]*|(u_?)?int(8|16|32|64))_t # Types Like uint8_t
pattern5 = (if|else|for|while|do|switch|case|default) # Keywords
pattern6 = [A-Z_][0-9A-Z_]*
pattern7 = ^[[:blank:]]*[A-Z_a-z][0-9A-Z_a-z]*:[[:blank:]]*$
pattern9 = (class|explicit|friend|mutable|namespace|override|private|protected|public|register|template|this|typename|using|virtual|volatile) # Keywords
pattern10 = (try|throw|catch|operator|new|delete) # Keywords
pattern11 = (break|continue|goto|return) # Keywords
pattern14 = <[^>]+> # For <header.h> in #include <header.h>

I'm compiling these regexes with the pcre2_compile function, using the PCRE2_UTF and PCRE2_MULTILINE flags.
I'm matching them with the pcre2_match function, using the PCRE2_NO_JIT flag.

After matching, I check whether there are any matches; if so, I use this logic:

for (int i = 0; i < rc; i++) {
	long int start = ovector[i], end = ovector[i + 1];
	if (start < 0 || end < 0)
		continue;

	if (callback)
		callback(start < end ? start : end, start < end ? end : start, data); // Basically Pass The Smaller Value As Start & Bigger Value as End
		totalFound++;
		printf(", RC: %d, Start: %ld, End: %ld\n", rc, start, end);
}

where rc is the return value of pcre2_match and ovector is the return value of pcre2_get_ovector_pointer.

While iterating over the values, I call a callback function, passing it the start and end index of each match in the string.

@zherczeg
Collaborator

Thanks, this is more helpful. Can you check which specific pattern fails, and what the ovector contains in that case?

@PhilipHazel
Collaborator

Is this still an issue?

@pegvin
Author

pegvin commented Nov 24, 2022

It is indeed.

@NWilson NWilson added the bug Something isn't working label Dec 8, 2024
@NWilson NWilson added this to the 10.46 milestone Jan 8, 2025
@NWilson
Member

NWilson commented Feb 4, 2025

@pegvin I have looked at your code.

PCRE2 does not behave differently based on the optimisation level. What you are observing is that your code is behaving differently.

	int rc = pcre2_match(p->re, (PCRE2_SPTR)str, strlen(str), 0, PCRE2_NO_JIT, p->md == NULL ? matchData : p->md, NULL);
	if (rc < 0) {
#if IS_DEBUG
		if (rc == PCRE2_ERROR_NOMATCH) {
			log_warn("No Matches Found!");
		} else {
			PCRE2_UCHAR buffer[120];
			pcre2_get_error_message(rc, buffer, sizeof(buffer));
			log_error("regex matching error %d: %s in regex: %s", rc, buffer, str);
		}
	} else if (rc == 1) {
		log_warn("No Matches Found!");
#endif
	} else {
		PCRE2_SIZE* ovector = pcre2_get_ovector_pointer(p->md == NULL ? matchData : p->md);

In a Debug build, aru rejects the case of rc = 1, but in a release build, this case is passed on to the callback.
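
One way to avoid this class of bug is to keep the branch structure outside the conditional block and put only the logging inside it, so that both builds take the same path for every value of rc. The following is a minimal sketch under assumptions, not aru's actual code: `handle_result` and the `RESULT_*` names are hypothetical, `fprintf` stands in for aru's `log_warn`/`log_error`, and the literal `-1` mirrors the value of `PCRE2_ERROR_NOMATCH` so the sketch compiles without the PCRE2 headers.

```c
#include <stdio.h>

#ifndef IS_DEBUG
#define IS_DEBUG 0  /* assumption: debug builds pass -DIS_DEBUG, release builds do not */
#endif

/* Hypothetical outcome codes, for illustration only. */
enum { RESULT_ERROR = -1, RESULT_NO_MATCH = 0, RESULT_MATCH = 1 };

/* Classify the return value of pcre2_match identically in every build. */
static int handle_result(int rc)
{
    if (rc < 0) {
#if IS_DEBUG
        /* Only the diagnostics are conditional, never the control flow. */
        fprintf(stderr, "pcre2_match returned %d\n", rc);
#endif
        /* -1 is the value of PCRE2_ERROR_NOMATCH; other negatives are errors. */
        return rc == -1 ? RESULT_NO_MATCH : RESULT_ERROR;
    }
    if (rc == 0) {
        /* 0 means the ovector was too small to hold all captured groups. */
        return RESULT_ERROR;
    }
    /* rc >= 1: the pattern matched; rc == 1 is a match with no groups set. */
    return RESULT_MATCH;
}
```

With this shape, `handle_result(1)` yields `RESULT_MATCH` whether or not `IS_DEBUG` is defined, so a debug build and a release build can no longer disagree about which results reach the callback.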

It also looks like you have not quite understood the meaning of the rc variable. It takes the values 1,2,3,... to indicate "one plus the highest-numbered capture group". So, if there are no capture groups matched, you get rc = 1 (but the pattern as a whole does match). If there is a single capture group captured, you get rc = 2.

Your for-loop should be:

for (int i = 0; i < rc; i++) {
  long int start = ovector[2*i], end = ovector[2*i + 1];

rc tells you, "this many pairs of elements in the ovector can be used".
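
To make the pairing concrete, here is a minimal sketch of the corrected loop in plain C. It does not call PCRE2: the `ovector` array is filled in by hand to stand in for what `pcre2_get_ovector_pointer` would return, `UNSET` mirrors `PCRE2_UNSET` (all bits set in a `size_t`), and `collect_spans` and `span_t` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET: the offset stored for a group that did not participate. */
#define UNSET (~(size_t)0)

/* Hypothetical helper type: one half-open match range [start, end). */
typedef struct { size_t start, end; } span_t;

/*
 * Copy every set (start, end) pair out of an ovector.
 * rc is a positive return value from pcre2_match, i.e. the number of usable pairs.
 * Pair i lives at ovector[2*i] and ovector[2*i + 1]; pair 0 is the whole match.
 * Returns the number of spans written to out (at most max_out).
 */
static size_t collect_spans(const size_t *ovector, int rc,
                            span_t *out, size_t max_out)
{
    size_t found = 0;
    for (int i = 0; i < rc && found < max_out; i++) {
        size_t start = ovector[2 * i];
        size_t end = ovector[2 * i + 1];
        if (start == UNSET || end == UNSET)
            continue;  /* group i did not take part in the match */
        out[found].start = start;
        out[found].end = end;
        found++;
    }
    return found;
}
```

As an illustration, matching `#include <stdio.h>` against pattern2 would be expected to give rc = 3, because only the second group `(^#include*)` participates, and an ovector beginning `{0, 8, UNSET, UNSET, 0, 8}`. The sketch skips the unset pair and returns two spans, both covering offsets 0 to 8. Comparing against `UNSET` directly also avoids relying on the conversion of an unsigned offset to a signed `long int`.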

@NWilson NWilson closed this as completed Feb 4, 2025