-
Notifications
You must be signed in to change notification settings - Fork 584
fix(instrumentation-mysql2)!: Missing masking of sql queries #2732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(instrumentation-mysql2)!: Missing masking of sql queries #2732
Conversation
Issue: open-telemetry#1758 extend the MySQL2InstrumentationConfig with two variables: * maskStatementHook → This hook is used to escape SQL statements (optional) * maskStatement → Boolean (default: true), determines whether the maskStatementHook is applied at all default maskStatementHook: `return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?')`
…was parameterized. Added tests to confirm this behaviour
This comment was marked as outdated.
This comment was marked as outdated.
Please disregard the previous comment by the GH action - this component actually maintained by @raphael-theriault-swi now. I forgot to update the workflow. I'll open a PR to make sure this comment does not pop up again for this instrumentation. |
return query | ||
.replace(/\b\d+\b/g, '?') | ||
.replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no regex expert so this might be something obvious, but what's the reasoning for using a lookahead here rather than just a literal backslash ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me neither, but at least my understanding of regex is that the lookahead ((?=(\?))) is used to detect an optional backslash without consuming the next character right away. This ensures that escaped quotes (like ") inside a string don’t cause the pattern to end prematurely. By capturing the optional backslash in a group (\2) and then matching it with \2., both escaped and regular characters are handled consistently. Using a direct backslash match (e.g., \?) would make it harder to distinguish between cases where a backslash is present or not, leading to a more complex and less readable regex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Could you maybe just add a comment to describe the behaviour of the default hook so future contributors don't have to parse the regex to understand what's happening ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the following comment as js-doc at the defaultMaskingHook
/**
* Replaces numeric values and quoted strings in the query with placeholders ('?').
*
* - `\b\d+\b`: Matches whole numbers (integers) and replaces them with '?'.
* - `(["'])(?:(?=(\\?))\2.)*?\1`:
* - Matches quoted strings (both single `'` and double `"` quotes).
* - Uses a lookahead `(?=(\\?))` to detect an optional backslash without consuming it immediately.
* - Captures the optional backslash `\2` and ensures escaped quotes inside the string are handled correctly.
* - Ensures that only complete quoted strings are replaced with '?'.
*
* This prevents accidental replacement of escaped quotes within strings and ensures that the
* query structure remains intact while masking sensitive data.
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you !
let formattedQuery: string; | ||
try { | ||
if (maskStatement) { | ||
formattedQuery = typeof query === 'string' ? query : query.sql; | ||
return maskStatementHook(formattedQuery); | ||
} | ||
|
||
if (!format) { | ||
formattedQuery = typeof query === 'string' ? query : query.sql; | ||
} else if (typeof query === 'string') { | ||
formattedQuery = values ? format(query, values) : query; | ||
} else { | ||
// According to https://github.com/mysqljs/mysql#performing-queries | ||
// The values argument will override the values in the option object. | ||
formattedQuery = | ||
values || (query as QueryOptions).values | ||
? format(query.sql, values || (query as QueryOptions).values) | ||
: query.sql; | ||
} | ||
return formattedQuery; | ||
} catch (e) { | ||
return 'Could not determine the query due to an error in masking or formatting'; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this series of overlapping conditionals to be really hard to follow, and I feel like it could definitely be simplified. Something like this maybe ? I'm also not sure there are any reasons to keep the formatting logic at all apart from backwards compatibility since it does the exact opposite of what the spec asks for.
let formattedQuery: string; | |
try { | |
if (maskStatement) { | |
formattedQuery = typeof query === 'string' ? query : query.sql; | |
return maskStatementHook(formattedQuery); | |
} | |
if (!format) { | |
formattedQuery = typeof query === 'string' ? query : query.sql; | |
} else if (typeof query === 'string') { | |
formattedQuery = values ? format(query, values) : query; | |
} else { | |
// According to https://github.com/mysqljs/mysql#performing-queries | |
// The values argument will override the values in the option object. | |
formattedQuery = | |
values || (query as QueryOptions).values | |
? format(query.sql, values || (query as QueryOptions).values) | |
: query.sql; | |
} | |
return formattedQuery; | |
} catch (e) { | |
return 'Could not determine the query due to an error in masking or formatting'; | |
} | |
const [querySql, queryValues] = typeof query === 'string' | |
? [query, values] | |
// According to https://github.com/mysqljs/mysql#performing-queries | |
// The values argument will override the values in the option object. | |
: [query.sql, values || query.values]; | |
try { | |
if (maskStatement) { | |
return maskStatementHook(querySql); | |
} else if (format && queryValues) { | |
return format(querySql, queryValues); | |
} else { | |
return querySql; | |
} | |
} catch (e) { | |
return 'Could not determine the query due to an error in masking or formatting'; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted it, but added the hasValues
Typecheck as well, since Typescript was complaining otherwise (Query don't has the properties values).
The formatting is actual needed, since its replacing the placeholders with the real values. If someone is setting maskStatement to 'false' I assume, the intent is, that all queries are pushed unmasked into the trace, not only already unmasked Queries. Therefore we need to resolve the parameterized Queries as well. I can remove it, if you think, that this is not beneficial at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the same as QueryOptions
cast as the previous code would have been fine here, or changing the signature of hasValues
to take Query | QueryOptions
and dropping the typeof check which I think is redundant. Not a big enough deal for me not to approve though.
} catch (e) { | ||
return 'Could not determine the query due to an error in masking or formatting'; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the rationale here is that we don't want to accidentally leak information by including the error message in the attribute value ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excatly. Since there could be some stacktrace with the actual string (like 'abc' ist not valid), its better to have this generic errormessage. Also, since the user can provide its own maskingHook, we never know, if there is any problematic code, which forces the leakage of information trough stacktrace. Therefore I thought this is the best way, to prevent this from happening. I can still adjust the errormessage, if you think its not clear enough :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current message seems perfect to me
* Adjusted getDbStatement function, so that its easier to understand * Added comment for defaultMaskingHook, so that its regex and its intent is easier to understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, left some very minor suggestions but I'm happy with this either way !
} catch (e) { | ||
return 'Could not determine the query due to an error in masking or formatting'; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current message seems perfect to me
return query | ||
.replace(/\b\d+\b/g, '?') | ||
.replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you !
let formattedQuery: string; | ||
try { | ||
if (maskStatement) { | ||
formattedQuery = typeof query === 'string' ? query : query.sql; | ||
return maskStatementHook(formattedQuery); | ||
} | ||
|
||
if (!format) { | ||
formattedQuery = typeof query === 'string' ? query : query.sql; | ||
} else if (typeof query === 'string') { | ||
formattedQuery = values ? format(query, values) : query; | ||
} else { | ||
// According to https://github.com/mysqljs/mysql#performing-queries | ||
// The values argument will override the values in the option object. | ||
formattedQuery = | ||
values || (query as QueryOptions).values | ||
? format(query.sql, values || (query as QueryOptions).values) | ||
: query.sql; | ||
} | ||
return formattedQuery; | ||
} catch (e) { | ||
return 'Could not determine the query due to an error in masking or formatting'; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the same as QueryOptions
cast as the previous code would have been fine here, or changing the signature of hasValues
to take Query | QueryOptions
and dropping the typeof check which I think is redundant. Not a big enough deal for me not to approve though.
Thanks :) Github still claims, that a Review from a Maintainer ist missing 🤔 Do I have to do soemthing else, or is it still the Bug mentioned by @pichlermarc ? |
Approvers are still the ones doing the workflow approval and merging, so this is still waiting on Marc or someone else to merge yes. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2732 +/- ##
==========================================
+ Coverage 89.05% 89.07% +0.02%
==========================================
Files 188 188
Lines 9189 9197 +8
Branches 1891 1890 -1
==========================================
+ Hits 8183 8192 +9
+ Misses 1006 1005 -1
🚀 New features to boost your workflow:
|
Hmm, looks like the lint step is failing. |
Would appreciate the changes to |
No problem, I'll do both tomorrow morning (MEZ, UTC +1) |
Done, it was complaining about an unnecessary type annotation in maskStatement : boolean = true, since it could already infer the type from the assigned value true. |
I added it in the current commit :) |
Hi @gflachs , thank you for working on this. Also, since this could cause a performance issue, I don't think we should have |
Hey @maryliag, I’ve put together a quick benchmark to dig into this: const { performance } = require('perf_hooks');
const fs = require('fs');
const os = require('os');
function getSystemInfo() {
let cpuLimit = "Unlimited";
let memLimit = "Unlimited";
try {
const cpuData = fs.readFileSync('/sys/fs/cgroup/cpu.max', 'utf8').trim().split(' ');
if (cpuData[0] !== 'max') {
cpuLimit = (parseInt(cpuData[0]) / 1000) + " ms CPU time per 100ms";
}
} catch (err) {
console.warn("⚠️ Could not retrieve CPU limit.");
}
try {
const memData = fs.readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
if (memData !== 'max') {
memLimit = (parseInt(memData) / (1024 ** 2)).toFixed(2) + " MB";
}
} catch (err) {
console.warn("⚠️ Could not retrieve memory limit.");
}
return {
cpu: os.cpus()[0].model,
cores: os.cpus().length,
totalMemGB: (os.totalmem() / (1024 ** 3)).toFixed(2) + " GB",
platform: os.platform(),
release: os.release(),
containerCPU: cpuLimit,
containerMemory: memLimit
};
}
function regexMasking(query) {
return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?');
}
function generateLargeSQLQuery(size) {
let query = "SELECT id, name, age FROM users WHERE age = 25 AND name = 'John Doe' ";
while (query.length < size) {
query += "UNION SELECT id, 'Random Name', 30 FROM users WHERE id = " + Math.floor(Math.random() * 1000) + " ";
}
return query.slice(0, size);
}
function benchmarkFunction(fn, query, iterations = 3) {
let totalTime = 0;
for (let i = 0; i < iterations; i++) {
let start = performance.now();
fn(query);
totalTime += (performance.now() - start);
}
return (totalTime / iterations).toFixed(2);
}
function benchmark() {
const sizes = [100, 1000, 10000, 50000, 100000, 1000000];
console.log("\n🔍 **System Specifications:**\n", JSON.stringify(getSystemInfo(), null, 2));
console.log("\n🚀 **Benchmark: SQL Masking Performance**");
console.log("| SQL Size | Regex (ms) |");
console.log("|----------|------------|");
for (let size of sizes) {
const query = generateLargeSQLQuery(size);
let regexTime = benchmarkFunction(regexMasking, query);
console.log(`| ${size.toLocaleString()} | ${regexTime} |`);
}
}
benchmark(); Benchmarking within a docker-container to simulate different server-enviroments: docker run --rm -it --cpus="0.25" --memory="128m" benchmark_sql:latest 🔍 System Specifications: 🚀 Benchmark: SQL Masking Performance
docker run --rm -it --cpus="0.1" --memory="50m" benchmark_sql:latest 🔍 System Specifications: 🚀 Benchmark: SQL Masking Performance
docker run --rm -it --cpus="0.5" --memory="1g" benchmark_sql:latest 🔍 System Specifications: 🚀 Benchmark: SQL Masking Performance
My TakeI hear your concern about regex and replacements being costly, especially with large queries like the 20k+ character ones you mentioned. The benchmarks show that for smaller queries (up to 10k characters), the impact is negligible (<0.1 ms). For larger queries (e.g., 1M characters), it can climb to 5–67 ms depending on resources, though a 20k-character query would likely fall in the 0.1–0.2 ms range—still fast, but potentially noticeable under heavy load. That said, I’d like to push back a bit on setting maskStatement to false by default. Data security and privacy are critical, and masking sensitive data (like numbers and strings) in outputs is a key safeguard. The performance hit, even in extreme cases, seems manageable in most realistic scenarios, and I’d argue the security benefits outweigh the cost. Turning it off by default could expose sensitive data unnecessarily, which feels riskier than a slight performance dip. |
Thank you for performing the tests. Let me clarify my suggestion, I think the default should be:
then they have the option to:
in the future we want to take advantage of the config file, so there people will be able to select one of the options:
we can even have other performance improvements, such as if query is up to X characters mask it, otherwise remove completely, or even cut to X characters and then mask it My main concern at the moment is to make this behaviour the default, and a lot of people use ORMs and those tend to create really huge queries |
I'd be curious to see the delta between the masking and |
Hi @maryliag @raphael-theriault-swi - just double-checking since I've passed by this PR multiple times on my ready-to-merge list: is this PR ready to merge, or is some additional input/changes needed from @gflachs? (to match semconv guidance or other things) 🙂 |
It's not ready to merge. Changes needed: the default value should be |
Please read my comment for the purpose of the maskStatement flag :) |
This is why I said that the it needs to be aligned, when the value is |
@@ -48,6 +48,9 @@ You can set the following instrumentation options: | |||
| ------- | ---- | ----------- | | |||
| `responseHook` | `MySQL2InstrumentationExecutionResponseHook` (function) | Function for adding custom attributes from db response | | |||
| `addSqlCommenterCommentToQueries` | `boolean` | If true, adds [sqlcommenter](https://github.com/open-telemetry/opentelemetry-sqlcommenter) specification compliant comment to queries with tracing context (default false). _NOTE: A comment will not be added to queries that already contain `--` or `/* ... */` in them, even if these are not actually part of comments_ | | |||
| `maskStatement` | `boolean` | If true, masks the `db.statement` attribute in spans (default true) with the `maskStatementHook` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the performance concerns around applying regex to large queries, I think this should not be on by default. In semconv the recommendation is to not include values from parameterized queries by default, which should remove the need to parse the string query text. This should only apply to non-parameterized queries. Now that the database semconv is stable, a lot of our instrumentations need to be updated and the parameterized value stripping (or rather not-including) should be included there.
@gflachs I think this is quite close, just needs the off-by-default comment addressed |
…trib into fix!-Mysql2Instrumentation_Fix_Masking
I updated the default value of maskStatement to false, and made the correspondent changes in the doc and PR description. |
Which problem is this PR solving?
Based on Issue #1758 and my comment in the PR #1857 this PR is a quickfix for missing masking from sql queries.
Since OpenTelemetry instrumentation should remain as lightweight as possible, introducing a full SQL parser would be too heavy. As a result, the default masking approach is intentionally simplistic. However, users can define their own
maskStatementHook
, allowing them to leverage a SQL parser or other custom logic for more advanced masking if needed.Additionally, some use cases may require no masking at all. To accommodate this, the
maskStatement
setting allows users to disable masking entirely, preserving the original behavior of the MySQL2Instrumentation as it was before.Short description of the changes
This PR extends MySQL2InstrumentationConfig with two new options:
maskStatementHook
→ A customizable hook for masking SQL statements (optional). Default:maskStatement
→ A boolean flag (default: false) that determines whether maskStatementHook is applied at all.