Skip to content

fix(instrumentation-mysql2)!: Missing masking of sql queries #2732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gflachs
Copy link
Contributor

@gflachs gflachs commented Feb 25, 2025

Which problem is this PR solving?

Based on Issue #1758 and my comment in the PR #1857 this PR is a quickfix for missing masking from sql queries.

Since OpenTelemetry instrumentation should remain as lightweight as possible, introducing a full SQL parser would be too heavy. As a result, the default masking approach is intentionally simplistic. However, users can define their own maskStatementHook, allowing them to leverage a SQL parser or other custom logic for more advanced masking if needed.

Additionally, some use cases may require no masking at all. To accommodate this, the maskStatement setting allows users to disable masking entirely, preserving the original behavior of the MySQL2Instrumentation as it was before.

Short description of the changes

This PR extends MySQL2InstrumentationConfig with two new options:

  • maskStatementHook → A customizable hook for masking SQL statements (optional). Default:
 return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?')
  • maskStatement → A boolean flag (default: false) that determines whether maskStatementHook is applied at all.

Issue: open-telemetry#1758
extend the MySQL2InstrumentationConfig with two variables:
* maskStatementHook → This hook is used to escape SQL statements (optional)
* maskStatement → Boolean (default: true), determines whether the maskStatementHook is applied at all

default maskStatementHook:
`return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?')`
@gflachs gflachs requested a review from a team as a code owner February 25, 2025 11:10
Copy link

linux-foundation-easycla bot commented Feb 25, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@github-actions github-actions bot added pkg:instrumentation-mysql2 pkg-status:unmaintained This package is unmaintained. Only bugfixes may be acceped until a new owner has been found. labels Feb 25, 2025
@gflachs gflachs changed the title fix(instrumentation-mysql2)!:Missing masking of sql queries fix(instrumentation-mysql2)!: Missing masking of sql queries Feb 25, 2025

This comment was marked as outdated.

@pichlermarc pichlermarc removed pkg-status:unmaintained This package is unmaintained. Only bugfixes may be acceped until a new owner has been found. pkg-status:unmaintained:autoclose-scheduled labels Feb 26, 2025
@pichlermarc
Copy link
Member

pichlermarc commented Feb 26, 2025

Please disregard the previous comment by the GH action - this component actually maintained by @raphael-theriault-swi now. I forgot to update the workflow.

I'll open a PR to make sure this comment does not pop up again for this instrumentation.

Comment on lines 141 to 143
return query
.replace(/\b\d+\b/g, '?')
.replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?');

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no regex expert so this might be something obvious, but what's the reasoning for using a lookahead here rather than just a literal backslash ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me neither, but at least my understanding of regex is that the lookahead ((?=(\?))) is used to detect an optional backslash without consuming the next character right away. This ensures that escaped quotes (like ") inside a string don’t cause the pattern to end prematurely. By capturing the optional backslash in a group (\2) and then matching it with \2., both escaped and regular characters are handled consistently. Using a direct backslash match (e.g., \?) would make it harder to distinguish between cases where a backslash is present or not, leading to a more complex and less readable regex.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Could you maybe just add a comment to describe the behaviour of the default hook so future contributors don't have to parse the regex to understand what's happening ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the following comment as js-doc at the defaultMaskingHook

/**
 * Replaces numeric values and quoted strings in the query with placeholders ('?').
 *
 * - `\b\d+\b`: Matches whole numbers (integers) and replaces them with '?'.
 * - `(["'])(?:(?=(\\?))\2.)*?\1`:
 *   - Matches quoted strings (both single `'` and double `"` quotes).
 *   - Uses a lookahead `(?=(\\?))` to detect an optional backslash without consuming it immediately.
 *   - Captures the optional backslash `\2` and ensures escaped quotes inside the string are handled correctly.
 *   - Ensures that only complete quoted strings are replaced with '?'.
 *
 * This prevents accidental replacement of escaped quotes within strings and ensures that the
 * query structure remains intact while masking sensitive data.
 */

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !

Comment on lines 115 to 137
let formattedQuery: string;
try {
if (maskStatement) {
formattedQuery = typeof query === 'string' ? query : query.sql;
return maskStatementHook(formattedQuery);
}

if (!format) {
formattedQuery = typeof query === 'string' ? query : query.sql;
} else if (typeof query === 'string') {
formattedQuery = values ? format(query, values) : query;
} else {
// According to https://github.com/mysqljs/mysql#performing-queries
// The values argument will override the values in the option object.
formattedQuery =
values || (query as QueryOptions).values
? format(query.sql, values || (query as QueryOptions).values)
: query.sql;
}
return formattedQuery;
} catch (e) {
return 'Could not determine the query due to an error in masking or formatting';
}
Copy link
Member

@raphael-theriault-swi raphael-theriault-swi Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this series of overlapping conditionals to be really hard to follow, and I feel like it could definitely be simplified. Something like this maybe ? I'm also not sure there are any reasons to keep the formatting logic at all apart from backwards compatibility since it does the exact opposite of what the spec asks for.

Suggested change
let formattedQuery: string;
try {
if (maskStatement) {
formattedQuery = typeof query === 'string' ? query : query.sql;
return maskStatementHook(formattedQuery);
}
if (!format) {
formattedQuery = typeof query === 'string' ? query : query.sql;
} else if (typeof query === 'string') {
formattedQuery = values ? format(query, values) : query;
} else {
// According to https://github.com/mysqljs/mysql#performing-queries
// The values argument will override the values in the option object.
formattedQuery =
values || (query as QueryOptions).values
? format(query.sql, values || (query as QueryOptions).values)
: query.sql;
}
return formattedQuery;
} catch (e) {
return 'Could not determine the query due to an error in masking or formatting';
}
const [querySql, queryValues] = typeof query === 'string'
? [query, values]
// According to https://github.com/mysqljs/mysql#performing-queries
// The values argument will override the values in the option object.
: [query.sql, values || query.values];
try {
if (maskStatement) {
return maskStatementHook(querySql);
} else if (format && queryValues) {
return format(querySql, queryValues);
} else {
return querySql;
}
} catch (e) {
return 'Could not determine the query due to an error in masking or formatting';
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adjusted it, but added the hasValues Typecheck as well, since Typescript was complaining otherwise (Query don't has the properties values).

The formatting is actual needed, since its replacing the placeholders with the real values. If someone is setting maskStatement to 'false' I assume, the intent is, that all queries are pushed unmasked into the trace, not only already unmasked Queries. Therefore we need to resolve the parameterized Queries as well. I can remove it, if you think, that this is not beneficial at all.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same as QueryOptions cast as the previous code would have been fine here, or changing the signature of hasValues to take Query | QueryOptions and dropping the typeof check which I think is redundant. Not a big enough deal for me not to approve though.

Comment on lines 135 to 137
} catch (e) {
return 'Could not determine the query due to an error in masking or formatting';
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the rationale here is that we don't want to accidentally leak information by including the error message in the attribute value ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excatly. Since there could be some stacktrace with the actual string (like 'abc' ist not valid), its better to have this generic errormessage. Also, since the user can provide its own maskingHook, we never know, if there is any problematic code, which forces the leakage of information trough stacktrace. Therefore I thought this is the best way, to prevent this from happening. I can still adjust the errormessage, if you think its not clear enough :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current message seems perfect to me

gflachs and others added 2 commits March 3, 2025 14:49
* Adjusted getDbStatement function, so that its easier to understand
* Added comment for defaultMaskingHook, so that its regex and its intent is easier to understand
Copy link
Member

@raphael-theriault-swi raphael-theriault-swi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left some very minor suggestions but I'm happy with this either way !

Comment on lines 135 to 137
} catch (e) {
return 'Could not determine the query due to an error in masking or formatting';
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current message seems perfect to me

Comment on lines 141 to 143
return query
.replace(/\b\d+\b/g, '?')
.replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?');

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !

Comment on lines 115 to 137
let formattedQuery: string;
try {
if (maskStatement) {
formattedQuery = typeof query === 'string' ? query : query.sql;
return maskStatementHook(formattedQuery);
}

if (!format) {
formattedQuery = typeof query === 'string' ? query : query.sql;
} else if (typeof query === 'string') {
formattedQuery = values ? format(query, values) : query;
} else {
// According to https://github.com/mysqljs/mysql#performing-queries
// The values argument will override the values in the option object.
formattedQuery =
values || (query as QueryOptions).values
? format(query.sql, values || (query as QueryOptions).values)
: query.sql;
}
return formattedQuery;
} catch (e) {
return 'Could not determine the query due to an error in masking or formatting';
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same as QueryOptions cast as the previous code would have been fine here, or changing the signature of hasValues to take Query | QueryOptions and dropping the typeof check which I think is redundant. Not a big enough deal for me not to approve though.

@gflachs
Copy link
Contributor Author

gflachs commented Mar 4, 2025

Looks good, left some very minor suggestions but I'm happy with this either way !

Thanks :) Github still claims, that a Review from a Maintainer ist missing 🤔 Do I have to do soemthing else, or is it still the Bug mentioned by @pichlermarc ?

@raphael-theriault-swi
Copy link
Member

Approvers are still the ones doing the workflow approval and merging, so this is still waiting on Marc or someone else to merge yes.

@pichlermarc pichlermarc added the has:owner-approval Approved by Component Owner label Mar 5, 2025
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.07%. Comparing base (8d05684) to head (963c343).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2732      +/-   ##
==========================================
+ Coverage   89.05%   89.07%   +0.02%     
==========================================
  Files         188      188              
  Lines        9189     9197       +8     
  Branches     1891     1890       -1     
==========================================
+ Hits         8183     8192       +9     
+ Misses       1006     1005       -1     
Files with missing lines Coverage Δ
...ages/instrumentation-mysql2/src/instrumentation.ts 95.65% <100.00%> (ø)
packages/instrumentation-mysql2/src/utils.ts 94.54% <100.00%> (+3.05%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@pichlermarc
Copy link
Member

Hmm, looks like the lint step is failing.
@gflachs, please run npm run lint:fix and commit the results. Thanks! 🙂

@raphael-theriault-swi
Copy link
Member

Would appreciate the changes to hasValues at the same time since another commit is needed.

@gflachs
Copy link
Contributor Author

gflachs commented Mar 6, 2025

Would appreciate the changes to hasValues at the same time since another commit is needed.

No problem, I'll do both tomorrow morning (MEZ, UTC +1)

@gflachs
Copy link
Contributor Author

gflachs commented Mar 7, 2025

Hmm, looks like the lint step is failing. @gflachs, please run npm run lint:fix and commit the results. Thanks! 🙂

Done, it was complaining about an unnecessary type annotation in maskStatement : boolean = true, since it could already infer the type from the assigned value true.

@gflachs
Copy link
Contributor Author

gflachs commented Mar 7, 2025

Would appreciate the changes to hasValues at the same time since another commit is needed.

I added it in the current commit :)

@maryliag
Copy link
Contributor

Hi @gflachs , thank you for working on this.
I have concerns with the performance of the masking itself. Using regex and replacements can be very costly, specially in large queries (I have seen queries with over 20k characters). Can you provide some benchmark comparing with and without your changes, considering a heavy workload with big queries?

Also, since this could cause a performance issue, I don't think we should have maskStatement as default true

@gflachs
Copy link
Contributor Author

gflachs commented Mar 10, 2025

Hi @gflachs , thank you for working on this. I have concerns with the performance of the masking itself. Using regex and replacements can be very costly, specially in large queries (I have seen queries with over 20k characters). Can you provide some benchmark comparing with and without your changes, considering a heavy workload with big queries?

Also, since this could cause a performance issue, I don't think we should have maskStatement as default true

Hey @maryliag,
Thanks for your reply and for raising your concerns about the performance of the masking itself—I totally get where you’re coming from, though I don’t fully share the same worry since it’s always a tradeoff between security/privacy and performance.

I’ve put together a quick benchmark to dig into this:

const { performance } = require('perf_hooks');
const fs = require('fs');
const os = require('os');

function getSystemInfo() {
    let cpuLimit = "Unlimited";
    let memLimit = "Unlimited";

    try {
        const cpuData = fs.readFileSync('/sys/fs/cgroup/cpu.max', 'utf8').trim().split(' ');
        if (cpuData[0] !== 'max') {
            cpuLimit = (parseInt(cpuData[0]) / 1000) + " ms CPU time per 100ms";
        }
    } catch (err) {
        console.warn("⚠️  Could not retrieve CPU limit.");
    }

    try {
        const memData = fs.readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
        if (memData !== 'max') {
            memLimit = (parseInt(memData) / (1024 ** 2)).toFixed(2) + " MB";
        }
    } catch (err) {
        console.warn("⚠️  Could not retrieve memory limit.");
    }

    return {
        cpu: os.cpus()[0].model,
        cores: os.cpus().length,
        totalMemGB: (os.totalmem() / (1024 ** 3)).toFixed(2) + " GB",
        platform: os.platform(),
        release: os.release(),
        containerCPU: cpuLimit,
        containerMemory: memLimit
    };
}

function regexMasking(query) {
    return query.replace(/\b\d+\b/g, '?').replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '?');
}

function generateLargeSQLQuery(size) {
    let query = "SELECT id, name, age FROM users WHERE age = 25 AND name = 'John Doe' ";
    while (query.length < size) {
        query += "UNION SELECT id, 'Random Name', 30 FROM users WHERE id = " + Math.floor(Math.random() * 1000) + " ";
    }
    return query.slice(0, size);
}

function benchmarkFunction(fn, query, iterations = 3) {
    let totalTime = 0;
    for (let i = 0; i < iterations; i++) {
        let start = performance.now();
        fn(query);
        totalTime += (performance.now() - start);
    }
    return (totalTime / iterations).toFixed(2);
}

function benchmark() {
    const sizes = [100, 1000, 10000, 50000, 100000, 1000000];

    console.log("\n🔍 **System Specifications:**\n", JSON.stringify(getSystemInfo(), null, 2));
    console.log("\n🚀 **Benchmark: SQL Masking Performance**");
    console.log("| SQL Size | Regex (ms) |");
    console.log("|----------|------------|");

    for (let size of sizes) {
        const query = generateLargeSQLQuery(size);

        let regexTime = benchmarkFunction(regexMasking, query);

        console.log(`| ${size.toLocaleString()} | ${regexTime} |`);
    }
}

benchmark();

Benchmarking within a docker-container to simulate different server-enviroments:

docker run --rm -it --cpus="0.25" --memory="128m" benchmark_sql:latest

🔍 System Specifications:
{
"cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics",
"cores": 16,
"totalMemGB": "30.00 GB",
"platform": "linux",
"release": "6.13.5-200.fc41.x86_64",
"containerCPU": "25 ms CPU time per 100ms",
"containerMemory": "128.00 MB"
}

🚀 Benchmark: SQL Masking Performance

SQL Size Regex (ms)
100 0.07
1,000 0.01
10,000 0.08
50,000 1.11
100,000 0.41
1,000,000 7.56

docker run --rm -it --cpus="0.1" --memory="50m" benchmark_sql:latest

🔍 System Specifications:
{
"cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics",
"cores": 16,
"totalMemGB": "30.00 GB",
"platform": "linux",
"release": "6.13.5-200.fc41.x86_64",
"containerCPU": "10 ms CPU time per 100ms",
"containerMemory": "50.00 MB"
}

🚀 Benchmark: SQL Masking Performance

SQL Size Regex (ms)
100 0.06
1,000 0.01
10,000 0.07
50,000 0.48
100,000 0.38
1,000,000 67.37

docker run --rm -it --cpus="0.5" --memory="1g" benchmark_sql:latest

🔍 System Specifications:
{
"cpu": "AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics",
"cores": 16,
"totalMemGB": "30.00 GB",
"platform": "linux",
"release": "6.13.5-200.fc41.x86_64",
"containerCPU": "50 ms CPU time per 100ms",
"containerMemory": "1024.00 MB"
}

🚀 Benchmark: SQL Masking Performance

SQL Size Regex (ms)
100 0.03
1,000 0.01
10,000 0.10
50,000 0.56
100,000 0.43
1,000,000 5.67

My Take

I hear your concern about regex and replacements being costly, especially with large queries like the 20k+ character ones you mentioned. The benchmarks show that for smaller queries (up to 10k characters), the impact is negligible (<0.1 ms). For larger queries (e.g., 1M characters), it can climb to 5–67 ms depending on resources, though a 20k-character query would likely fall in the 0.1–0.2 ms range—still fast, but potentially noticeable under heavy load.

That said, I’d like to push back a bit on setting maskStatement to false by default. Data security and privacy are critical, and masking sensitive data (like numbers and strings) in outputs is a key safeguard. The performance hit, even in extreme cases, seems manageable in most realistic scenarios, and I’d argue the security benefits outweigh the cost. Turning it off by default could expose sensitive data unnecessarily, which feels riskier than a slight performance dip.

@maryliag
Copy link
Contributor

Thank you for performing the tests. Let me clarify my suggestion, I think the default should be:

  • not send this data, since I agree that the security risk is a top priority

then they have the option to:

  • enable to send data and with that we would have the mask of data

in the future we want to take advantage of the config file, so there people will be able to select one of the options:

  1. I'm already sending data without the values, so you can send as is
  2. I'm sending confidential data, so I want to remove everything and not mask, because my system cannot afford performance hits
  3. I'm sending confidential data and I want it to be masked, since my system can afford performance hits

we can even have other performance improvements, such as if query is up to X characters mask it, otherwise remove completely, or even cut to X characters and then mask it

My main concern at the moment is to make this behaviour the default, and a lot of people use ORMs and those tend to create really huge queries

@raphael-theriault-swi
Copy link
Member

I'd be curious to see the delta between the masking and sqlstring's format (which is currently used by default), given the latter also uses regex on top of some other non trivial logic. Don't disagree with the performance concerns but I have a feeling masking might not make it worse than it currently is.

@pichlermarc
Copy link
Member

Hi @maryliag @raphael-theriault-swi - just double-checking since I've passed by this PR multiple times on my ready-to-merge list:

is this PR ready to merge, or is some additional input/changes needed from @gflachs? (to match semconv guidance or other things) 🙂

@maryliag
Copy link
Contributor

It's not ready to merge.
Current behaviour: the value os maskStatement is set to default true and sanitization is being made by default.

Changes needed: the default value should be false and no information should be sent on that case. Then IF the value is true, sanitization should be done and the query text can be sent

@gflachs
Copy link
Contributor Author

gflachs commented Apr 28, 2025

Please read my comment for the purpose of the maskStatement flag :)

@maryliag
Copy link
Contributor

Turning it off by default could expose sensitive data unnecessarily

This is why I said that the it needs to be aligned, when the value is false, nothing it sent (so we don't potentially send sensitive information), and when the value is true, you can do the masking + sending the data.

@@ -48,6 +48,9 @@ You can set the following instrumentation options:
| ------- | ---- | ----------- |
| `responseHook` | `MySQL2InstrumentationExecutionResponseHook` (function) | Function for adding custom attributes from db response |
| `addSqlCommenterCommentToQueries` | `boolean` | If true, adds [sqlcommenter](https://github.com/open-telemetry/opentelemetry-sqlcommenter) specification compliant comment to queries with tracing context (default false). _NOTE: A comment will not be added to queries that already contain `--` or `/* ... */` in them, even if these are not actually part of comments_ |
| `maskStatement` | `boolean` | If true, masks the `db.statement` attribute in spans (default true) with the `maskStatementHook` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the performance concerns around applying regex to large queries, I think this should not be on by default. In semconv the recommendation is to not include values from parameterized queries by default, which should remove the need to parse the string query text. This should only apply to non-parameterized queries. Now that the database semconv is stable, a lot of our instrumentations need to be updated and the parameterized value stripping (or rather not-including) should be included there.

@dyladan
Copy link
Member

dyladan commented Jun 25, 2025

@gflachs I think this is quite close, just needs the off-by-default comment addressed

@david-luna
Copy link
Contributor

heads up @gflachs

A recent PR that changes the folder structure has been merged. This affects your PR so you need to sync with the main branch.

Happy to help if you need it :)

@maryliag
Copy link
Contributor

maryliag commented Jul 9, 2025

I updated the default value of maskStatement to false, and made the correspondent changes in the doc and PR description.
That was the only remaining change for this PR, and it should be good to merge now

@pichlermarc pichlermarc merged commit bcf32cd into open-telemetry:main Jul 10, 2025
24 checks passed
@dyladan dyladan mentioned this pull request Jul 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants