Incorrect SQL query rewriting #59

In the `SQLReader` class, the `getRowCount` method determines the number of results for a given query. As far as I can tell, it does this by rewriting the incoming query as a SQL `COUNT` query, so that the database computes the count and the code does not have to iterate through the entire `ResultSet`, for performance reasons:

[https://github.com/cefriel/mapping-template/blob/4e6129f1a91e94a2c080224230a31c607ca26c02/src/main/java/com/cefriel/template/io/sql/SQLReader.java#L252-L264](https://github.com/cefriel/mapping-template/blob/4e6129f1a91e94a2c080224230a31c607ca26c02/src/main/java/com/cefriel/template/io/sql/SQLReader.java#L252-L264)

The resulting `rowCount` is then used in the `populateDataframe` method to pre-allocate the data structure:

[https://github.com/cefriel/mapping-template/blob/4e6129f1a91e94a2c080224230a31c607ca26c02/src/main/java/com/cefriel/template/io/sql/SQLReader.java#L124-L128](https://github.com/cefriel/mapping-template/blob/4e6129f1a91e94a2c080224230a31c607ca26c02/src/main/java/com/cefriel/template/io/sql/SQLReader.java#L124-L128)

The issue is that the rewriting logic is brittle: it replaces everything before the first `FROM` with `SELECT COUNT(*)` and keeps the rest of the query verbatim. For example, the following query with an `ORDER BY` clause:

```sql
SELECT *
FROM patient
ORDER BY id
```

gets rewritten to the following **incorrect** query:

```sql
SELECT COUNT(*) FROM patient
ORDER BY id
```

which fails with:
`column "patient.id" must appear in the GROUP BY clause or be used in an aggregate function`
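The failure can be reproduced without a database by applying the same string manipulation that `getRowCount` uses. The sketch below copies the two rewriting lines into a hypothetical `CountRewriteDemo` class; the `valid_from` query is an additional made-up case (not from a real report) showing that `indexOf("FROM")` can also match inside a column name.

```java
class CountRewriteDemo {
    // Same rewrite as SQLReader.getRowCount: keep everything from the
    // first occurrence of "FROM" onward and prepend SELECT COUNT(*).
    static String rewrite(String query) {
        int fromIndex = query.toUpperCase().indexOf("FROM");
        return "SELECT COUNT(*) " + query.substring(fromIndex);
    }

    public static void main(String[] args) {
        // The ORDER BY clause is carried over unchanged into the COUNT query.
        System.out.println(rewrite("SELECT *\nFROM patient\nORDER BY id"));
        // Hypothetical column name containing "from": the match lands inside it.
        System.out.println(rewrite("SELECT valid_from FROM patient"));
    }
}
```

The second call yields `SELECT COUNT(*) from FROM patient`, which is not valid SQL at all.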

I see two potential approaches to fix this:

1. **Accept the performance hit and iterate through the `ResultSet` to get the row count.** This may reduce performance, but without benchmarking it is unclear how significant the impact would be. The upside is that this guarantees the row count can always be retrieved for any valid user query.
2. **Invest effort in improving the query rewriting logic.**
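
For option 2, a cheaper variant than parsing the query would be to wrap it unchanged as a derived table and count over that. This is only a minimal sketch, not what `SQLReader` does today; `CountWrapSketch`, `toCountQuery`, and the alias `count_src` are hypothetical names. It assumes the database accepts `ORDER BY` inside a derived table (PostgreSQL and MySQL do; SQL Server rejects it unless `TOP` or `OFFSET` is present), and it does not cover every edge case either.

```java
class CountWrapSketch {
    // Wrap the user's query as a derived table instead of editing its text.
    static String toCountQuery(String query) {
        String trimmed = query.trim();
        // A trailing semicolon would terminate the statement inside the parentheses.
        while (trimmed.endsWith(";")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        // Newlines around the inner query keep a trailing "-- comment"
        // from swallowing the closing parenthesis.
        return "SELECT COUNT(*) FROM (\n" + trimmed + "\n) AS count_src";
    }

    public static void main(String[] args) {
        System.out.println(toCountQuery("SELECT *\nFROM patient\nORDER BY id;"));
    }
}
```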

I lean toward option 1, as fully handling all edge cases in query rewriting would be complex and not a current priority. If performance issues arise in real usage, we can revisit this in the future. For now, option 1 seems “good enough.”
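
Option 1 could look roughly like the sketch below. Because a default JDBC `ResultSet` is forward-only, counting first and populating afterwards would need a second query or a scrollable cursor, so the sketch fills the list in a single pass and lets `ArrayList` grow as needed instead of pre-sizing it. `ResultSetReaderSketch` and `readAll` are hypothetical names, and the `onlyDistinct` and `filterVariables` handling of the real `populateDataframe` is left out.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ResultSetReaderSketch {
    // Read every row in one forward pass; no COUNT query is issued.
    static List<Map<String, String>> readAll(ResultSet resultSet) throws SQLException {
        ResultSetMetaData metaData = resultSet.getMetaData();
        int columnCount = metaData.getColumnCount();

        // No initial capacity: the list grows as rows arrive.
        List<Map<String, String>> rows = new ArrayList<>();
        while (resultSet.next()) {
            Map<String, String> row = new LinkedHashMap<>();
            // JDBC column indexes are 1-based.
            for (int i = 1; i <= columnCount; i++) {
                row.put(metaData.getColumnLabel(i), resultSet.getString(i));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

The row count is then simply `rows.size()`, available for any query the database accepts.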

WDYT @marioscrock?

    private int getRowCount(String query) {
        int fromIndex = query.toUpperCase().indexOf("FROM");
        String countQuery = "SELECT COUNT(*) " + query.substring(fromIndex);
        try (ResultSet resultSet = executeQuery(countQuery)) {
            if (resultSet.next())
                return resultSet.getInt(1);
            else
                return 0;
        } catch (SQLException e) {
            log.error(e.getMessage(), e);
        }
        return 0;
    }

    private List<Map<String, String>> populateDataframe(int rowCount, ResultSet resultSet, String filterVariables) throws SQLException {
        ResultSetMetaData metaData = resultSet.getMetaData();
        int columnCount = metaData.getColumnCount();

        Collection<Map<String, String>> dataframe = onlyDistinct ? new HashSet<>(rowCount) : new ArrayList<>(rowCount);
