Skip to content

Incorrect SQL query rewriting #59

@AuPath

Description

@AuPath

In the SQLReader class, the getRowCount method is used to determine the number of results for a given query. My understanding is that this is done by rewriting the incoming query as a SQL COUNT query, delegating the operation to the database itself rather than iterating through the entire ResultSet, for performance reasons:

private int getRowCount(String query) {
int fromIndex = query.toUpperCase().indexOf("FROM");
String countQuery = "SELECT COUNT(*) " + query.substring(fromIndex);
try (ResultSet resultSet = executeQuery(countQuery)) {
if (resultSet.next())
return resultSet.getInt(1);
else
return 0;
} catch (SQLException e) {
log.error(e.getMessage(), e);
}
return 0;
}

The resulting rowCount is then used in the populateDataframe method to pre-allocate the data structure:

private List<Map<String, String>> populateDataframe(int rowCount, ResultSet resultSet, String filterVariables) throws SQLException {
ResultSetMetaData metaData = resultSet.getMetaData();
int columnCount = metaData.getColumnCount();
Collection<Map<String, String>> dataframe = onlyDistinct ? new HashSet<>(rowCount) : new ArrayList<>(rowCount);

The issue is that the logic for rewriting the query is brittle. For example, the following query with an ORDER BY clause:

SELECT *
FROM patient
ORDER BY id

gets rewritten to the following incorrect query:

SELECT COUNT(*) FROM patient
ORDER BY id

which fails with:
column "patient.id" must appear in the GROUP BY clause or be used in an aggregate function

I see two potential approaches to fix this:

  1. Accept the performance hit and iterate through the ResultSet to get the row count. This may reduce performance, but without benchmarking it is unclear how significant the impact would be. The upside is that this guarantees the row count can always be retrieved for any valid user query.
  2. Invest effort in improving the query rewriting logic.

I lean toward option 1, as fully handling all edge cases in query rewriting would be complex and not a current priority. If performance issues arise in real usage, we can revisit this in the future. For now, option 1 seems “good enough.”

WDYT @marioscrock?

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions