In the SQLReader class, the getRowCount method is used to determine the number of results for a given query. My understanding is that this is done by rewriting the incoming query as a SQL COUNT query, delegating the operation to the database itself rather than iterating through the entire ResultSet, for performance reasons:
|
private int getRowCount(String query) { |
|
int fromIndex = query.toUpperCase().indexOf("FROM"); |
|
String countQuery = "SELECT COUNT(*) " + query.substring(fromIndex); |
|
try (ResultSet resultSet = executeQuery(countQuery)) { |
|
if (resultSet.next()) |
|
return resultSet.getInt(1); |
|
else |
|
return 0; |
|
} catch (SQLException e) { |
|
log.error(e.getMessage(), e); |
|
} |
|
return 0; |
|
} |
The resulting rowCount is then used in the populateDataframe method to pre-allocate the data structure:
|
private List<Map<String, String>> populateDataframe(int rowCount, ResultSet resultSet, String filterVariables) throws SQLException { |
|
ResultSetMetaData metaData = resultSet.getMetaData(); |
|
int columnCount = metaData.getColumnCount(); |
|
|
|
Collection<Map<String, String>> dataframe = onlyDistinct ? new HashSet<>(rowCount) : new ArrayList<>(rowCount); |
The issue is that the logic for rewriting the query is brittle. For example, the following query with an ORDER BY clause:
SELECT *
FROM patient
ORDER BY id
gets rewritten to the following incorrect query:
SELECT COUNT(*) FROM patient
ORDER BY id
which fails with:
column "patient.id" must appear in the GROUP BY clause or be used in an aggregate function
I see two potential approaches to fix this:
- Accept the performance hit and iterate through the
ResultSet to get the row count. This may reduce performance, but without benchmarking it is unclear how significant the impact would be. The upside is that this guarantees the row count can always be retrieved for any valid user query.
- Invest effort in improving the query rewriting logic.
I lean toward option 1, as fully handling all edge cases in query rewriting would be complex and not a current priority. If performance issues arise in real usage, we can revisit this in the future. For now, option 1 seems “good enough.”
WDYT @marioscrock?
In the
SQLReaderclass, thegetRowCountmethod is used to determine the number of results for a given query. My understanding is that this is done by rewriting the incoming query as a SQLCOUNTquery, delegating the operation to the database itself rather than iterating through the entireResultSet, for performance reasons:mapping-template/src/main/java/com/cefriel/template/io/sql/SQLReader.java
Lines 252 to 264 in 4e6129f
The resulting
rowCountis then used in thepopulateDataframemethod to pre-allocate the data structure:mapping-template/src/main/java/com/cefriel/template/io/sql/SQLReader.java
Lines 124 to 128 in 4e6129f
The issue is that the logic for rewriting the query is brittle. For example, the following query with an
ORDER BYclause:gets rewritten to the following incorrect query:
which fails with:
column "patient.id" must appear in the GROUP BY clause or be used in an aggregate functionI see two potential approaches to fix this:
ResultSetto get the row count. This may reduce performance, but without benchmarking it is unclear how significant the impact would be. The upside is that this guarantees the row count can always be retrieved for any valid user query.I lean toward option 1, as fully handling all edge cases in query rewriting would be complex and not a current priority. If performance issues arise in real usage, we can revisit this in the future. For now, option 1 seems “good enough.”
WDYT @marioscrock?