[ArrowResultSet] Support LIMIT and OFFSET #199
Force-pushed from 8a72bce to e6ae372
        ? results_->entryCount()
        : std::min(size_t(top_n_), results_->entryCount());
// Take all limit/offset parameters into account and produce a size, not an index.
const size_t first_entry =
Iteration through the whole result set might not be the best way to count the number of rows. Let's look at the different cases.
- Columnar conversion. You just need the total number of rows to convert. ResultSet::rowCount should provide the actual value; it also has caching and more efficient parallel algorithms for computing it.
- Row-wise conversion. Here we have a loop that uses logical positions in the buffer for the start and end indexes. That means we have to determine those positions in advance, and your solution should give the correct result. The problem is that we would run a serial scan over the whole table just to get the number of rows, and then another scan (now in parallel) to build the arrow buffers. So in all cases (even with no OFFSET and LIMIT expressions) we get an additional serial scan. What we can do here is add versioning for row-wise conversion: use the existing parallel_for loop when there is no limit or offset, and go single-threaded when those parameters are set. You can add offset and limit args to the convert_rowwise function for that; a sketch follows.
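A minimal sketch of that versioned row-wise conversion, assuming a tbb-style parallel_for and a hypothetical fetch_row() helper standing in for the real per-entry conversion (only convert_rowwise and parallel_for are named in this review; the signature and isRowAtEmpty usage are assumptions):

```cpp
#include <tbb/parallel_for.h>

// Sketch only: dispatch between the existing parallel conversion and a
// serial one when LIMIT/OFFSET are set, so the no-truncation path keeps
// its parallel_for and the truncated path avoids a preliminary row count.
void convert_rowwise(const ResultSet& rows, size_t limit, size_t offset) {
  const size_t entry_count = rows.entryCount();
  if (limit == 0 && offset == 0) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, entry_count),
                      [&](const tbb::blocked_range<size_t>& r) {
                        for (size_t i = r.begin(); i != r.end(); ++i) {
                          fetch_row(rows, i);  // hypothetical per-entry conversion
                        }
                      });
    return;
  }
  // Serial pass: count skipped and emitted rows on the fly.
  size_t seen = 0;
  size_t emitted = 0;
  for (size_t i = 0; i < entry_count; ++i) {
    if (limit != 0 && emitted == limit) {
      break;  // LIMIT satisfied
    }
    if (rows.isRowAtEmpty(i)) {
      continue;  // empty hash-table entry, not a row
    }
    if (seen++ < offset) {
      continue;  // still inside the OFFSET prefix
    }
    fetch_row(rows, i);
    ++emitted;
  }
}
```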
Force-pushed from 92c7119 to 7968a3b
This patch uses the right idea and should work for ResultSets with a single storage. But I think it doesn't handle multiple storages correctly for columnar conversion. Also, the tests don't include the multiple-storages case. Please work in this direction.
omniscidb/QueryEngine/ResultSet.cpp (outdated)
if (isTruncated()) {
  rows_to_fetch = std::min(rows_to_fetch, keep_first_);
}
retval.emplace_back(storage_->getUnderlyingBuffer() +
This code is incorrect for cases when drop_first_ + keep_first_ > storage_->binSearchRowCount(): it would reference past the storage buffer. Big offsets might force you to skip one or more storages, but your code doesn't do it.
I think it will only be incorrect in the case of drop_first_ >= storage_->binSearchRowCount(). If only the sum is larger, I expect data from the offset to the end of the table (I also check this in the OrderedLimitOverSizeSelection test).
The logic here is getting a little hard to understand. Maybe it's better to wrap it in some method; I'm not sure what the best approach is here.
storage_ doesn't hold the whole table, so it's legal to drop more rows than storage_ has.
Should this fail with an error or return an empty selection?
BTW, why would it be illegal? We should simply skip the first storage and use the next one.
I didn't say it was illegal; I said the opposite. I mentioned the missing storage skip in my first comment.
config().rs.enable_columnar_output = true;
config().rs.enable_lazy_fetch = false;
auto res = runSqlQuery(
    "SELECT * FROM test INNER JOIN join_table ON test.i=join_table.i offset 4 limit 2;",
For all non-groupby tests use test_chunked or run the tests twice, once on test and once on test_chunked. The chunked table holds the same data but is split into different fragments. This will produce ResultSets with multiple storages and improve your testing coverage.
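A minimal sketch of running the same check against both tables (runSqlQuery, ExecutorDeviceType, and the table names come from the tests in this review; the loop itself is an assumption):

```cpp
// Sketch: run each LIMIT/OFFSET test on the single-fragment table and on
// the chunked one, covering single- and multi-storage ResultSets.
for (const std::string table : {"test", "test_chunked"}) {
  auto res = runSqlQuery("SELECT * FROM " + table + " INNER JOIN join_table ON " +
                             table + ".i=join_table.i offset 4 limit 2;",
                         ExecutorDeviceType::CPU,
                         true);
  ASSERT_EQ(res.getRows()->rowCount(), size_t(2));
}
```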
if (is_truncated && seg_row_count <= offset) {
  continue;
}
if (is_truncated && seg_row_count > offset + limit) {
We don't need to look for the next valid entry if we have already converted enough rows. So please move this exit condition to the very beginning of the loop.
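A minimal sketch of the reordered loop (is_truncated, seg_row_count, offset, and limit come from the diff above; the emptiness check and conversion body are assumptions):

```cpp
for (size_t entry_idx = start_entry; entry_idx < end_entry; ++entry_idx) {
  // Exit first: once `limit` rows past the offset are converted, there is
  // no reason to keep scanning for the next valid entry.
  if (is_truncated && seg_row_count >= offset + limit) {
    break;
  }
  if (results->isRowAtEmpty(entry_idx)) {  // assumed emptiness check
    continue;
  }
  ++seg_row_count;
  if (is_truncated && seg_row_count <= offset) {
    continue;  // still skipping the OFFSET prefix
  }
  // ... convert entry_idx into the arrow segment buffers ...
}
```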
Agreed.
Force-pushed from 7968a3b to 2b6de3e
const auto col_count = results->colCount();
CHECK_EQ(value_seg.size(), col_count);
CHECK_EQ(null_bitmap_seg.size(), col_count);
const auto local_entry_count = end_entry - start_entry;
size_t seg_row_count = 0;
size_t limit = 0;
size_t offset = end_entry;
This default value is misleading. I know it is never used, but it still raises the question of why this value is used for initialization, with no good answer. I guess simply initializing the limit and offset with values from results would be better.
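A minimal sketch of the suggested initialization (getLimit() and getOffset() are mentioned later in this review; that they are the accessors on results is an assumption):

```cpp
// Sketch: seed the truncation parameters from the result set itself
// instead of a meaningless default like end_entry.
size_t limit = results->getLimit();
size_t offset = results->getOffset();
```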
omniscidb/QueryEngine/ResultSet.cpp (outdated)
auto curr_storage_rows_to_fetch = storage_->binSearchRowCount();
size_t curr_storage_start_row = drop_first_;
size_t curr_storage_end_row =
    keep_first_ != 0 ? keep_first_ : curr_storage_rows_to_fetch;
From the name of the variable, I'd guess it should hold the index one past the last row to fetch. But this is not the case when the limit is set.
omniscidb/QueryEngine/ResultSet.cpp (outdated)
  curr_storage_start_row -= curr_storage_rows_to_fetch;
} else {
  size_t to_fetch = std::min(curr_storage_rows_to_fetch - curr_storage_start_row,
                             curr_storage_end_row - curr_storage_start_row);
curr_storage_end_row holds the limit and curr_storage_start_row holds the offset, so you compute limit - offset, which is not what you need. Consider a storage with 4 rows, limit=2, and offset=2: you would fetch 0 rows then. BTW, it also means your tests don't have enough coverage here.
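A worked version of that example, as a sketch (names follow the diff; the corrected form illustrates the reviewer's point, not the final patch):

```cpp
// Storage with 4 rows, limit = 2, offset = 2: rows 2 and 3 should be fetched.
const size_t storage_rows = 4, limit = 2, offset = 2;
// Buggy form: min(storage_rows - offset, limit - offset) == min(2, 0) == 0.
// Correct form caps the rows remaining after the offset by the limit itself:
const size_t to_fetch = std::min(storage_rows - offset, limit);  // == 2
```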
Agreed, the error can be triggered only when the limit/offset falls in the first storage. If it falls in the next one, everything is OK.
omniscidb/QueryEngine/ResultSet.cpp (outdated)
    storage_uptr->getUnderlyingBuffer() + storage_uptr->getColOffInBytes(column_idx);
size_t row_count = storage_uptr->binSearchRowCount();
curr_storage_rows_to_fetch = row_count;
if (curr_storage_end_row == 0)
curr_storage_end_row == 0 means we have to stop, right? If we got all the required rows from the first storage, then curr_storage_end_row should be 0 here and we should stop.
Overall, it's hard to read the code because of the misleading variable names. curr_storage_rows_to_fetch doesn't hold the number of rows to fetch. curr_storage_start_row and curr_storage_end_row sound like iterators to something, but in fact they are not. expected_row_count is named and used as if it held the total number of rows to fetch, but it holds the total number of rows in the result set instead, and the (total_handled_rows >= expected_row_count) check is useless.
Could you please just use rows_to_fetch and rows_to_skip variables instead of the 5 variables you currently have? Initialize them like

```cpp
size_t rows_to_skip = getOffset();
size_t rows_to_fetch = getLimit() ? std::min(getLimit(), rowCount() - getOffset())
                                  : rowCount() - getOffset();
```

Then use them to compute the number of rows to skip/fetch from each storage and adjust them accordingly, stopping when rows_to_fetch is zero. I believe that would make the code more readable.
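A minimal sketch of that two-variable loop (getOffset(), getLimit(), rowCount(), and binSearchRowCount() are named in this review; the storages range and the appendStorageBuffer helper are hypothetical):

```cpp
// Sketch: consume the offset first, then the limit, one storage at a time.
size_t rows_to_skip = getOffset();
size_t rows_to_fetch = getLimit()
                           ? std::min(getLimit(), rowCount() - getOffset())
                           : rowCount() - getOffset();
for (const auto* storage : storages) {
  if (rows_to_fetch == 0) {
    break;  // LIMIT satisfied
  }
  const size_t storage_rows = storage->binSearchRowCount();
  if (rows_to_skip >= storage_rows) {
    rows_to_skip -= storage_rows;  // the whole storage precedes the offset
    continue;
  }
  const size_t fetched = std::min(rows_to_fetch, storage_rows - rows_to_skip);
  appendStorageBuffer(storage, /*first_row=*/rows_to_skip, /*row_count=*/fetched);
  rows_to_skip = 0;
  rows_to_fetch -= fetched;
}
```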
Also, please pay attention to your tests: try to trigger the current errors in tests before fixing them.
done
auto res = runSqlQuery(
    "SELECT * FROM test_chunked offset 3 limit 7;", ExecutorDeviceType::CPU, true);

ASSERT_EQ(res.getRows()->rowCount(), (int64_t)3);
I definitely see problems in the current columnar conversion implementation. That makes me think these tests don't actually give you the required result set layouts. Please add asserts for isChunkedZeroCopyColumnarConversionPossible(), getLimit(), and getOffset() to make sure you have what you expect.
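A minimal sketch of those asserts (the three accessors are named in the comment; that they live on the ResultSet behind res.getRows(), and the offset 3 / limit 7 values from the test above, are assumptions):

```cpp
// Sketch: confirm the test actually exercises the intended layout.
const auto rows = res.getRows();
ASSERT_TRUE(rows->isChunkedZeroCopyColumnarConversionPossible());
ASSERT_EQ(rows->getOffset(), size_t(3));
ASSERT_EQ(rows->getLimit(), size_t(7));
```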
done
Force-pushed from 9f9d04e to e68da90
Looks good! Thanks!
@@ -145,7 +145,7 @@ void compare_columns(const std::array<TYPE, len>& expected,
  const arrow::ArrayVector& chunks = actual->chunks();

  TYPE null_val = null_builder<TYPE>();

  auto total_row_count = 0;
You don't need to compute the number of rows; simply use actual->length().
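A minimal sketch of the simplification (actual is the arrow::ChunkedArray from the diff above; length() is the standard Arrow accessor that already sums the chunk lengths):

```cpp
// Sketch: no manual total_row_count accumulation needed.
ASSERT_EQ(actual->length(), static_cast<int64_t>(expected.size()));
```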
Agreed, done.
Force-pushed from 50a19bd to f7ec4e3
LGTM
Force-pushed from f7ec4e3 to 0c3c804
Force-pushed from 0c3c804 to 19562ca
Force-pushed from 19562ca to 2331bae
This commit takes the limit and offset parameters into account in the ArrowResultSet.
Also added several tests checking limit/offset with ascending/descending
ordering and joins to the ResultSetArrowConversion suite.
Resolves: #183
Signed-off-by: Dmitrii Makarenko [email protected]