Description
The current experimental fields-exporter in SolrWayback issues standard cursorMark
requests against the underlying Solr.
Assume we have 100 Solr nodes and page 1000 documents at a time (a reasonable approximation of the index at the Royal Danish Library, which has 143 nodes).
In a classic Solr Cloud setup, each cursorMark request for 1000 documents causes processing at every Solr node. Each node returns up to 1000 document IDs and sort values, but on average only 10 of them are used (1000 documents / 100 nodes = 10 documents/node), so most of the processing for each request is wasted. A local test gave an export with high variance in speed, from 12 to 130 records/second, and high load across the whole Solr Cloud.
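The arithmetic above can be written out as a small sketch. The function name and its return shape are made up for illustration, and it assumes matching documents are spread evenly across the nodes:

```python
def distributed_page_cost(nodes: int, rows: int):
    """Estimate the work behind one distributed cursorMark page.

    Assumes every node returns `rows` candidates and that the final page
    draws evenly from all nodes.
    """
    candidates = nodes * rows           # ID + sort-value pairs sent to the coordinator
    used_per_node = rows / nodes        # candidates per node that end up in the page
    used_fraction = rows / candidates   # share of the collected candidates actually used
    return candidates, used_per_node, used_fraction


candidates, used_per_node, used_fraction = distributed_page_cost(nodes=100, rows=1000)
print(candidates, used_per_node, used_fraction)
```

With 100 nodes and 1000 rows this gives 100,000 candidates per page, of which 1% are used.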
If the Solr nodes were queried sequentially, each cursorMark request would involve only a single Solr node and all 1000 documents returned would be used. This was tested against the setup at Netarkivet, where the export speed was ~1200 documents/second with load localized to a single Solr node. It would be quite feasible to query 1/4 of all nodes in parallel, which would increase export speed to ~30,000 documents/second (25 nodes × ~1200 documents/second) in our not-so-hypothetical setup.
There is clearly much to gain by rewiring the SolrWayback export code. Two possibilities come to mind:
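A minimal sketch of paging through a single shard with cursorMark follows. It is not the SolrWayback implementation: `fetch` is a hypothetical callable that sends the given parameters to Solr and returns the parsed JSON response, and the uniqueKey field is assumed to be `id`. The `shards`, `cursorMark`, `sort`, `rows`, `fl` and `q` request parameters and the `nextCursorMark` response key are standard Solr.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical transport: takes Solr request parameters, returns the parsed JSON response.
Fetch = Callable[[Dict[str, str]], dict]


def export_shard(fetch: Fetch, shard: str, query: str, fields: List[str],
                 rows: int = 1000, unique_key: str = "id") -> Iterator[dict]:
    """Yield every document matching `query` from one shard, page by page."""
    cursor = "*"  # Solr's value for "start of the result set"
    while True:
        response = fetch({
            "q": query,
            "fl": ",".join(fields),
            "rows": str(rows),
            "sort": f"{unique_key} asc",  # cursorMark requires the uniqueKey in the sort
            "shards": shard,              # restrict the request to this one shard
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means there is nothing more to read
            return
        cursor = next_cursor
```

Because only one shard takes part in each request, every document ID it returns ends up in the export.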
External shard requests
Introduce an endpoint listing the shards in the Solr setup and add a shardID
parameter to the export endpoint.
- Pro: Very easy to implement in SolrWayback
- Con: Puts the burden of merging (and grouping, if that is required) the results on the caller
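To illustrate the merging burden on the caller, here is a rough sketch of combining per-shard exports into one sorted stream. `merge_shard_exports` and `sort_key` are hypothetical names, and the sketch assumes each shard's export is already sorted ascending by the same key:

```python
import heapq
from typing import Callable, Iterable, Iterator


def merge_shard_exports(shard_streams: Iterable[Iterable[dict]],
                        sort_key: Callable[[dict], object]) -> Iterator[dict]:
    """Lazily merge per-shard document streams into one globally sorted stream.

    Each input stream must already be sorted ascending by `sort_key`.
    """
    return heapq.merge(*shard_streams, key=sort_key)


shard_a = [{"id": "a1"}, {"id": "c7"}]
shard_b = [{"id": "b3"}, {"id": "d2"}]
for doc in merge_shard_exports([shard_a, shard_b], sort_key=lambda d: d["id"]):
    print(doc["id"])
```

`heapq.merge` holds only one pending document per stream in memory, so it also works for exports that are streamed rather than loaded up front. Grouping, if needed, would still have to be layered on top by the caller.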
Internal shard requests
Add parallel explicit shard requests to the existing code, taking care to support grouping. This would essentially mimic Solr streaming, with the crucial difference that export of stored fields is possible.
- Pro: Makes field export performant without any extra work for the end user
- Con: Not trivial to implement, needs consideration of Solr Cloud topology
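The parallel part of the internal approach might look roughly like the sketch below. It is written in Python for brevity and is not a proposal for the actual SolrWayback code. `export_one` is a hypothetical callable that returns an iterator over one shard's documents, and the output is interleaved in arrival order, so sorting, grouping and topology handling are left out.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator

_DONE = object()  # sentinel a worker puts on the queue when its shard is finished


def export_shards_parallel(shards: Iterable[str],
                           export_one: Callable[[str], Iterable[dict]],
                           max_parallel: int = 4,
                           buffer_size: int = 10_000) -> Iterator[dict]:
    """Export several shards concurrently, yielding documents as they arrive.

    The caller is expected to consume the stream fully; abandoning it early
    would leave workers blocked on a full buffer.
    """
    shards = list(shards)
    out: queue.Queue = queue.Queue(maxsize=buffer_size)  # bounded, so memory use stays capped

    def worker(shard: str) -> None:
        try:
            for doc in export_one(shard):
                out.put(doc)
        finally:
            out.put(_DONE)  # always signal completion, even after an error

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(worker, shard) for shard in shards]
        remaining = len(shards)
        while remaining:
            item = out.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
        for future in futures:
            future.result()  # re-raise any exception from a worker
```

`max_parallel` corresponds to the "1/4 of all nodes" knob mentioned earlier: it caps how many shards are queried at once, independently of how many shards exist.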