Improve export speed

The current experimental fields-exporter in SolrWayback issues standard `cursorMark` requests against the underlying Solr.

Assume we have 100 Solr nodes and page through 1000 documents at a time (a reasonable approximation of the index at the Royal Danish Library, which has 143 nodes).

In a classic Solr Cloud setup, each `cursorMark` request for 1000 documents causes processing on every Solr node: each node returns up to 1000 document IDs and sort values, but on average only 10 of them are used (1000 documents / 100 nodes = 10 documents/node). Most of the processing done for each request is therefore wasted. A local test gave an export with high variance in speed, from 12 to 130 records/second, and high load across the Solr Cloud.

If the Solr nodes were queried sequentially, each `cursorMark` request would involve only a single Solr node and all 1000 documents returned would be used. This was tested against the setup at [Netarkivet](https://www.kb.dk/find-materiale/samlinger/netarkivet), where the export speed was ~1200 documents/second with load localized to a single Solr node. It would be quite feasible to query 1/4 of all nodes in parallel, which would increase export speed to ~30,000 documents/second in our not-so-hypothetical setup (25 of the 100 nodes at ~1200 documents/second each).
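The arithmetic behind these numbers can be written out as a small sketch. The function names are made up for illustration, and the throughput estimate assumes that export speed scales linearly with the number of nodes queried in parallel, which has not been measured.

```python
def distributed_page_cost(nodes: int, page_size: int) -> dict:
    """Per-request cost when every node answers every cursorMark page."""
    fetched = nodes * page_size  # ID + sort-value entries collected across all nodes
    used = page_size             # entries that end up in the returned page
    return {
        "fetched": fetched,
        "used_per_node": page_size / nodes,
        "wasted_fraction": 1 - used / fetched,
    }


def parallel_export_rate(per_node_rate: float, nodes: int, fraction: float) -> float:
    """Estimated documents/second when a fraction of the nodes is exported in parallel.

    Assumes linear scaling, i.e. no shared bottleneck between the nodes.
    """
    return per_node_rate * nodes * fraction


cost = distributed_page_cost(nodes=100, page_size=1000)
rate = parallel_export_rate(per_node_rate=1200, nodes=100, fraction=0.25)
```

With the figures from this issue, `cost["used_per_node"]` is 10 documents and `rate` is 30,000 documents/second.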

There is clearly much to gain by rewiring the SolrWayback export code. Two possibilities come to mind:

## External shard requests
Introduce an endpoint listing the shards in the Solr setup and add a `shardID` parameter to the export endpoint.

* Pro: Very easy to implement in SolrWayback
* Con: Puts the burden of merging the results (and grouping them, if required) on the caller
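To give an idea of what the caller would have to do, here is a minimal sketch of merging and grouping per-shard exports. It assumes each shard export arrives already sorted on the same field; the helper names and the `url_norm` sample data are hypothetical and not part of any existing SolrWayback API.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator


def merge_shard_exports(shard_streams: Iterable[Iterable[dict]], sort_field: str) -> Iterator[dict]:
    """Lazily merge per-shard document streams that are each sorted by sort_field."""
    return heapq.merge(*shard_streams, key=lambda doc: doc[sort_field])


def first_per_group(docs: Iterable[dict], group_field: str) -> Iterator[dict]:
    """Keep the first document from each run of equal group_field values.

    Only correct when docs is sorted (or at least clustered) on group_field.
    """
    for _, group in groupby(docs, key=lambda doc: doc[group_field]):
        yield next(group)


# Hypothetical exports from two shards, each sorted on url_norm
shard_a = [{"id": "a1", "url_norm": "http://a/"}, {"id": "a2", "url_norm": "http://c/"}]
shard_b = [{"id": "b1", "url_norm": "http://a/"}, {"id": "b2", "url_norm": "http://b/"}]

merged = list(merge_shard_exports([shard_a, shard_b], "url_norm"))
grouped = list(first_per_group(merged, "url_norm"))
```

`heapq.merge` holds only one pending document per shard in memory, so the merge itself is cheap. The cost is that the caller has to request the same sort order from every shard.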

## Internal shard requests
Add parallel explicit shard requests to the existing code, taking care to support grouping. Essentially this mimics [Solr streaming](https://solr.apache.org/guide/8_8/streaming-expressions.html), with the crucial difference that stored fields can be exported.

* Pro: Makes field export performant without any extra work for the end user
* Con: Not trivial to implement, needs consideration of Solr Cloud topology
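A rough sketch of the idea: page through each shard separately with `cursorMark` and run several shards in parallel. This is not the SolrWayback implementation. `fetch` is a hypothetical stand-in for an HTTP call to Solr's select handler, the uniqueKey is assumed to be `id`, and grouping and topology discovery are left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List


def export_shard(fetch: Callable[[dict], dict], shard: str, query: str,
                 rows: int = 1000) -> Iterator[dict]:
    """Page through a single shard with cursorMark until the cursor stops advancing."""
    cursor = "*"  # "*" starts a new cursor
    while True:
        response = fetch({
            "q": query,
            "rows": rows,
            "sort": "id asc",      # cursorMark needs the uniqueKey in the sort; "id" is assumed
            "cursorMark": cursor,
            "shards": shard,       # restrict the request to this one shard
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means the shard is exhausted
            return
        cursor = next_cursor


def export_shards(fetch: Callable[[dict], dict], shards: Iterable[str], query: str,
                  rows: int = 1000, workers: int = 4) -> Dict[str, List[dict]]:
    """Export several shards in parallel, at most `workers` at a time.

    Results are collected in memory per shard to keep the sketch short;
    a real export would stream them instead.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            shard: pool.submit(lambda s=shard: list(export_shard(fetch, s, query, rows)))
            for shard in shards
        }
        return {shard: future.result() for shard, future in futures.items()}
```

Setting `workers` to roughly a quarter of the shard count would correspond to the "1/4 of all nodes" scenario above. The per-shard results still have to be merged, and grouped where required, before they are returned to the end user.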

Improve export speed #329
