
Improve export speed #329

@tokee

The current experimental fields-exporter in SolrWayback issues standard cursorMark requests against the underlying Solr Cloud.

Assume we have 100 Solr nodes and page 1000 documents at a time (a reasonable approximation of the index at the Royal Danish Library which has 143 nodes).

In a classic Solr Cloud setup, each cursorMark request for 1000 documents causes processing at every Solr node: each node returns 1000 document IDs and sort values, but on average only 10 of them are used (1000 documents / 100 nodes = 10 documents/node). Most of the processing for each request is therefore wasted. A local test showed highly variable export speed, ranging from 12 to 130 records/second, with high load across the whole Solr Cloud.
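The arithmetic above can be spelled out in a few lines. This is only a back-of-the-envelope sketch: the function name `cursor_page_cost` is made up for illustration, and it assumes documents are spread evenly across the nodes.

```python
def cursor_page_cost(nodes: int, page_size: int) -> dict:
    """Estimate the work behind one distributed cursorMark page.

    Assumes an even spread of documents and that every node returns a
    full page of candidates (IDs + sort values) for merging.
    """
    candidates = nodes * page_size             # entries gathered from all nodes
    used_per_node = page_size / nodes          # average contribution to the merged page
    wasted_fraction = 1 - page_size / candidates
    return {
        "candidates": candidates,
        "used_per_node": used_per_node,
        "wasted_fraction": wasted_fraction,
    }


cost = cursor_page_cost(nodes=100, page_size=1000)
print(cost)
```

With 100 nodes and a page size of 1000, the nodes collectively produce 100,000 candidates to deliver 1000 documents, so roughly 99% of the candidates are discarded.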

If the Solr nodes were queried sequentially, each cursorMark request would involve only a single Solr node and all 1000 documents returned would be used. This was tested against the setup at Netarkivet, where the export speed was ~1200 documents/second with load localized to a single Solr node. It would be quite feasible to query 1/4 of all nodes in parallel (25 of the 100 nodes), which would increase export speed to ~30,000 documents/second (25 × 1200) in our not-so-hypothetical setup.
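A minimal sketch of what paging through a single shard could look like. It is not SolrWayback code: `export_shard` and the injected `fetch` callable are hypothetical, and the uniqueKey is assumed to be `id`. The Solr parameters themselves (`cursorMark`, `sort`, `rows`, `shards`) are standard.

```python
from typing import Callable, Dict, Iterator

# Hypothetical transport: takes Solr query parameters, returns the parsed JSON response.
Fetch = Callable[[Dict[str, str]], dict]


def export_shard(fetch: Fetch, shard: str, query: str, fields: str,
                 rows: int = 1000) -> Iterator[dict]:
    """Page through one shard with cursorMark, yielding its documents."""
    cursor = "*"  # Solr's start value for cursorMark
    while True:
        response = fetch({
            "q": query,
            "fl": fields,
            "rows": str(rows),
            "sort": "id asc",      # cursorMark needs a sort that includes the uniqueKey (assumed: id)
            "cursorMark": cursor,
            "shards": shard,       # restrict the request to this single shard
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means the result set is exhausted
            return
        cursor = next_cursor
```

Because only one shard is involved, every document in each page is used, and the load stays on the node hosting that shard.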

There is clearly much to gain by rewiring the SolrWayback export code. Two possibilities come to mind:

External shard requests

Introduce an endpoint listing the shards in the Solr setup and add a shardID parameter to the export endpoint.

  • Pro: Very easy to implement in SolrWayback
  • Con: Puts the burden of merging (and grouping, if required) the results on the caller
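To give an idea of the merging burden on the caller, here is a rough sketch. It assumes each per-shard export is already sorted on the same field (here `id`), and `merge_shard_exports` is a hypothetical helper, not an existing endpoint or client. Grouping is not handled.

```python
import heapq
from typing import Iterable, Iterator


def merge_shard_exports(shard_streams: Iterable[Iterable[dict]],
                        sort_field: str = "id") -> Iterator[dict]:
    """Lazily merge per-shard exports that are each sorted on sort_field."""
    # heapq.merge keeps only one pending document per stream in memory.
    return heapq.merge(*shard_streams, key=lambda doc: doc[sort_field])


# Two already-sorted per-shard results, as a caller might receive them.
shard_a = [{"id": "a"}, {"id": "d"}]
shard_b = [{"id": "b"}, {"id": "c"}]
merged_ids = [doc["id"] for doc in merge_shard_exports([shard_a, shard_b])]
print(merged_ids)
```

If the caller does not need a global order, the per-shard results can simply be concatenated and no merge is required.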

Internal shard requests

Add parallel, explicit shard requests to the existing code, taking care to support grouping. This essentially mimics Solr streaming, with the crucial difference that export of stored fields is possible.

  • Pro: Makes field export performant without any extra work for the end user
  • Con: Not trivial to implement, needs consideration of Solr Cloud topology
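A rough sketch of the parallel part, under simplifying assumptions: `export_shards_parallel` and the `export_one` callable (something that pages through a single shard) are hypothetical, results are buffered in memory rather than streamed, and neither grouping nor Solr Cloud topology discovery is addressed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def export_shards_parallel(shards: List[str],
                           export_one: Callable[[str], Iterable[dict]],
                           max_parallel: int) -> Dict[str, List[dict]]:
    """Export every shard, with at most max_parallel shards in flight."""
    def drain(shard: str) -> List[dict]:
        # Consume the shard's paging loop inside the worker thread.
        return list(export_one(shard))

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(drain, shards)  # results are yielded in input order
        return dict(zip(shards, results))
```

Capping `max_parallel` at a fraction of the node count (for example 25 of 100) would keep the extra load bounded while still giving a large speed-up over the current distributed requests.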
