Incomplete #1

@missinglink

This lib is currently incomplete, although it is close to being ready to publish.

This lib stands to replace both pelias/dbclient and the older pelias/esclient modules.

The key points of differentiation from other streaming elasticsearch indexers are:

  • batching via the bulk API
  • retry failed batches
  • flood control: backpressure propagates from elasticsearch back to the upstream source (most important)
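
A minimal sketch of how batching and backpressure can fit together in a Node.js stream. It assumes a hypothetical `flush(batch, callback)` function standing in for a bulk request; `createBatchSink` and its options are illustrative names, not this lib's API:

```javascript
const { Writable } = require('stream');

// Hypothetical batching sink. `flush(batch, cb)` is an assumed stand-in
// for sending one bulk request and calling `cb` when it settles.
function createBatchSink({ batchSize = 500, flush }) {
  let batch = [];
  return new Writable({
    objectMode: true,
    highWaterMark: batchSize,
    write(doc, _encoding, done) {
      batch.push(doc);
      if (batch.length < batchSize) return done();
      const full = batch;
      batch = [];
      // Holding `done` until the flush settles is what pushes back on
      // upstream: further writes are buffered and, once the buffer
      // reaches highWaterMark, write() returns false.
      flush(full, done);
    },
    final(done) {
      // Send whatever is left so a partial batch is not dropped.
      if (batch.length === 0) return done();
      const rest = batch;
      batch = [];
      flush(rest, done);
    }
  });
}
```

A source connected with `pipe()` pauses automatically while the sink is saturated, so a slow bulk request slows the whole pipeline instead of piling up rejected batches.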

Other libraries are not well suited to large datasets containing complex properties (such as country-sized polygons), which take some time to process on the Java side. As a result, naive indexers cause elasticsearch to fill up its bulk indexing threadpool, which results in batches being rejected and data being lost.
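
One way to avoid that data loss is to inspect the bulk response and resend only the items that were rejected. The sketch below assumes rejected items are reported with status 429 and that response items come back in request order; `rejectedDocs` is a hypothetical helper, not part of this lib:

```javascript
// Hypothetical helper: returns the documents whose bulk action was rejected.
// Assumes `response` has the shape of a bulk API response, where
// `items[i]` corresponds to `docs[i]` and is keyed by its action type.
function rejectedDocs(docs, response) {
  if (!response.errors) return []; // nothing failed, nothing to retry
  return docs.filter((_doc, i) => {
    const result = Object.values(response.items[i])[0];
    return result.status === 429; // assumed status for a full threadpool queue
  });
}

// Made-up example data: the second document was rejected.
const docs = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const response = {
  errors: true,
  items: [
    { index: { _id: 'a', status: 201 } },
    { index: { _id: 'b', status: 429 } },
    { index: { _id: 'c', status: 201 } }
  ]
};
const retry = rejectedDocs(docs, response);
```

Here `retry` holds only the document with id `b`, so the next attempt resends one item rather than the whole batch.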

What's left to do:

  • Write readme and explain how concurrency, retries and the cli work
  • Rethink and test the concurrency control mechanism to achieve optimum load
  • Refactor some of the code to emit events
  • Write a stats module which captures Transaction events and emits stat digests

Module Goals:

☑ batched writes
☑ adjustable batch size
☑ partially retry failed batches
☑ backpressure (flood control)
☑ concurrency setting, better highwatermark
☐ actionable error reporting
☑ elasticsearch client injectable
☑ well tested via unit tests & in production
☑ bin file, input streams from cli with id, type mapper
☑ minimal dependencies, dependency injection
☑ usable outside pelias project & not strictly tied to pelias config
☑ ensure no data loss due to ES errors or failure to flush batches
☐ healthcheck via threadpool status
☐ compatibility with different nodejs stream versions
☑ better logging - via winston
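
The concurrency setting in the goals above can be pictured as a cap on the number of bulk requests in flight at once. A minimal sketch, assuming each task receives a `release` callback to call when its request settles; `createLimiter` is a hypothetical name, not this lib's API:

```javascript
// Hypothetical limiter: runs at most `maxInFlight` tasks at the same time.
// Each task is a function that receives a `release` callback and must
// call it exactly once when its work (e.g. a bulk request) has settled.
function createLimiter(maxInFlight) {
  let active = 0;
  const waiting = [];

  function next() {
    if (active >= maxInFlight || waiting.length === 0) return;
    active++;
    const task = waiting.shift();
    task(() => {
      active--;
      next(); // a slot is free, start the next queued task if any
    });
  }

  return function limit(task) {
    waiting.push(task);
    next();
  };
}
```

With `createLimiter(2)`, a third task stays queued until one of the first two calls `release`, which keeps the number of open bulk requests bounded regardless of how fast batches are produced.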

Issues with dbclient:

☑ badly named, doesn't describe purpose
☑ not abstracted from pelias
☑ strict dependency on other pelias modules
☑ not generally useful to 3rd parties
☑ difficult for 3rd party developers to contribute
☑ untidy code
☑ not fully unit tested
☐ not well documented

Duplication across modules (causing confusion):

- https://github.com/geopipes/elasticsearch-backend
- https://github.com/pelias/esclient
- https://github.com/pelias/dbclient

Dependants:

- dat-elasticsearch-upload
- pelias-geonames
- pelias-openaddresses
- pelias-openstreetmap

Similar projects / implementations:

- https://github.com/hmalphettes/elasticsearch-streams
- https://www.npmjs.com/package/elasticstream
- https://github.com/simianhacker/bunyan-elasticsearch/blob/master/index.js

running unit tests

$> npm test

running integration tests

$> npm run integration
