Add protocol for iterating keys in a store #29

Merged Nov 19, 2019 (11 commits)

@csm csm commented Nov 15, 2019

Introduces a new protocol PKeyIterable, which defines the method -keys that returns a channel yielding all top-level keys in the store, in sorted order. Adds implementations for the memory store and the filestore.

This adds a function konserve.core/keys, which might not be the best name since it shadows clojure.core/keys. Suggestions for an alternative name are welcome.

  • src/konserve/core.cljc (keys): new function.
  • src/konserve/filestore.clj (list-keys): do I/O in a thread.
    (FileSystemStore): add PKeyIterable protocol.
  • src/konserve/memory.cljc (MemoryStore): add PKeyIterable protocol.
    (new-mem-store): force state to be a sorted-map.
  • src/konserve/protocols.cljc: Add PKeyIterable protocol.
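
To make the shape of the change concrete, here is a minimal sketch of a key-iteration protocol backed by a sorted map. It is not the code from this PR: `PKeyIterableSketch`, `-keys-sketch`, `ToyMemoryStore`, and `new-toy-store` are hypothetical names, the one-argument arity is an assumption, and the snippet assumes `org.clojure/core.async` (1.2 or later, for `to-chan!`) is on the classpath.

```clojure
(require '[clojure.core.async :as async])

;; Hypothetical stand-in for the protocol described above; the real
;; definition in src/konserve/protocols.cljc may differ in arity and docs.
(defprotocol PKeyIterableSketch
  (-keys-sketch [store]
    "Returns a channel that yields each top-level key, then closes."))

(defrecord ToyMemoryStore [state]
  PKeyIterableSketch
  (-keys-sketch [_]
    ;; state holds a sorted-map, so (keys ...) is already in sorted order.
    (async/to-chan! (keys @state))))

(defn new-toy-store
  "Creates a toy store whose state is forced to be a sorted-map."
  []
  (->ToyMemoryStore (atom (sorted-map))))

(def store (new-toy-store))
(swap! (:state store) assoc :b 2 :a 1)

;; Drain the channel into a vector (blocking take, JVM only).
(def all-keys (async/<!! (async/into [] (-keys-sketch store))))
```

Here `all-keys` ends up holding the keys in the sorted map's order, which is what forcing the state to a sorted-map buys the memory store.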

Author

csm commented Nov 15, 2019

I was trying to think of ways to do node GC in hitchhiker-tree, and at least some way to iterate keys in the store is a starting point. There might be other approaches, and the keys output might not need to be sorted.

Member

@whilo whilo left a comment

Nice work. I think using the Clojure name for keys is fine because we intentionally use the same names as for hash-maps. I usually require konserve under the alias k, which makes its usage explicit.

* src/konserve/filestore.clj: remove 1-arg -keys.
* src/konserve/memory.clj: remove 1-arg -keys; make keys iteration
  exclusive of start-key.
* src/konserve/protocols.clj: remove 1-arg -keys; update docstring to
  mention that key iteration is exclusive of start-key.
(ns konserve.key-compare
"Comparator for arbitrary types.")

(defn key-compare
Author

I'm unsure whether this is the right approach. I can't use clojure.core/compare because it doesn't work with heterogeneous types. This comparator sorts keywords before symbols, and symbols before strings. It could be refined so that all named values sort together (e.g. :bar < foo < "quux", and, say, :bar < bar < "bar"). But this might suffice.

Possibly just compare-by-edn is best: (compare (pr-str k1) (pr-str k2))?
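
For illustration, a small sketch of the two options mentioned in this comment, using only clojure.core on the JVM. `safe-compare` and `edn-compare` are hypothetical helper names, not part of konserve. It shows that `compare` throws on mixed types, and that comparing printed representations gives a total order at the cost of natural ordering for numbers.

```clojure
;; clojure.core/compare throws ClassCastException on mixed key types (JVM).
(defn safe-compare
  "Like compare, but returns :incomparable instead of throwing."
  [a b]
  (try
    (compare a b)
    (catch ClassCastException _ :incomparable)))

(safe-compare :bar "bar")
;; => :incomparable

;; The compare-by-edn idea: compare the printed representations instead.
(defn edn-compare
  [k1 k2]
  (compare (pr-str k1) (pr-str k2)))

(sort edn-compare [:bar 'foo "quux" 10 9])
;; => ("quux" 10 9 :bar foo)
```

The result is total but not natural: strings come first because their printed form starts with a double quote, and 10 sorts before 9 because the printed forms are compared character by character.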

Author

Also, it may be fine to leave it up to the implementation to define sort order for heterogeneous types, as long as keys of the same type are in natural order.
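
One way to read this suggestion is a comparator that first ranks keys by type and then falls back to natural order within a type. The sketch below is an assumption about how that could look, not the key-compare function from this PR; `type-rank` and `ranked-compare` are hypothetical names, and placing numbers after strings is an arbitrary choice.

```clojure
(defn type-rank
  "Assigns an arbitrary, implementation-defined rank to each key type."
  [k]
  (cond
    (keyword? k) 0
    (symbol? k)  1
    (string? k)  2
    (number? k)  3
    :else        4))

(defn ranked-compare
  "Orders by type rank first; keys of the same type use natural order.
  Keys of unranked types fall back to comparing their printed form."
  [a b]
  (let [ra (type-rank a)
        rb (type-rank b)]
    (cond
      (not= ra rb) (compare ra rb)
      (= ra 4)     (compare (pr-str a) (pr-str b))
      :else        (compare a b))))

(sort ranked-compare ["b" 'foo 2 :bar "a" 1])
;; => (:bar foo "a" "b" 1 2)
```

Only the within-type order is meaningful to callers here; another store could rank the types differently and still satisfy the same contract.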

Member

Yes, I think so. The hitchhiker-tree implements a comparison protocol for edn (https://github.com/replikativ/hitchhiker-tree/blob/master/src/hitchhiker/tree/key_compare.cljc) that we build upon in Datahike. Why do you need the keys sorted?

By the way, there has already been work on a tracing GC for the hitchhiker-tree, which I thought about building on: https://github.com/replikativ/hitchhiker-tree/tree/tracing-gc.

Author

My idea was to change the ID format for hh-tree nodes in konserve to something like <hex-sequential-id>.<node-guid>, where there is some kind of sequential ID added to each node address, and the GC process scans the storage in order, removing unreferenced addresses, until the sequential ID is >= the current sequential ID when the GC started. That way there's no need to worry about new addresses being added during the GC, because the sequential ID part will be larger. Having keys sorted would just help in stopping the scan as soon as a later identifier is encountered -- it's not strictly necessary, though. The same process would work even if keys aren't sorted.

The sequential ID could just be a value in the :db key that is incremented on each flush, or a current timestamp.
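
The scan described above can be sketched as a pure function over the listed addresses. This is only an illustration under stated assumptions: addresses are strings of the form `<hex-sequential-id>.<node-guid>`, `reachable` is a set of addresses still referenced from the tree, and `seq-id`, `sweep-candidates`, and `sweep-candidates-sorted` are hypothetical names.

```clojure
(require '[clojure.string :as str])

(defn seq-id
  "Parses the hexadecimal sequential-ID prefix of an address."
  [address]
  (Long/parseLong (first (str/split address #"\." 2)) 16))

(defn sweep-candidates
  "Unreferenced addresses written before the GC started.
  Works on keys in any order."
  [addresses reachable gc-start-id]
  (->> addresses
       (filter #(< (seq-id %) gc-start-id))
       (remove reachable)))

(defn sweep-candidates-sorted
  "Same result, but stops scanning at the first address whose ID is
  >= gc-start-id. Assumes addresses arrive ordered by sequential ID."
  [sorted-addresses reachable gc-start-id]
  (->> sorted-addresses
       (take-while #(< (seq-id %) gc-start-id))
       (remove reachable)))

(sweep-candidates ["05.d" "01.a" "03.c" "02.b"] #{"02.b"} 3)
;; => ("01.a")
```

If the store sorts addresses as strings, the early stop is only valid when the hex prefix is zero-padded to a fixed width, so that string order matches numeric order.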

Member

Yes, that makes sense. Flushing then needs an additional argument in the hitchhiker-tree and we can just use any monotone lattice that provides a happened-before relation, e.g. a counter. That way we could GC even after merges of databases. Keeping a consistent active set can be tricky in a distributed system though, because we will ship the root nodes to reading client replicas, so maybe we want to use physical time on the transactor so that we can GC only the values that are older than some time window needed for clients to replicate index fragments.

Member

I think it would be better to do the sorting of keys just in memory after having loaded them and leave this implementation detail out of the keys protocol for now. Even in a large DB with millions of tree nodes (i.e. billions of datoms) the keys only have a size of maybe a few dozen megabytes in memory.

Author

I was leaning towards using a timestamp, since that's monotonic without any coordination -- even though keeping accurate time in a distributed system is difficult, we only need it to be approximate, just enough so the GC doesn't remove newly added nodes.

I hear you about the sorting requirements in the protocol -- I'll remove them; it's not worth that effort for that kind of optimization. The GC doesn't even need to sort them anyway to work properly.

Member

Yes, I just keep an eye on forking and joining databases and which operations would make that hard. But a timestamp with a conservatively large window should be fine.
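
The timestamp variant with a safety window might look like the following sketch. It assumes each node address maps to the wall-clock time (in milliseconds) at which it was written; `collectable?`, `sweep-by-age`, and the `written-at` map are hypothetical and not part of konserve or the hitchhiker-tree.

```clojure
(defn collectable?
  "True when a node is older than the GC start time minus the window."
  [written-at-ms gc-start-ms window-ms]
  (< written-at-ms (- gc-start-ms window-ms)))

(defn sweep-by-age
  "Returns the set of addresses that are old enough and unreferenced.
  written-at maps address -> write time in ms; reachable is a set."
  [written-at reachable gc-start-ms window-ms]
  (->> written-at
       (filter (fn [[address ts]]
                 (and (collectable? ts gc-start-ms window-ms)
                      (not (contains? reachable address)))))
       (map first)
       set))

(sweep-by-age {"a" 1000 "b" 2000 "c" 9000} #{"b"} 10000 5000)
;; => #{"a"}
```

A larger window delays reclamation but tolerates more clock skew and gives reading replicas longer to fetch index fragments.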

* src/konserve/core.clj (keys): add note that order of types is
  implementation-dependent, but keys of the same type are in natural
  order.
* src/konserve/key_compare.cljc: moved to cljc from clj.
* src/konserve/protocols.cljc (-keys): add note about how types should
  be ordered.
Add some basic tests for keys call.
Fix -keys implementation in filestore.
Member

@whilo whilo left a comment

Very nice work. Thank you!

@whilo whilo merged commit 4750a64 into replikativ:master Nov 19, 2019