Add protocol for iterating keys in a store #29
Conversation
Introduces a new protocol `PKeyIterable`, which defines the method `-keys` that returns a channel yielding all toplevel keys in the store, in sorted order. Adds implementations of this for the memory store and filestore.

This adds a function `konserve.core/keys`, which might not be the best name since it shadows `clojure.core/keys`. Alternate name suggestions are welcome.

* src/konserve/core.cljc (keys): new function.
* src/konserve/filestore.clj (list-keys): do I/O in a thread. (FileSystemStore): add PKeyIterable protocol.
* src/konserve/memory.cljc (MemoryStore): add PKeyIterable protocol. (new-mem-store): force state to be a sorted-map.
* src/konserve/protocols.cljc: add PKeyIterable protocol.
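A minimal sketch of the shape this takes (the namespace, arglists, and docstrings here are assumptions for illustration, not the PR's literal code):

```clojure
(ns konserve.keys-sketch
  ;; konserve.core/keys shadows clojure.core/keys, so callers (and this
  ;; sketch) exclude the core var.
  (:refer-clojure :exclude [keys]))

;; PKeyIterable as described above: -keys returns a channel that yields
;; the store's toplevel keys.
(defprotocol PKeyIterable
  (-keys [this] "Returns a channel yielding all toplevel keys of the store."))

;; konserve.core/keys would then just delegate to the protocol method.
(defn keys
  "Returns a channel yielding all toplevel keys in store."
  [store]
  (-keys store))
```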
I was trying to think of ways to do node GC in hitchhiker-tree, and at least some way to iterate keys in the store is a starting point. There might be other approaches, and the keys output might not need to be sorted.
Nice work. I think using the Clojure name for `keys` is fine because we use the same names as for hash-maps intentionally. Usually I require konserve under the alias `k`, which makes its usage explicit.
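For example, a call site with that alias might look like this (a usage sketch; whether the channel from `k/keys` yields keys one by one or as a single collection is an assumption here):

```clojure
(require '[clojure.core.async :refer [<!!]]
         '[konserve.core :as k]
         '[konserve.memory :refer [new-mem-store]])

(def store (<!! (new-mem-store)))
(<!! (k/assoc-in store [:foo] 42))

;; The k/ prefix makes the shadowing of clojure.core/keys explicit.
(<!! (k/keys store))
```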
* src/konserve/filestore.clj: remove 1-arg -keys.
* src/konserve/memory.clj: remove 1-arg -keys; make keys iteration exclusive of start-key.
* src/konserve/protocols.clj: remove 1-arg -keys; update docstring to mention that key iteration is exclusive of start-key.
src/konserve/key_compare.clj (outdated excerpt)

```clojure
(ns konserve.key-compare
  "Comparator for arbitrary types.")

(defn key-compare
  ;; ...
  )
```
I'm unsure if this is the right approach. I can't use `clojure.core/compare` because that won't work with heterogeneous types. This will sort keywords before symbols, and symbols before strings, for example. It could be refined so that all named values sort together (e.g. `:bar` < `foo` < `"quux"`, and say `:bar` < `bar` < `"bar"`), but this might suffice.

Possibly just compare-by-edn is best: `(compare (pr-str k1) (pr-str k2))`?
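To make the two options concrete, here is a hedged sketch of each (illustrative only; this is not necessarily what `konserve.key-compare` actually does):

```clojure
;; Option 1: compare-by-edn -- a total order over the printed representation.
(defn edn-compare [k1 k2]
  (compare (pr-str k1) (pr-str k2)))

;; Option 2: rank heterogeneous types first, then use natural order within
;; a type. The particular ranking (keyword < symbol < string < number < other)
;; is an illustrative assumption.
(defn type-rank [x]
  (cond (keyword? x) 0
        (symbol? x)  1
        (string? x)  2
        (number? x)  3
        :else        4))

(defn key-compare [k1 k2]
  (let [r (compare (type-rank k1) (type-rank k2))]
    (if (zero? r)
      (try
        (compare k1 k2)            ; natural order for keys of the same type
        (catch Exception _
          (edn-compare k1 k2)))    ; fall back for values compare can't handle
      r)))
```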
Also, it may be fine to leave it up to the implementation to define sort order for heterogeneous types, as long as keys of the same type are in natural order.
Yes, I think so. The hitchhiker-tree implements a comparison protocol for edn (https://github.com/replikativ/hitchhiker-tree/blob/master/src/hitchhiker/tree/key_compare.cljc) that we build upon in Datahike. Why do you need the keys sorted?

Btw., there was already work on a tracing GC for the hitchhiker-tree that I thought about building on: https://github.com/replikativ/hitchhiker-tree/tree/tracing-gc.
My idea was to change the ID format for hh-tree nodes in konserve to something like `<hex-sequential-id>.<node-guid>`, where some kind of sequential ID is added to each node address, and the GC process scans the storage in order, removing unreferenced addresses, until the sequential ID is >= the current sequential ID when the GC started. That way there's no need to worry about new addresses being added during the GC, because their sequential ID part will be larger. Having keys sorted would just help in stopping the scan as soon as a later identifier is encountered; it's not strictly necessary, though. The same process would work even if keys aren't sorted.

The sequential ID could just be a value in the `:db` key that is incremented on each flush, or a current timestamp.
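To make that concrete, a rough sketch of the sweep under the proposed key format (everything here, including `referenced?` and `delete!`, is hypothetical and not existing konserve or hitchhiker-tree API; string keys of the form "hexid.guid" are assumed):

```clojure
(require '[clojure.string :as str])

(defn parse-seq-id
  "Extracts the hex sequential-id prefix from a string key like \"1a2b.<node-guid>\"."
  [k]
  (-> (str/split k #"\." 2) first (Long/parseLong 16)))

(defn gc-sweep
  "Deletes unreferenced keys whose sequential id predates gc-start-id, the id
  observed when the GC started. Keys written during the GC carry a larger id
  and are therefore never touched."
  [store-keys referenced? gc-start-id delete!]
  (doseq [k store-keys
          :let [id (parse-seq-id k)]
          :when (and (< id gc-start-id)
                     (not (referenced? k)))]
    (delete! k)))
```

With sorted keys the same loop could additionally stop at the first id >= gc-start-id (e.g. via `take-while`), but as noted that is only an optimization.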
Yes, that makes sense. Flushing then needs an additional argument in the hitchhiker-tree, and we can just use any monotone lattice that provides a happened-before relation, e.g. a counter. That way we could GC even after merges of databases. Keeping a consistent active set can be tricky in a distributed system, though, because we ship the root nodes to reading client replicas; so maybe we want to use physical time on the transactor, so that we GC only values that are older than some time window needed for clients to replicate index fragments.
I think it would be better to do the sorting of keys just in memory after having loaded them and leave this implementation detail out of the keys protocol for now. Even in a large DB with millions of tree nodes (i.e. billions of datoms) the keys only have a size of maybe a few dozen megabytes in memory.
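For instance, something like this (a sketch assuming the channel from `k/keys` delivers the whole key collection at once, and reusing a comparator like the one sketched earlier):

```clojure
;; <!! is clojure.core.async/<!!; key-compare is any total comparator over keys.
(sort key-compare (<!! (k/keys store)))
```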
I was leaning towards using a timestamp, since that's monotonic without any coordination -- even though keeping accurate time in a distributed system is difficult, we only need it to be approximate, just enough so the GC doesn't remove newly added nodes.
I hear you about the sorting requirements in the protocol -- I'll remove them; that kind of optimization isn't worth the effort, and the GC doesn't need the keys sorted to work properly anyway.
Yes, I'm just keeping an eye on forking and joining databases and on which operations would make that hard. But a timestamp with a conservatively large window should be fine.
* src/konserve/core.clj (keys): add note that the order of types is implementation-dependent, but keys of the same type are in natural order.
* src/konserve/key_compare.cljc: moved to cljc from clj.
* src/konserve/protocols.cljc (-keys): add note about how types should be ordered.
Add some basic tests for keys call.
Fix -keys implementation in filestore.
Very nice work. Thank you!