jackrabbit-oak-issues mailing list archives

From "Axel Hanikel (Jira)" <j...@apache.org>
Subject [jira] [Commented] (OAK-7932) A distributed node store for the cloud
Date Fri, 12 Mar 2021 12:15:00 GMT

    [ https://issues.apache.org/jira/browse/OAK-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300266#comment-17300266 ]

Axel Hanikel commented on OAK-7932:
-----------------------------------

h1. Current status
The "third implementation" described above works pretty well although it is still a bit slow.
The code is at [https://github.com/ahanikel/jackrabbit-oak/tree/zeromq-nodestore-kafka-finegrained-checkpoints-1.38.0].
The next steps would be to find the bottlenecks and to change the internal handling of nodestates.
At the moment they are "special" but they should actually be handled just like blobs, because
that's what they are. It would also make the code simpler.
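A minimal sketch of what "handling nodestates like blobs" could mean: the serialised nodestate is just bytes, so it could go through the same store interface as any other blob, keyed by a content-derived uuid. The {{NodeStateBlobStore}} class and its methods below are hypothetical, not code from the branch:
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a nodestate is stored exactly like a blob,
// addressed by a uuid derived from its serialised representation.
public class NodeStateBlobStore {

    private final Map<UUID, byte[]> blobs = new ConcurrentHashMap<>();

    // Store any blob (serialised nodestate or binary) under its content-derived uuid.
    public UUID put(byte[] data) {
        UUID uuid = UUID.nameUUIDFromBytes(data);
        blobs.put(uuid, data);
        return uuid;
    }

    public byte[] get(UUID uuid) {
        return blobs.get(uuid);
    }

    // A serialised nodestate takes the very same path as any other blob.
    public UUID putNodeState(String serialisedNodeState) {
        return put(serialisedNodeState.getBytes(StandardCharsets.UTF_8));
    }
}
{code}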

> A distributed node store for the cloud
> --------------------------------------
>
>                 Key: OAK-7932
>                 URL: https://issues.apache.org/jira/browse/OAK-7932
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: segment-tar
>            Reporter: Axel Hanikel
>            Assignee: Axel Hanikel
>            Priority: Minor
>
> h1. Outline
> This issue documents some proof-of-concept work for adapting the segment tar nodestore to a distributed environment. The main idea is to adopt an actor-like model, meaning:
> *   Communication between actors (services) is done exclusively via messages.
> *   An actor (which could also be a thread) processes one message at a time, avoiding sharing state with other actors as far as possible.
> *   Nodestates are kept in RAM and are written to external storage lazily, only for disaster recovery.
> *   A nodestate is identified by a uuid, which in turn is a hash of its serialised string representation (see the sketch after this list).
> *   As RAM is a very limited resource, different actors own their share of the total uuid space.
> *   An actor might also cache a few nodestates which it does not own but which it uses often (such as the one containing the root node).
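> A minimal sketch of the uuid and ownership idea above (the concrete hash function and partitioning scheme shown here are illustrations, not necessarily what the branches use):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.UUID;
> 
> public final class UuidOwnership {
> 
>     // The uuid of a nodestate is derived from its serialised string representation.
>     static UUID uuidOf(String serialisedNodeState) {
>         return UUID.nameUUIDFromBytes(serialisedNodeState.getBytes(StandardCharsets.UTF_8));
>     }
> 
>     // Each actor owns a slice of the uuid space; here simply the most
>     // significant bits modulo the number of actors.
>     static int owningActor(UUID uuid, int numActors) {
>         return (int) Math.floorMod(uuid.getMostSignificantBits(), (long) numActors);
>     }
> 
>     public static void main(String[] args) {
>         String serialised = "begin ZeroMQNodeState\n...\nend ZeroMQNodeState\n";
>         UUID uuid = uuidOf(serialised);
>         System.out.println(uuid + " is owned by actor " + owningActor(uuid, 4));
>     }
> }
> {code}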
> h1. Implementation
> *The first idea* was to use the segment node store, and ZeroMQ for communication, because it seems to be a high-quality and easy-to-use implementation. A major drawback is that the library is written in C, and the Java library which does the JNI bindings seems hard to set up and did not work for me. There is a native Java implementation of the ZeroMQ protocol, aptly called jeromq, which seems to work well so far. This approach will probably not be pursued because, due to the nature of how things are stored in segments, they are hard to cache (it seems like a large part of the repository would eventually end up in the cache).
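> To illustrate the messaging layer, here is a minimal jeromq request/reply pair in the spirit of the actor model described above; the endpoint and the message content are made up for the example:
> {code:java}
> import org.zeromq.SocketType;
> import org.zeromq.ZContext;
> import org.zeromq.ZMQ;
> 
> public class JeromqEcho {
>     public static void main(String[] args) {
>         try (ZContext ctx = new ZContext()) {
>             // "Actor" side: receives one message at a time and replies.
>             ZMQ.Socket rep = ctx.createSocket(SocketType.REP);
>             rep.bind("tcp://*:5555");
> 
>             // Client side: sends a request and waits for the reply.
>             ZMQ.Socket req = ctx.createSocket(SocketType.REQ);
>             req.connect("tcp://localhost:5555");
> 
>             req.send("read 856d1356-7054-3993-894b-a04426956a78");
>             String request = rep.recvStr();
>             rep.send("serialised nodestate for " + request.substring(5));
>             System.out.println(req.recvStr());
>         }
>     }
> }
> {code}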
> *A second implementation*, at [https://github.com/ahanikel/jackrabbit-oak/tree/zeromq-nodestore], is a simple nodestore implementation which is kind of a dual to the segment store, in the sense that it is on the other end of the compactness spectrum. The segment store is very dense and avoids duplication wherever possible. The nodestore in this implementation, however, is quite redundant: every nodestate gets its own UUID (a hash of the serialised nodestate) and is saved together with its properties, similar to the document node store.
> Here is what a serialised nodestate looks like:
> {noformat}
> begin ZeroMQNodeState
> begin children
> allow	856d1356-7054-3993-894b-a04426956a78
> end children
> begin properties
> jcr:primaryType <NAME> = rep:ACL
> :childOrder <NAMES> = [allow]
> end properties
> end ZeroMQNodeState
> {noformat}
> This verbose format is good for debugging but expensive to generate and parse, so it may be replaced with a binary format at some point. But it shows how child nodestates are referenced and how properties are represented. Binary properties (not shown here) are represented by a reference to the blob store.
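> As an illustration of how such a serialiser might look (the class, method names and input shape are hypothetical; the real one lives in the branch linked above):
> {code:java}
> import java.util.Map;
> import java.util.UUID;
> 
> // Hypothetical sketch of the verbose text serialisation shown above.
> public class VerboseSerializer {
> 
>     static String serialize(Map<String, UUID> children, Map<String, String> properties) {
>         StringBuilder sb = new StringBuilder();
>         sb.append("begin ZeroMQNodeState\n");
>         sb.append("begin children\n");
>         children.forEach((name, uuid) -> sb.append(name).append('\t').append(uuid).append('\n'));
>         sb.append("end children\n");
>         sb.append("begin properties\n");
>         // Property values are assumed to be rendered already as "<TYPE> = value" strings.
>         properties.forEach((name, rendered) -> sb.append(name).append(' ').append(rendered).append('\n'));
>         sb.append("end properties\n");
>         sb.append("end ZeroMQNodeState\n");
>         return sb.toString();
>     }
> }
> {code}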
> The redundancy (compared with the segment store with its fine-grained record structure) wastes space, but on the other hand garbage collection (not yet implemented) is easier because there is no segment that needs to be rewritten to get rid of data that is no longer referenced; unreferenced nodes can just be deleted. This implementation still has bugs, but being much simpler than the segment store, it can eventually be used to experiment with different configurations and examine their performance.
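> A rough sketch of what such a garbage collection could look like: mark the uuids reachable from the journal head, then drop everything else. The store interface used here is hypothetical:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
> 
> public class SimpleGc {
> 
>     // Hypothetical view of the store: every nodestate uuid maps to the uuids of its children.
>     static void collect(Map<UUID, List<UUID>> nodes, UUID journalHead) {
>         Set<UUID> reachable = new HashSet<>();
>         Deque<UUID> todo = new ArrayDeque<>();
>         todo.push(journalHead);
>         while (!todo.isEmpty()) {
>             UUID current = todo.pop();
>             if (reachable.add(current)) {
>                 todo.addAll(nodes.getOrDefault(current, List.of()));
>             }
>         }
>         // Unreferenced nodestates can simply be dropped; no segment has to be rewritten.
>         nodes.keySet().retainAll(reachable);
>     }
> }
> {code}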
> *A third implementation*, at [https://github.com/ahanikel/jackrabbit-oak/tree/zeromq-nodestore-kafka-finegrained-checkpoints-1.38.0], is based on the second one but uses a more compact serialisation format, which records the differences between a new nodestate and its base state, similar to what {{.compareAgainstBaseState()}} produces in code. It looks like this:
> {noformat}
> n: 823f2252-db37-b0ca-3f7e-09cd073b530a 00000000-0000-0000-0000-000000000000
> n+ root 00000000-0000-0000-0000-000000000000
> n+ checkpoints 00000000-0000-0000-0000-000000000000
> n+ blobs 00000000-0000-0000-0000-000000000000
> n!
> journal golden 823f2252-db37-b0ca-3f7e-09cd073b530a
> checkpoints golden 00000000-0000-0000-0000-000000000000
> n: 394cdbf7-c75b-1feb-23f9-2ca93a8d97eb 00000000-0000-0000-0000-000000000000
> p+ :clusterId <STRING> = 2441f7f8-5ce8-493f-9b27-c4bca802ea0b
> n!
> n: 794ccfa2-4372-5b05-c54f-d2b6b6d2df03 00000000-0000-0000-0000-000000000000
> n+ :clusterConfig 394cdbf7-c75b-1feb-23f9-2ca93a8d97eb
> n!
> n: fe9f9008-7734-bd69-17d0-ccecf4003323 823f2252-db37-b0ca-3f7e-09cd073b530a
> n^ root 794ccfa2-4372-5b05-c54f-d2b6b6d2df03 00000000-0000-0000-0000-000000000000
> n!
> journal golden fe9f9008-7734-bd69-17d0-ccecf4003323
> b64+ 516f4f64-4e5c-432e-b9bf-0267c1e96541
> b64d ClVVRU5DT0RFKDEpICAgICAgICAgICAgICAgQlNEIEdlbmVyYWwgQ29tbWFuZHMgTWFudWFsICAg
> b64d ICAgICAgICAgICBVVUVOQ09ERSgxKQoKTghOQQhBTQhNRQhFCiAgICAgdQh1dQh1ZAhkZQhlYwhj
> b64d bwhvZAhkZQhlLCB1CHV1CHVlCGVuCG5jCGNvCG9kCGRlCGUgLS0gZW5jb2RlL2RlY29kZSBhIGJp
> b64d bmFyeSBmaWxlCgpTCFNZCFlOCE5PCE9QCFBTCFNJCElTCFMKICAgICB1CHV1CHVlCGVuCG5jCGNv
> b64d CG9kCGRlCGUgWy0ILW0IbV0gWy0ILW8IbyBfCG9fCHVfCHRfCHBfCHVfCHRfCF9fCGZfCGlfCGxf
> b64d CGVdIFtfCGZfCGlfCGxfCGVdIF8Ibl8IYV8IbV8IZQogICAgIHUIdXUIdWQIZGUIZWMIY28Ib2QI
> b64d ZGUIZSBbLQgtYwhjaQhpcAhwcwhzXSBbXwhmXwhpXwhsXwhlIF8ILl8ILl8ILl0KICAgICB1CHV1
> b64d CHVkCGRlCGVjCGNvCG9kCGRlCGUgWy0ILWkIaV0gLQgtbwhvIF8Ib18IdV8IdF8IcF8IdV8IdF8I
> b64d X18IZl8IaV8IbF8IZSBbXwhmXwhpXwhsXwhlXQo=
> b64!
> {noformat}
> {{n:}} starts a nodestate. {{n+}} and {{n^}} both reference a nodestate which has been defined before: the first one says that a node has been created, the second one means an existing node has been changed, each indicating the nodestate's uuid and the uuid of the base state. This is slightly different from the format described in OAK-7849, which doesn't record the uuids. {{b64+}} adds a new base64-encoded blob with the given uuid, followed by the encoded data.
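> As an illustration of how such records could be produced with Oak's {{NodeStateDiff}} callback API: the sketch below is simplified, and the {{uuidOf}} lookup is hypothetical; the real serialiser is in the branch linked above.
> {code:java}
> import org.apache.jackrabbit.oak.api.PropertyState;
> import org.apache.jackrabbit.oak.spi.state.NodeState;
> import org.apache.jackrabbit.oak.spi.state.NodeStateDiff;
> 
> // Sketch: emit "n+", "n^", "n-" and "p+" style records from a diff against the base state.
> public class DiffSerializer implements NodeStateDiff {
> 
>     private final StringBuilder out = new StringBuilder();
> 
>     // Hypothetical lookup of the uuid under which a nodestate is stored.
>     private String uuidOf(NodeState state) {
>         return "00000000-0000-0000-0000-000000000000";
>     }
> 
>     String serialize(NodeState after, NodeState base) {
>         out.append("n: ").append(uuidOf(after)).append(' ').append(uuidOf(base)).append('\n');
>         after.compareAgainstBaseState(base, this);
>         out.append("n!\n");
>         return out.toString();
>     }
> 
>     @Override
>     public boolean propertyAdded(PropertyState after) {
>         out.append("p+ ").append(after.getName()).append('\n');
>         return true;
>     }
> 
>     @Override
>     public boolean propertyChanged(PropertyState before, PropertyState after) {
>         out.append("p^ ").append(after.getName()).append('\n');
>         return true;
>     }
> 
>     @Override
>     public boolean propertyDeleted(PropertyState before) {
>         out.append("p- ").append(before.getName()).append('\n');
>         return true;
>     }
> 
>     @Override
>     public boolean childNodeAdded(String name, NodeState after) {
>         out.append("n+ ").append(name).append(' ').append(uuidOf(after)).append('\n');
>         return true;
>     }
> 
>     @Override
>     public boolean childNodeChanged(String name, NodeState before, NodeState after) {
>         out.append("n^ ").append(name).append(' ').append(uuidOf(after)).append(' ').append(uuidOf(before)).append('\n');
>         return true;
>     }
> 
>     @Override
>     public boolean childNodeDeleted(String name, NodeState before) {
>         out.append("n- ").append(name).append(' ').append(uuidOf(before)).append('\n');
>         return true;
>     }
> }
> {code}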



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
