jackrabbit-oak-issues mailing list archives

From "Thomas Mueller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-3122) Direct copy of the chunked data between blob stores
Date Fri, 24 Jul 2015 07:43:04 GMT

    [ https://issues.apache.org/jira/browse/OAK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640089#comment-14640089 ]

Thomas Mueller commented on OAK-3122:

Sorry for the delay, I did not see this issue before.

Some remarks (I guess you know this already, just to be clear): 

* For the FileBlobStore and MongoBlobStore, the length is encoded in the blobId as well, but
using a special encoding rather than "<id>#<length>". The length of the id itself is variable.

* For the FileDataStore, the length is encoded in the id using the format "<id>#<length>".
This is relatively new code and was not there in Jackrabbit 2.x originally. These ids don't
really have a fixed length either: for entries that are stored in a file, the "<id>" part of
"<id>#<length>" has a fixed length, but small entries (inlined rather than stored in a file)
have variable-length ids, which can be quite large.
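As a side note, the "<id>#<length>" suffix is simple to parse. A minimal sketch (illustrative code only, not Oak's actual parsing logic; the class and method names are made up):

```java
// Hypothetical helper: extract the length suffix from a
// FileDataStore-style id of the form "<id>#<length>".
class BlobIdLength {

    static long lengthFromId(String blobId) {
        int hash = blobId.lastIndexOf('#');
        if (hash < 0) {
            return -1; // no length suffix, e.g. an old-style id
        }
        try {
            return Long.parseLong(blobId.substring(hash + 1));
        } catch (NumberFormatException e) {
            return -1; // suffix is not a number
        }
    }
}
```

Ids without a parseable "#<length>" suffix simply report an unknown length here.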

> Most people use FileDataStore instead of FileBlobStore.

Yes, that's true. The reasons for this (AFAIK) are: (a) a simple upgrade from Jackrabbit 2.x
without having to migrate binaries, and (b) using proven code. But with the addition of the
content length to the id ("<id>#<length>"), the code changed as well, so (b) is no longer
really true. And by now the MongoBlobStore is proven code as well.

So the question is: should we try to support the FileDataStore for this use case? I'm not sure.
It would be simpler to use the FileBlobStore. Using a persistent mapping (a map from one id
to the other) might be a performance problem, depending on the size of the map, because blob
ids have _very_ bad locality. As for the risk of corruption: if the map is immutable once
created, it should be fine, but I wouldn't want to support writing to the map *after* the
migration.
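To make the "frozen after migration" point concrete, a sketch of the mapping file approach (the class and method names are illustrative, not an existing Oak API):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the "mapping file approach": a map from old blob ids to
// new ones that is frozen once the migration completes, so later
// corruption through writes is impossible by construction.
class MigrationIdMap {

    private final Map<String, String> oldToNew;

    MigrationIdMap(Map<String, String> entries) {
        // Defensive copy, then wrap it read-only: no writes after migration.
        this.oldToNew = Collections.unmodifiableMap(new HashMap<>(entries));
    }

    /** Returns the migrated id, or the original id if it was never migrated. */
    String resolve(String blobId) {
        return oldToNew.getOrDefault(blobId, blobId);
    }
}
```

In a real implementation the map would of course be loaded from (and persisted to) a file, and its size is exactly the performance concern mentioned above.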

What we could do (not sure whether it's a good idea or not) is to use a "SplitBlobStore" that
reads data from the FileBlobStore or the FileDataStore (or another store, for example the
MongoBlobStore), depending on the id, but always writes new entries to the FileDataStore (or
the FileBlobStore, or another store). That way, people could slowly migrate from one store to
the other.
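A rough sketch of that idea (the interface, class names, and the prefix-based routing rule are my own illustration, not an existing Oak class; the real BlobStore API works on streams, not byte arrays):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a blob store, simplified to byte arrays for the sketch.
interface SimpleStore {
    byte[] read(String blobId);
    String write(byte[] data);
}

// The "SplitBlobStore" idea: reads are routed to the old or new store
// based on the id, writes always go to the new store, so content
// migrates gradually as it is rewritten.
class SplitStore implements SimpleStore {

    private final SimpleStore oldStore;
    private final SimpleStore newStore;
    private final String newIdPrefix; // assumption: new ids are recognizable by prefix

    SplitStore(SimpleStore oldStore, SimpleStore newStore, String newIdPrefix) {
        this.oldStore = oldStore;
        this.newStore = newStore;
        this.newIdPrefix = newIdPrefix;
    }

    public byte[] read(String blobId) {
        return blobId.startsWith(newIdPrefix)
                ? newStore.read(blobId)
                : oldStore.read(blobId);
    }

    public String write(byte[] data) {
        return newStore.write(data); // new entries always land in the new store
    }
}

// Minimal in-memory store, only to make the sketch self-contained.
class InMemoryStore implements SimpleStore {

    private final Map<String, byte[]> blobs = new HashMap<>();
    private final String prefix;
    private int next = 0;

    InMemoryStore(String prefix) {
        this.prefix = prefix;
    }

    public byte[] read(String blobId) {
        return blobs.get(blobId);
    }

    public String write(byte[] data) {
        String id = prefix + (next++);
        blobs.put(id, data);
        return id;
    }
}
```

How the split store tells "old" ids from "new" ids is the interesting part in practice; a prefix is just one option.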

Both the "mapping file approach" and the "SplitBlobStore" approach could be combined.

> Direct copy of the chunked data between blob stores
> ---------------------------------------------------
>                 Key: OAK-3122
>                 URL: https://issues.apache.org/jira/browse/OAK-3122
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core, mongomk, upgrade
>            Reporter: Tomek Rękawek
>             Fix For: 1.4
> It could be useful to have a tool that allows copying blob chunks directly between different
stores, so users can quickly migrate their data without needing to touch the node store, consolidate
binaries, etc.
> Such a tool should have direct access to the methods operating on the binary blocks, implemented
in the {{AbstractBlobStore}} and its subtypes:
> {code}
> void storeBlock(byte[] digest, int level, byte[] data);
> byte[] readBlockFromBackend(BlockId blockId);
> Iterator<String> getAllChunkIds(final long maxLastModifiedTime);
> {code}
> My proposal is to create a {{ChunkedBlobStore}} interface containing these methods, which
can be implemented by {{FileBlobStore}} and {{MongoBlobStore}}.
> Then we can enumerate all chunk ids, read the underlying blocks from the source blob store
and save them in the destination.
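The enumerate-read-store loop proposed in the issue could be sketched roughly as follows. Note this uses a simplified interface with String chunk ids; the quoted signatures use digest/level and {{BlockId}}, so this is only an approximation, and the names are made up:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the proposed ChunkedBlobStore interface.
interface ChunkStore {
    Iterator<String> getAllChunkIds(long maxLastModifiedTime);
    byte[] readChunk(String chunkId);
    void storeChunk(String chunkId, byte[] data);
}

// Minimal in-memory implementation, for illustration only.
class InMemoryChunkStore implements ChunkStore {

    private final Map<String, byte[]> chunks = new LinkedHashMap<>();

    public Iterator<String> getAllChunkIds(long maxLastModifiedTime) {
        return chunks.keySet().iterator(); // time filter ignored in this sketch
    }

    public byte[] readChunk(String chunkId) {
        return chunks.get(chunkId);
    }

    public void storeChunk(String chunkId, byte[] data) {
        chunks.put(chunkId, data);
    }
}

class ChunkCopier {

    /** Copies every chunk from source to destination; returns the number copied. */
    static int copyAll(ChunkStore source, ChunkStore destination) {
        int copied = 0;
        Iterator<String> ids = source.getAllChunkIds(Long.MAX_VALUE); // no time cut-off
        while (ids.hasNext()) {
            String id = ids.next();
            destination.storeChunk(id, source.readChunk(id));
            copied++;
        }
        return copied;
    }
}
```

The real tool would additionally have to deal with chunks modified during the copy, which is what the maxLastModifiedTime parameter hints at.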

This message was sent by Atlassian JIRA
