jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "Oakathon November 2017" by MattRyan
Date Wed, 15 Nov 2017 08:46:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "Oakathon November 2017" page has been changed by MattRyan:
https://wiki.apache.org/jackrabbit/Oakathon%20November%202017?action=diff&rev1=37&rev2=38

Comment:
Added notes from Glacier data store discussion.

  
  Finally, we discussed what we may consider to be the first use case of this capability in
Oak.  Initially Matt proposed that allowing the CompositeDataStore to select a delegate based
on path information may be the first use case.  Another suggestion (Amit? Vikas?) was that
a smaller use case might exist just within Oak to use a CompositeDataStore and store only
index segments in one delegate and everything else in the other.  In that case this would
happen entirely within Oak and the user would not be aware that a CompositeDataStore was being
used.
  
+ == Handling requests for blobs that aren't immediately available ==
+ The prime example of this scenario is using AWS Glacier as an Oak data store option.  Glacier
doesn't make a lot of sense as a data store by itself, but it might make sense as the lowest-priority
delegate of a CompositeDataStore that has some support for tiering or prioritization among
its data stores.
+ 
+ The biggest challenge with using Glacier is that unarchiving blobs from Glacier is a time-consuming
task.  The standard expectation is 4+ hours; expedited extraction is possible but even in
this case retrieval is on the order of minutes.  This clearly means we would need some mechanism
for conveying "I can get the requested blob; I don't have it now, but I will have it in the
future."  Once a blob is unarchived it must be retrieved within 24 hours or it will return
to the archived state.
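
One way to picture the "I don't have it now, but I will have it in the future" answer is sketched below.  This is only an illustration, not an Oak or AWS API: `BlobStatus`, `BlobAvailability`, `request_restore` and `status_at` are hypothetical names, and the 4-hour delay and 24-hour window are simply the figures quoted in these notes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class BlobStatus(Enum):
    AVAILABLE = "available"  # blob can be read right now
    PENDING = "pending"      # restore requested; blob will be readable later
    ARCHIVED = "archived"    # blob is in cold storage and not readable


# Assumed figures, taken from the notes; real values depend on the retrieval tier.
STANDARD_RESTORE_DELAY = timedelta(hours=4)
RESTORED_WINDOW = timedelta(hours=24)


@dataclass
class BlobAvailability:
    """Hypothetical answer to the question 'can I read this blob?'."""
    blob_id: str
    status: BlobStatus
    ready_at: Optional[datetime] = None    # earliest expected read time
    expires_at: Optional[datetime] = None  # when the restored copy re-archives


def request_restore(blob_id: str, now: datetime) -> BlobAvailability:
    """Model a restore request: not readable now, but readable in the future."""
    ready_at = now + STANDARD_RESTORE_DELAY
    return BlobAvailability(blob_id, BlobStatus.PENDING, ready_at,
                            ready_at + RESTORED_WINDOW)


def status_at(avail: BlobAvailability, when: datetime) -> BlobStatus:
    """Resolve the state of a restore request at a given point in time."""
    if avail.ready_at is None or avail.expires_at is None:
        return avail.status
    if when < avail.ready_at:
        return BlobStatus.PENDING
    if when < avail.expires_at:
        return BlobStatus.AVAILABLE
    return BlobStatus.ARCHIVED  # the retrieval window was missed
```

A caller would get `PENDING` together with an estimated `ready_at` instead of an error, and could decide whether to wait, retry later, or give up.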
+ 
+ A new component (referred to as a "curator" in this discussion) was suggested.  The role
of the curator would be to retrieve unarchived objects from Glacier to a higher-level storage
tier for future access.  As proposed it would also have the role of applying policy to move
infrequently accessed objects to lower tiers, eventually to Glacier.  Because the curator
is the component that moves objects, it knows where they are and when they moved, so Oak
always knows their whereabouts.  If objects are moved outside of Oak, it becomes difficult
for Oak to keep track of blobs and their locations, which could result in the composite data
store requesting a blob from a delegate where it no longer exists.
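
The curator idea above can be sketched roughly as follows.  This is a minimal sketch under stated assumptions: the `Curator` class, its method names and the tier names used with it are hypothetical, and no actual data is moved; the point is only that the component that moves blobs is also the one that records where they are.

```python
from typing import Dict, List


class Curator:
    """Hypothetical curator: the only component allowed to move blobs between tiers."""

    def __init__(self, tiers: List[str]) -> None:
        # Tiers are ordered from highest priority (fastest) to lowest (coldest).
        self.tiers = tiers
        self.location: Dict[str, int] = {}  # blob id -> index into self.tiers

    def add(self, blob_id: str) -> None:
        """New blobs start in the highest tier."""
        self.location[blob_id] = 0

    def demote(self, blob_id: str) -> str:
        """Move a blob one tier down (e.g. by policy) and record the new location."""
        idx = min(self.location[blob_id] + 1, len(self.tiers) - 1)
        self.location[blob_id] = idx
        return self.tiers[idx]

    def promote(self, blob_id: str) -> str:
        """Move a blob one tier up, e.g. after it has been unarchived."""
        idx = max(self.location[blob_id] - 1, 0)
        self.location[blob_id] = idx
        return self.tiers[idx]

    def locate(self, blob_id: str) -> str:
        """Tell the composite data store which delegate currently holds the blob."""
        return self.tiers[self.location[blob_id]]
```

Because every move goes through `demote` or `promote`, `locate` never points at a delegate the blob has already left, which is the failure mode described for moves made outside of Oak.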
+ 
+ The following items were discussed:
+  * Should Glacier restore be an administrative task instead of something that occurs as
the result of a standard user request?  Since Glacier restores are expensive, there is a risk
that spurious user requests for blobs could result in unnecessary restores.  This would mean
that the responsibility for unarchiving would reside outside of Oak, either at the application
level or with users themselves, for example via their own AWS console.
+  * In the context of tiered storage in a composite data store, Glacier storage is perhaps
not useful unless S3DataStore is also being used, so we can probably assume S3DataStore,
in which case we can make assumptions about unarchiving.  For example, an AWS Lambda could
be used to move blobs from one store to another.
+  * How does garbage collection work across multiple stores?  (Open issue)
+  * Curator component probably belongs within the context of oak-blob-composite, not as a
separate bundle.
+  * S3 IA (Infrequent Access) has had some very rudimentary testing with Oak and should be
much simpler to use than Glacier.  Would this be sufficient to meet a user desire for lower-cost
storage and thus minimize the need to take on the additional complexity of using Glacier?
+  * How do we get the curator to avoid archiving blobs that should not be archived?
There may be some blobs that we never want to archive.  Some suggestions:
+   * Size (some concerns with this; while it is easy to determine whether a blob exceeds
a minimum size, it is much harder to choose a meaningful threshold.  For example,
blob thumbnails should probably never be archived for user experience purposes, but what
threshold would cover every conceivable thumbnail without also excluding blobs that we do
want to archive?)
+   * Last accessed time
+   * Some other items, like index segments, should never be archived
+   * Could we only archive certain parts of the tree?
+  * The blob store deduplicates blobs, which means multiple nodes may refer to the same blob.
 So the curator would need to be aware of all nodes referring to a blob and only archive it if
all nodes agree it should be archived.  One idea was to use a pattern similar to the one used for
garbage collection, e.g. during a mark phase a blob can be marked as "don't move" if any node
votes that it shouldn't be moved to a lower priority tier.
+  * Once something has been moved, instead of removing the reference at the higher tier the
record could be replaced with a placeholder indicating where it moved to.
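
The mark-phase voting idea from the list above could look roughly like this.  It is a sketch only: `mark_movable`, the `(path, blob_id)` node model and the `may_archive` policy callback are assumptions made for illustration, not part of Oak's garbage collection.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# A node is modelled as (path, blob_id).  This is a simplification for the sketch.
Node = Tuple[str, str]


def mark_movable(nodes: Iterable[Node],
                 may_archive: Callable[[str], bool]) -> Set[str]:
    """Mark phase: a blob is movable only if every referring node agrees."""
    votes: Dict[str, bool] = {}
    for path, blob_id in nodes:
        # A single "don't move" vote pins the blob to its current tier.
        votes[blob_id] = votes.get(blob_id, True) and may_archive(path)
    return {blob_id for blob_id, ok in votes.items() if ok}


# Example policy (hypothetical): blobs referenced from an index are never archived.
def not_under_index(path: str) -> bool:
    return not path.startswith("/oak:index")
```

With nodes `/content/a` and `/oak:index/x` both referring to the same deduplicated blob, that blob stays put, while a blob referenced only from `/content/b` becomes a candidate for a lower tier.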
+ 
+ Toward the end of the discussion, it was brought up that perhaps we could get by without
a Glacier data store delegate for composite data store and still allow users to store things
in Glacier.  It would require customization outside of Oak, which could be either an application
or simple administrative tasks.  In such a case, "archiving" a blob would mean, for example,
that it doesn't get deleted from a higher tier blob store, but rather gets replaced with some sort
of marker that the user can understand as "archived".  The custom code would copy the blob
to the archive.  Unarchiving would require external effort to reverse the process.  So in other
words the "application layer" (which could also just be admin scripts) is responsible for
moving blobs from one tier to the other, as well as keeping track of the state of blobs. 
The difficulty lies in determining which blob should be moved, knowing whether
it can be moved, dealing with multiple references, etc., but this can certainly be
done outside of Oak.
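
The "replace with a marker instead of deleting" approach might be sketched as below.  Everything here is hypothetical: the stores are plain dictionaries standing in for a higher tier and an archive, and `ArchivedMarker`, `archive` and `unarchive` are illustrative names rather than Oak or AWS calls.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class ArchivedMarker:
    """Hypothetical placeholder left in the higher tier in place of the blob."""
    archive_location: str


# The higher tier holds either the blob bytes or a marker saying where they went.
PrimaryStore = Dict[str, Union[bytes, ArchivedMarker]]
ColdStore = Dict[str, bytes]


def archive(primary: PrimaryStore, cold: ColdStore,
            blob_id: str, cold_name: str) -> None:
    """Copy the blob to the archive, then replace it with a marker."""
    data = primary[blob_id]
    if isinstance(data, ArchivedMarker):
        return  # already archived, nothing to do
    cold[blob_id] = data                          # copy first
    primary[blob_id] = ArchivedMarker(cold_name)  # replace, never delete


def unarchive(primary: PrimaryStore, cold: ColdStore, blob_id: str) -> None:
    """Reverse the process: put the bytes back in place of the marker."""
    if isinstance(primary[blob_id], ArchivedMarker):
        primary[blob_id] = cold[blob_id]
```

The record never disappears from the higher tier, so a lookup can tell "archived" apart from "missing" without consulting the archive itself.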
+ 
+ Action plan:  MattR to explore the need further and determine if using cold storage options
is a real requirement and if doing it in Oak is really needed.
+ 
