jackrabbit-commits mailing list archives

From mreut...@apache.org
Subject svn commit: r1835390 [8/23] - in /jackrabbit/site/live/oak/docs: ./ architecture/ coldstandby/ features/ nodestore/ nodestore/document/ nodestore/segment/ oak-mongo-js/ oak_api/ plugins/ query/ security/ security/accesscontrol/ security/authentication/...
Date Mon, 09 Jul 2018 08:53:19 GMT
Modified: jackrabbit/site/live/oak/docs/plugins/blobstore.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/plugins/blobstore.html?rev=1835390&r1=1835389&r2=1835390&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/plugins/blobstore.html (original)
+++ jackrabbit/site/live/oak/docs/plugins/blobstore.html Mon Jul  9 08:53:17 2018
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2018-05-24 
+ | Generated by Apache Maven Doxia Site Renderer 1.8.1 at 2018-07-09 
  | Rendered using Apache Maven Fluido Skin 1.6
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20180524" />
+    <meta name="Date-Revision-yyyymmdd" content="20180709" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak &#x2013; The Blob Store</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
@@ -136,7 +136,7 @@
 
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2018-05-24<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2018-07-09<span class="divider">|</span>
 </li>
           <li id="projectVersion">Version: 1.10-SNAPSHOT</li>
         </ul>
@@ -241,87 +241,79 @@
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-  --><div class="section">
+  -->
+<div class="section">
 <h2><a name="The_Blob_Store"></a>The Blob Store</h2>
 <p>The Oak BlobStore is similar to the Jackrabbit 2.x DataStore. However, the BlobStore
tries to address a few minor problems of the Jackrabbit DataStore:</p>
-
 <ul>
-  
+
 <li>
-<p>a temporary file is created when adding a large binary,  even if the binary already
exists</p></li>
-  
+
+<p>a temporary file is created when adding a large binary, even if the binary already
exists</p>
+</li>
 <li>
-<p>sharding is slow and complicated because the hash needs to be calculated  first,
before the binary is stored in the target shard (the FileDataStore  still doesn&#x2019;t
support sharding the directory currently)</p></li>
-  
+
+<p>sharding is slow and complicated because the hash needs to be calculated first,
before the binary is stored in the target shard (the FileDataStore still doesn&#x2019;t
support sharding the directory currently)</p>
+</li>
 <li>
-<p>file handles are kept open until the consumer is done reading, which  complicates
the code, and we could potentially get &#x201c;too many open files&#x201d;  when the
consumer doesn&#x2019;t close the stream</p></li>
-  
+
+<p>file handles are kept open until the consumer is done reading, which complicates
the code, and we could potentially get &#x201c;too many open files&#x201d; when the
consumer doesn&#x2019;t close the stream</p>
+</li>
 <li>
-<p>for database based data stores, there is a similar (even worse) problem  that streams
are kept open, which means we need to use connection pooling,  and if the user doesn&#x2019;t
close the stream we could run out of connections</p></li>
-  
+
+<p>for database based data stores, there is a similar (even worse) problem that streams
are kept open, which means we need to use connection pooling, and if the user doesn&#x2019;t
close the stream we could run out of connections</p>
+</li>
 <li>
-<p>for database based data stores, for some databases (MySQL), binaries are  fully
read in memory, which results in out-of-memory</p></li>
-  
+
+<p>for database based data stores, for some databases (MySQL), binaries are fully read
in memory, which results in out-of-memory</p>
+</li>
 <li>
-<p>binaries that are similar are always stored separately no matter what</p></li>
+
+<p>binaries that are similar are always stored separately no matter what</p>
+</li>
 </ul>
 <p>Those problems are solved in Oak BlobStores, because binaries are split into blocks
of 2 MB. This is similar to how <a class="externalLink" href="http://serverfault.com/questions/52861/how-does-dropbox-version-upload-large-files">DropBox
works internally</a>. Blocks are processed in memory so that temp files are never needed,
and blocks are cached. File handles don&#x2019;t need to be kept open. Sharding is trivial
because each block is processed separately.</p>
 <p>Binaries that are similar: in the BlobStore, currently, they are stored separately
except if some of the 2 MB blocks match. However, the algorithm in the BlobStore would allow
re-using all matching parts, because in the BlobStore, concatenating blob ids means concatenating
the data.</p>
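+<p>As an illustration of the API described above, the following minimal sketch (directory
path and sample data are illustrative) writes a binary through the <tt>BlobStore</tt>
interface, receives a blob id for the chunked content, and streams it back:</p>
+<div>
+<div>
+<pre class="source">import java.io.ByteArrayInputStream;
+import java.io.InputStream;
+
+import org.apache.jackrabbit.oak.spi.blob.BlobStore;
+import org.apache.jackrabbit.oak.spi.blob.FileBlobStore;
+
+public class BlobStoreExample {
+    public static void main(String[] args) throws Exception {
+        // FileBlobStore keeps the content in chunks on the file system
+        BlobStore blobStore = new FileBlobStore("/path/to/blobstore");
+        byte[] data = "hello, blob store".getBytes("UTF-8");
+        // writeBlob splits the stream into blocks and returns the blob id
+        String blobId = blobStore.writeBlob(new ByteArrayInputStream(data));
+        System.out.println("blob id: " + blobId
+                + ", length: " + blobStore.getBlobLength(blobId));
+        // the caller is responsible for closing the stream (see the file
+        // handle discussion above)
+        try (InputStream in = blobStore.getInputStream(blobId)) {
+            while (in.read() != -1) {
+                // consume the content
+            }
+        }
+    }
+}
+</pre></div></div>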
 <p>Another change was that most DataStore implementations use SHA-1, while the BlobStore
uses SHA-256. Using SHA-256 will be a requirement at some point, see also <a class="externalLink"
href="http://en.wikipedia.org/wiki/SHA-2">http://en.wikipedia.org/wiki/SHA-2</a>
&#x201c;Federal agencies &#x2026; must use the SHA-2 family of hash functions for
these applications after 2010&#x201d;. This might affect some potential users.</p>
 <div class="section">
 <h3><a name="Support_for_Jackrabbit_2_DataStore"></a>Support for Jackrabbit
2 DataStore</h3>
-<p>Jackrabbit 2 used <a class="externalLink" href="http://wiki.apache.org/jackrabbit/DataStore">DataStore</a>
to store blobs. Oak supports usage of such DataStore via <tt>DataStoreBlobStore</tt>
wrapper. This allows usage of <tt>FileDataStore</tt> and <tt>S3DataStore</tt>
with Oak NodeStore implementations. </p></div>
+<p>Jackrabbit 2 used <a class="externalLink" href="http://wiki.apache.org/jackrabbit/DataStore">DataStore</a>
to store blobs. Oak supports usage of such DataStore via <tt>DataStoreBlobStore</tt>
wrapper. This allows usage of <tt>FileDataStore</tt> and <tt>S3DataStore</tt>
with Oak NodeStore implementations.</p></div>
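+<p>A minimal sketch of such wrapping follows (paths are illustrative; the exact wiring
into a NodeStore depends on the deployment):</p>
+<div>
+<div>
+<pre class="source">import org.apache.jackrabbit.core.data.FileDataStore;
+
+import org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore;
+import org.apache.jackrabbit.oak.spi.blob.BlobStore;
+
+public class DataStoreWrapperExample {
+    public static void main(String[] args) throws Exception {
+        // configure and initialize a Jackrabbit 2 FileDataStore
+        FileDataStore delegate = new FileDataStore();
+        delegate.setPath("/path/to/datastore");
+        delegate.init("/path/to/repository/home");
+        // wrap it so Oak can use it as a BlobStore
+        BlobStore blobStore = new DataStoreBlobStore(delegate);
+        System.out.println("wrapped data store: " + blobStore);
+    }
+}
+</pre></div></div>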
 <div class="section">
 <h3><a name="NodeStore_and_BlobStore"></a>NodeStore and BlobStore</h3>
 <p>Currently Oak provides two NodeStore implementations, i.e. <tt>SegmentNodeStore</tt>
and <tt>DocumentNodeStore</tt>. Further, Oak ships with multiple BlobStore implementations:</p>
-
 <ol style="list-style-type: decimal">
-  
+
 <li><tt>FileBlobStore</tt> - Stores the file contents in chunks on file
system</li>
-  
-<li><tt>MongoBlobStore</tt> - Stores the file content in chunks in Mongo.
Typically used with  <tt>DocumentNodeStore</tt> when running on Mongo by default</li>
-  
-<li><tt>FileDataStore</tt> (with wrapper) - Stores the file on file system
without breaking it into  chunks. Mostly used when blobs have to shared between multiple repositories.
Also used by  default when migrating Jackrabbit 2 repositories to Oak</li>
-  
+<li><tt>MongoBlobStore</tt> - Stores the file content in chunks in Mongo.
Used by default with <tt>DocumentNodeStore</tt> when running on Mongo</li>
+<li><tt>FileDataStore</tt> (with wrapper) - Stores the file on the file system
without breaking it into chunks. Mostly used when blobs have to be shared between multiple repositories.
Also used by default when migrating Jackrabbit 2 repositories to Oak</li>
 <li><tt>S3DataStore</tt> (with wrapper) - Stores the file in Amazon S3</li>
-  
-<li><tt>RDBBlobStore</tt> - Store the file contents in chunks in a relational
databases. Typically used with  <tt>DocumentNodeStore</tt>when using a relational
DB persistence</li>
-  
+<li><tt>RDBBlobStore</tt> - Stores the file contents in chunks in a relational
database. Typically used with <tt>DocumentNodeStore</tt> when using a relational
DB persistence</li>
 <li><tt>AzureDataStore</tt> (with wrapper) - Stores the file in Microsoft
Azure Blob storage</li>
 </ol>
 <p>In addition there are some more implementations which are considered <b>experimental</b></p>
-
 <ol style="list-style-type: decimal">
-  
+
 <li><tt>CloudBlobStore</tt> - Stores the file chunks in cloud storage
using the <a class="externalLink" href="http://jclouds.apache.org/start/blobstore/">JClouds
BlobStore API</a>.</li>
-  
 <li><tt>MongoGridFSBlobStore</tt> - Stores the file chunks in Mongo using
GridFS support</li>
 </ol>
-<p>Depending on NodeStore type and usage requirement these can be configured to use
a particular BlobStore implementation. For OSGi env refer to <a href="../osgi_config.html#config-blobstore">Configuring
DataStore/BlobStore</a></p>
+<p>Depending on NodeStore type and usage requirement these can be configured to use
a particular BlobStore implementation. For OSGi environments refer to <a href="../osgi_config.html#config-blobstore">Configuring
DataStore/BlobStore</a></p>
 <div class="section">
 <h4><a name="SegmentNodeStore_TarMK"></a>SegmentNodeStore (TarMK)</h4>
-<p>By default SegmentNodeStore (aka TarMK) does not require a BlobStore. Instead the
binary content is directly stored as part of segment blob itself. Depending on requirements
one of the following can be used </p>
-
+<p>By default SegmentNodeStore (aka TarMK) does not require a BlobStore. Instead the
binary content is directly stored as part of the segment blob itself. Depending on requirements
one of the following can be used (see the sketch after this list):</p>
 <ul>
-  
-<li>FileDataStore - This should be used if the blobs/binaries have to be shared between
multiple  repositories. This would also be used when a JR2 repository is migrated to Oak</li>
-  
+
+<li>FileDataStore - This should be used if the blobs/binaries have to be shared between
multiple repositories. This would also be used when a JR2 repository is migrated to Oak</li>
 <li>S3DataStore - This should be used when binaries are stored in Amazon S3</li>
-  
 <li>AzureDataStore - This should be used when binaries are stored in Microsoft Azure
Blob storage</li>
 </ul></div>
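+<p>For example, a TarMK FileStore backed by a FileDataStore could be assembled as in the
following sketch (paths are illustrative and the oak-segment-tar builder API is assumed):</p>
+<div>
+<div>
+<pre class="source">import java.io.File;
+
+import org.apache.jackrabbit.core.data.FileDataStore;
+import org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore;
+import org.apache.jackrabbit.oak.segment.SegmentNodeStoreBuilders;
+import org.apache.jackrabbit.oak.segment.file.FileStore;
+import org.apache.jackrabbit.oak.segment.file.FileStoreBuilder;
+import org.apache.jackrabbit.oak.spi.state.NodeStore;
+
+public class SegmentWithDataStoreExample {
+    public static void main(String[] args) throws Exception {
+        FileDataStore fds = new FileDataStore();
+        fds.setPath("/path/to/datastore");
+        fds.init("/path/to/repository/home");
+
+        // binaries go to the FileDataStore instead of the segments
+        FileStore fileStore = FileStoreBuilder
+                .fileStoreBuilder(new File("/path/to/segmentstore"))
+                .withBlobStore(new DataStoreBlobStore(fds))
+                .build();
+        try {
+            NodeStore nodeStore = SegmentNodeStoreBuilders.builder(fileStore).build();
+            System.out.println("node store ready: " + nodeStore);
+        } finally {
+            fileStore.close();
+        }
+    }
+}
+</pre></div></div>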
 <div class="section">
 <h4><a name="DocumentNodeStore"></a>DocumentNodeStore</h4>
 <p>By default DocumentNodeStore when running on Mongo uses <tt>MongoBlobStore</tt>.
Depending on requirements one of the following can be used</p>
-
 <ul>
-  
+
 <li>MongoBlobStore - Used by default and recommended only for development and testing.</li>
-  
-<li>FileDataStore - This should be used if the binaries have to be stored on the file
system. This  would also be used when a JR2 repository is migrated to Oak</li>
-  
-<li>S3DataStore - This should be used when binaries are stored in Amazon S3. Typically
used when running  in Amazon AWS</li>
-  
+<li>FileDataStore - This should be used if the binaries have to be stored on the file
system. This would also be used when a JR2 repository is migrated to Oak</li>
+<li>S3DataStore - This should be used when binaries are stored in Amazon S3. Typically
used when running in Amazon AWS</li>
 <li>AzureDataStore - This should be used when binaries are stored in Microsoft Azure
Blob storage</li>
 </ul></div>
 <div class="section">
@@ -329,215 +321,143 @@
 <p>The DataStore implementations <tt>S3DataStore</tt>,<tt>CachingFileDataStore</tt>
and <tt>AzureDataStore</tt> support local file system caching for the files/blobs
and extend the <tt>AbstractSharedCachingDataStore</tt> class which implements
the caching functionality. The <tt>CachingFileDataStore</tt> is useful when the
DataStore is on nfs. The cache has a size limit and is configured by the <tt>cacheSize</tt>
parameter.</p></div>
 <div class="section">
 <h4><a name="Downloads"></a>Downloads</h4>
-<p>The local cache will be checked for existence of the record corresponding to the
requested file/blob before accessing it  from the DataStore. When the cache exceeds the limit
configured while adding a file into the cache then some of the  file(s) will be evicted to
reclaim space. </p></div>
+<p>The local cache will be checked for the existence of the record corresponding to the
requested file/blob before accessing it from the DataStore. When adding a file into the cache
would exceed the configured limit, some of the file(s) will be evicted to
reclaim space.</p></div>
 <div class="section">
 <h4><a name="Asynchronous_Uploads"></a>Asynchronous Uploads</h4>
-<p>The cache also supports asynchronous uploads to the DataStore. The files are staged
locally in the cache on the file system and an asynchronous job started to upload the file.
The number of asynchronous uploads are limited by the  size of staging cache configured by
the <tt>stagingSplitPercentage</tt> parameter and is by default set to 10. This
defines  the ratio of the <tt>cacheSize</tt> to be dedicated for the staging cache.
The percentage of cache available for downloads is  calculated as (100 - stagingSplitPerentage)
* cacheSize (by default 90). The asynchronous uploads are also  multi-threaded and is governed
by the <tt>uploadThreads</tt> configuration parameter. The default value is 10.</p>
-<p>The files are moved to the main download cache after the uploads are complete. When
the staging cache exceeds the limit,  the files are uploaded synchronously to the DataStore
until the previous asynchronous uploads are complete and space  available in the staging cache.
The uploaded files are removed from the staging area by a periodic job whose interval  is
configured by the <tt>stagingPurgeInterval</tt> configuration parameter. The default
value is 300 seconds.</p>
+<p>The cache also supports asynchronous uploads to the DataStore. The files are staged
locally in the cache on the file system and an asynchronous job is started to upload the file.
The number of asynchronous uploads is limited by the size of the staging cache configured by
the <tt>stagingSplitPercentage</tt> parameter, which is by default set to 10. This
defines the percentage of the <tt>cacheSize</tt> dedicated to the staging cache.
The size of the cache available for downloads is calculated as (100 - stagingSplitPercentage)
* cacheSize / 100 (by default 90% of <tt>cacheSize</tt>). The asynchronous uploads are also multi-threaded and are governed
by the <tt>uploadThreads</tt> configuration parameter. The default value is 10.</p>
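+<p>The split between staging and download cache can be illustrated with a small
calculation (the 64 GB <tt>cacheSize</tt> is an assumed value):</p>
+<div>
+<div>
+<pre class="source">public class CacheSplitExample {
+    public static void main(String[] args) {
+        long cacheSize = 64L * 1024 * 1024 * 1024;  // configured cacheSize in bytes (assumed 64 GB)
+        int stagingSplitPercentage = 10;            // default
+        long stagingCache  = cacheSize * stagingSplitPercentage / 100;
+        long downloadCache = cacheSize * (100 - stagingSplitPercentage) / 100;
+        System.out.println("staging cache bytes:  " + stagingCache);   // 10% of cacheSize
+        System.out.println("download cache bytes: " + downloadCache);  // 90% of cacheSize
+    }
+}
+</pre></div></div>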
+<p>The files are moved to the main download cache after the uploads are complete. When
the staging cache exceeds the limit, the files are uploaded synchronously to the DataStore
until the previous asynchronous uploads are complete and space is available in the staging cache.
The uploaded files are removed from the staging area by a periodic job whose interval is configured
by the <tt>stagingPurgeInterval</tt> configuration parameter. The default value
is 300 seconds.</p>
 <p>Any failed uploads (due to various reasons, e.g. network disruption) are put on a
retry queue and retried periodically with the configured interval <tt>stagingRetryInterval</tt>.
The default value is 600 seconds.</p></div>
 <div class="section">
 <h4><a name="Caching_Stats"></a>Caching Stats</h4>
 <p>The <tt>ConsolidatedDataStoreCacheStats</tt> is registered as an MBean
and provides a snapshot of the cache performance for both the download and the upload staging
cache.</p>
 <p><img src="../img/datastore-cache-stats.png" alt="datastore cache stats" /></p>
 <p>The following table explains the different statistics exposed for both type of caches</p>
-
 <table border="0" class="table table-striped">
-  <thead>
-    
+<thead>
+
 <tr class="a">
-      
-<th align="center">Parameters </th>
-      
-<th align="center">DataStore-DownloadCache </th>
-      
-<th align="center">DataStore-StagingCache </th>
-    </tr>
-  </thead>
-  <tbody>
-    
+<th align="center"> Parameters        </th>
+<th align="center"> DataStore-DownloadCache                     </th>
+<th align="center"> DataStore-StagingCache </th></tr>
+</thead><tbody>
+
 <tr class="b">
-      
-<td align="center">elementCount </td>
-      
-<td align="center">Number of files cached </td>
-      
-<td align="center">Pending file uploads in cache</td>
-    </tr>
-    
+<td align="center">elementCount       </td>
+<td align="center"> Number of files cached                      </td>
+<td align="center"> Pending file uploads in cache</td></tr>
 <tr class="a">
-      
-<td align="center">requestCount </td>
-      
-<td align="center">Number of files requested from cache </td>
-      
-<td align="center">Number of file uploads requested</td>
-    </tr>
-    
+<td align="center">requestCount       </td>
+<td align="center"> Number of files requested from cache        </td>
+<td align="center"> Number of file uploads requested</td></tr>
 <tr class="b">
-      
-<td align="center">hitCount </td>
-      
-<td align="center">Number of files served from cache </td>
-      
-<td align="center">Number of files uploaded asynchronously</td>
-    </tr>
-    
+<td align="center">hitCount           </td>
+<td align="center"> Number of files served from cache           </td>
+<td align="center"> Number of files uploaded asynchronously</td></tr>
 <tr class="a">
-      
-<td align="center">hitRate </td>
-      
-<td align="center">Ratio of hits to requests </td>
-      
-<td align="center">Ratio of hits to requests</td>
-    </tr>
-    
+<td align="center">hitRate            </td>
+<td align="center"> Ratio of hits to requests                   </td>
+<td align="center"> Ratio of hits to requests</td></tr>
 <tr class="b">
-      
-<td align="center">loadCount </td>
-      
-<td align="center">Number of files loaded when not in cache </td>
-      
-<td align="center">Number of file requests from cache</td>
-    </tr>
-    
+<td align="center">loadCount          </td>
+<td align="center"> Number of files loaded when not in cache    </td>
+<td align="center"> Number of file requests from cache</td></tr>
 <tr class="a">
-      
-<td align="center">loadSuccessCount </td>
-      
-<td align="center">Number of files successfully loaded </td>
-      
-<td align="center">Number of file requests served from cache</td>
-    </tr>
-    
+<td align="center">loadSuccessCount   </td>
+<td align="center"> Number of files successfully loaded         </td>
+<td align="center"> Number of file requests served from cache</td></tr>
 <tr class="b">
-      
 <td align="center">loadExceptionCount </td>
-      
-<td align="center">Number of load file unsuccessful </td>
-      
-<td align="center">Number of file requests not in cache</td>
-    </tr>
-    
+<td align="center"> Number of load file unsuccessful            </td>
+<td align="center"> Number of file requests not in cache</td></tr>
 <tr class="a">
-      
-<td align="center">maxWeight </td>
-      
-<td align="center">Max cache size (bytes) </td>
-      
-<td align="center">Max cache size (bytes)</td>
-    </tr>
-    
+<td align="center">maxWeight          </td>
+<td align="center"> Max cache size (bytes)                      </td>
+<td align="center"> Max cache size (bytes)</td></tr>
 <tr class="b">
-      
-<td align="center">totalWeight </td>
-      
-<td align="center">Current size of cache (bytes </td>
-      
-<td align="center">Current size of cache (bytes)bytes</td>
-    </tr>
-    
+<td align="center">totalWeight        </td>
+<td align="center"> Current size of cache (bytes                </td>
+<td align="center"> Current size of cache (bytes)bytes</td></tr>
 <tr class="a">
-      
-<td align="center">totalMemWeight </td>
-      
-<td align="center">Approximate size of cache in-memory (bytes) </td>
-      
-<td align="center">Approximate size of cache in memory (bytes)</td>
-    </tr>
-  </tbody>
+<td align="center">totalMemWeight     </td>
+<td align="center"> Approximate size of cache in-memory (bytes) </td>
+<td align="center"> Approximate size of cache in memory (bytes)</td></tr>
+</tbody>
 </table>
 <p>The parameters above can be used to size the cache. For example:</p>
-
 <ul>
-  
+
 <li>The hitRate is an important parameter: if it is much below 1, the cache is too small
for the load and should be increased.</li>
-  
 <li>If the staging cache has a low hit ratio, the download cache has a high hit ratio and
its current size is much less than the maxSize, then it is better to increase the <tt>stagingSplitPercentage</tt>
parameter.</li>
 </ul>
-<p>The MBean also exposes a method <tt>isFileSynced</tt> which takes a
node path of a binary and returns whether the associated  file/blob has been uploaded to the
DataStore.</p></div>
+<p>The MBean also exposes a method <tt>isFileSynced</tt> which takes a
node path of a binary and returns whether the associated file/blob has been uploaded to the
DataStore.</p></div>
 <div class="section">
 <h4><a name="Upgrade_Pre_Oak_1.6_caching"></a>Upgrade (Pre Oak 1.6 caching)</h4>
-<p>When upgrading from the older cache implementation the process should be seamless
and any pending uploads would be scheduled for upload and any previously downloaded files
in the cache will be put in the cache on initialization. There is a slight difference in the
structure of the local file system cache directory. Whereas in the older cache structure both
the downloaded and the upload files were put directly under the cache path. The newer structure
segregates the downloads and uploads and stores them under cache path under the directories
<tt>download</tt> and <tt>upload</tt><br />respectively.</p>
+<p>When upgrading from the older cache implementation the process should be seamless:
any pending uploads are scheduled for upload and any previously downloaded files
are put in the cache on initialization. There is a slight difference in the
structure of the local file system cache directory: in the older cache structure both
the downloaded and the uploaded files were put directly under the cache path, whereas the newer
structure segregates the downloads and uploads and stores them under the cache path in the directories
<tt>download</tt> and <tt>upload</tt> respectively.</p>
 <p>There is also an option to upgrade the cache offline by using the <tt>datastorecacheupgrade</tt>
command of oak-run. The details on how to execute the command and the different parameters
can be checked in the readme for the oak-run module.</p></div></div>
 <div class="section">
 <h3><a name="Blob_Garbage_Collection"></a>Blob Garbage Collection</h3>
 <p>Blob Garbage Collection(GC) is applicable for the following blob stores:</p>
-
 <ul>
-  
+
 <li>
+
 <p>DocumentNodeStore</p>
-  
 <ul>
-    
+
 <li>MongoBlobStore/RDBBlobStore (Default blob stores for RDB &amp; Mongo)</li>
-    
 <li>FileDataStore</li>
-    
 <li>S3DataStore</li>
-    
 <li>SharedS3DataStore (since Oak 1.2.0)</li>
-  </ul></li>
-  
+</ul>
+</li>
 <li>
-<p>SegmentNodeStore </p>
-  
+
+<p>SegmentNodeStore</p>
 <ul>
-    
+
 <li>FileDataStore</li>
-    
 <li>S3DataStore</li>
-    
 <li>SharedS3DataStore (since Oak 1.2.0)</li>
-    
 <li>AzureDataStore</li>
-  </ul></li>
 </ul>
-<p>Oak implements a Mark and Sweep based Garbage Collection logic. </p>
-
+</li>
+</ul>
+<p>Oak implements a Mark and Sweep based Garbage Collection logic.</p>
 <ol style="list-style-type: decimal">
-  
-<li>Mark Phase - In this phase the binary references are marked in both  BlobStore
and NodeStore
-  
+
+<li>Mark Phase - In this phase the binary references are marked in both BlobStore and
NodeStore
 <ol style="list-style-type: decimal">
-    
-<li>Mark BlobStore - GC logic would make a record of all the blobs  present in the
BlobStore.</li>
-    
-<li>Mark NodeStore - GC logic would make a record of all the blob  references which
are referred by any node present in NodeStore.  Note that any blob references from old revisions
of node would also be  considered as a valid references.</li>
-  </ol></li>
-  
+
+<li>Mark BlobStore - GC logic would make a record of all the blobs present in the BlobStore.</li>
+<li>Mark NodeStore - GC logic would make a record of all the blob references which
are referred to by any node present in the NodeStore. Note that any blob references from old revisions
of a node would also be considered as valid references.</li>
+</ol>
+</li>
 <li>Sweep Phase - In this phase all blob references from the Mark BlobStore phase which
were not found in the Mark NodeStore part would be considered as GC candidates. It would only delete
blobs which are older than a specified time interval (last modified say 24 hrs (default) ago).</li>
 </ol>
 <p>The garbage collection can be triggered by calling:</p>
-
 <ul>
-  
+
 <li><tt>MarkSweepGarbageCollector#collectGarbage()</tt> (Oak 1.0.x)</li>
-  
 <li><tt>MarkSweepGarbageCollector#collectGarbage(false)</tt> (Oak 1.2.x)</li>
-  
 <li>If the MBeans are registered in the MBeanServer then the following can also be
used to trigger GC:
-  
 <ul>
-    
+
+<li><tt>BlobGC#startBlobGC()</tt> which takes in a <tt>markOnly</tt>
boolean parameter to indicate mark only or complete gc (see the sketch after this list)</li>
-  </ul></li>
 </ul>
-<p><a name="blobid-tracker"></a> </p>
+</li>
+</ul>
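+<p>A minimal sketch of the MBean route over a remote JMX connection follows; the JMX
service URL and the MBean name filter are assumptions and have to be adapted to the actual
deployment:</p>
+<div>
+<div>
+<pre class="source">import javax.management.MBeanServerConnection;
+import javax.management.ObjectName;
+import javax.management.remote.JMXConnector;
+import javax.management.remote.JMXConnectorFactory;
+import javax.management.remote.JMXServiceURL;
+
+public class TriggerBlobGC {
+    public static void main(String[] args) throws Exception {
+        // assumed JMX endpoint of the Oak instance
+        JMXServiceURL url =
+            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
+        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
+            MBeanServerConnection conn = connector.getMBeanServerConnection();
+            // look up the blob GC MBean by name; the name filter is an assumption
+            for (ObjectName name : conn.queryNames(
+                    new ObjectName("org.apache.jackrabbit.oak:*"), null)) {
+                if (name.getCanonicalName().contains("BlobGarbageCollection")) {
+                    // markOnly = false triggers mark and sweep
+                    conn.invoke(name, "startBlobGC",
+                            new Object[] { Boolean.FALSE },
+                            new String[] { boolean.class.getName() });
+                }
+            }
+        }
+    }
+}
+</pre></div></div>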
+<p><a name="blobid-tracker"></a></p>
 <div class="section">
 <h4><a name="Caching_of_Blob_ids_locally_Oak_1.6.x"></a>Caching of Blob
ids locally (Oak 1.6.x)</h4>
 <p>For the <tt>FileDataStore</tt>, <tt>S3DataStore</tt> and
<tt>AzureDataStore</tt> the blob ids are cached locally on the disk when they
are created which speeds up the &#x2018;Mark BlobStore&#x2019; phase. The locally
tracked ids are synchronized with the data store periodically to enable other cluster nodes
or different repositories sharing the datastore to get a consolidated list of all blob ids.
The interval of synchronization is defined by the OSGi configuration parameter <tt>blobTrackSnapshotIntervalInSecs</tt>
for the configured NodeStore services.</p>
-<p>If 2 garbage collection cycles are executed within the <tt>blobTrackSnapshotIntervalInSecs</tt>
then there may be warnings in the logs of some missing blob ids which is due to the fact that
the deletions due to earlier gc has not been synchronized with the data store. It&#x2019;s
ok to either ignore these warnings or to adjust the <tt>blobTrackSnapshotIntervalInSecs</tt>
 parameter according to the schedule identified for running blob gc.</p>
+<p>If 2 garbage collection cycles are executed within the <tt>blobTrackSnapshotIntervalInSecs</tt>
then there may be warnings in the logs about some missing blob ids, which is due to the fact that
the deletions from the earlier gc have not yet been synchronized with the data store. It&#x2019;s
ok to either ignore these warnings or to adjust the <tt>blobTrackSnapshotIntervalInSecs</tt>
parameter according to the schedule identified for running blob gc.</p>
 <p>When upgrading an existing system to take advantage of caching the existing blob
ids have to be cached. One of the following should be executed.</p>
-
 <ul>
-  
+
+<li>Use <tt>MarkSweepGarbageCollector#collectGarbage(boolean markOnly, boolean
forceBlobRetrieve)</tt> with <tt>true</tt> for the <tt>forceBlobRetrieve</tt>
parameter to force retrieving blob ids from the datastore and also cache them locally.</li>
-  
 <li>Execute Blob GC before the configured time duration of <tt>blobTrackSnapshotIntervalInSecs</tt>.</li>
-  
 <li>Execute <a href="#consistency-check">consistency check</a> from the
JMX BlobGCMbean before the configured time duration of <tt>blobTrackSnapshotIntervalInSecs</tt>.</li>
-  
 <li>Execute <tt>datastorecheck</tt> command offline using oak-run with
the <tt>--track</tt> option as defined in <a href="#consistency-check">consistency
check</a>.</li>
 </ul></div>
 <div class="section">
@@ -545,13 +465,10 @@
 <div class="section">
 <h5><a name="Registration"></a>Registration</h5>
 <p>On start of a repository configured to use a shared DataStore (same path, S3 bucket
or Azure container), a unique repository id is generated and registered in the NodeStore as
well as the DataStore. In the DataStore this repository id is registered as an empty file
with the format <tt>repository-[repository-id]</tt> (e.g. repository-988373a0-3efb-451e-ab4c-f7e794189273).
This empty file is created under:</p>
-
 <ul>
-  
+
 <li>FileDataStore - Under the root directory configured for the datastore.</li>
-  
 <li>S3DataStore - Under <tt>META</tt> folder in the S3 bucket configured.</li>
-  
 <li>AzureDataStore - Under <tt>META</tt> folder in the Azure container
configured.</li>
 </ul>
 <p>On start/configuration of all the repositories sharing the data store it should
be confirmed that the unique repositoryId per repository is registered in the DataStore. Refer
to the section below on <a href="#check-shared-datastore-gc">Checking Shared GC status</a>.</p></div>
@@ -559,77 +476,64 @@
 <h5><a name="Execution"></a>Execution</h5>
 <p>The high-level process for garbage collection is still the same as described above.
But to support blob garbage collection in a shared DataStore the Mark and Sweep phase can
be run independently.</p>
 <p>The details of the process are as follows:</p>
-
 <ul>
-  
+
 <li>The Mark NodeStore phase has to be executed for each of the repositories sharing
the DataStore.
-  
 <ul>
-    
+
 <li>This can be executed by running <tt>MarkSweepGarbageCollector#collectGarbage(true)</tt>,
where true indicates mark only.</li>
-    
 <li>All the references are collected in the DataStore in a file with the format <tt>references-[repository-id]</tt>
(e.g. references-988373a0-3efb-451e-ab4c-f7e794189273).</li>
-  </ul></li>
-  
+</ul>
+</li>
 <li>On completion of the above process on all repositories, the sweep phase needs
to be triggered.
-  
 <ul>
-    
+
 <li>This can be executed by running <tt>MarkSweepGarbageCollector#collectGarbage(false)</tt>
on one of the repositories, where false indicates to run sweep also.</li>
-    
 <li>The sweep process checks for availability of the references file from all registered
repositories (all repositories corresponding to the <tt>repository-[repositoryId]</tt>
files available) and aborts otherwise.</li>
-    
 <li>All the references available are collected.</li>
-    
-<li>All the blobs available in the DataStore are collected and deletion candidates
identified by calculating all the blobs available not appearing in the blobs referenced. Only
blobs older than a specified time interval from the earliest available references file are
deleted. (last modified say 24 hrs (default)). The earliest references are  identified by
means of a timestamp marker file (<tt>markedTimestamp-[repositoryId]</tt>) for
each repository.</li>
-  </ul></li>
+<li>All the blobs available in the DataStore are collected and deletion candidates
identified by calculating all the blobs available not appearing in the blobs referenced. Only
blobs older than a specified time interval from the earliest available references file are
deleted. (last modified say 24 hrs (default)). The earliest references are identified by means
of a timestamp marker file (<tt>markedTimestamp-[repositoryId]</tt>) for each
repository.</li>
+</ul>
+</li>
 </ul>
 <p>The shared DataStore garbage collection is applicable for the following DataStore(s):</p>
-
 <ul>
-  
+
 <li>FileDataStore</li>
-  
-<li>SharedS3DataStore - Extends the S3DataStore to enable sharing of the data store
with  multiple repositories</li>
+<li>SharedS3DataStore - Extends the S3DataStore to enable sharing of the data store
with multiple repositories</li>
 </ul>
-<p><a name="check-shared-datastore-gc"></a> </p></div>
+<p><a name="check-shared-datastore-gc"></a></p></div>
 <div class="section">
 <h5><a name="Checking_GC_status_for_Shared_DataStore_Garbage_Collection"></a>Checking
GC status for Shared DataStore Garbage Collection</h5>
 <p>The status of the GC operations on all the repositories connected to the DataStore
can be checked by calling:</p>
-
 <ul>
-  
+
 <li><tt>MarkSweepGarbageCollector#getStats()</tt> which returns a list
of <tt>GarbageCollectionRepoStats</tt> objects having the following fields:
-  
 <ul>
-    
+
 <li>repositoryId - The repositoryId of the repository
-    
 <ul>
-      
+
 <li>local - a repositoryId tagged with an asterisk (*) indicates that it is the
local instance where the operation ran.</li>
-    </ul></li>
-    
+</ul>
+</li>
 <li>startTime - Start time of the mark operation on the repository</li>
-    
 <li>endTime - End time of the mark operation on the repository</li>
-    
 <li>length - Size of the references file created</li>
-    
 <li>numLines - Number of references available</li>
-  </ul></li>
-  
+</ul>
+</li>
 <li>If the MBeans are registered in the MBeanServer then the following can also be
used to retrieve the status:
-  
 <ul>
-    
+
 <li><tt>BlobGC#getBlobGCStatus()</tt> which returns a CompositeData with
the above fields.</li>
-  </ul></li>
 </ul>
-<p>This operation can also be used to ascertain when the &#x2018;Mark&#x2019;
phase has executed successfully on all the repositories,  as part of the steps to automate
GC in the Shared DataStore configuration. It should be a sufficient condition to check that
the references file is available on all repositories. If the server running Oak has remote
JMX connection enabled the following code example can be used to connect remotely and check
if the mark phase has concluded on all repository instances.</p>
+</li>
+</ul>
+<p>This operation can also be used to ascertain when the &#x2018;Mark&#x2019;
phase has executed successfully on all the repositories, as part of the steps to automate
GC in the Shared DataStore configuration. It should be a sufficient condition to check that
the references file is available on all repositories. If the server running Oak has remote
JMX connection enabled the following code example can be used to connect remotely and check
if the mark phase has concluded on all repository instances.</p>
 
-<div class="source">
-<div class="source"><pre class="prettyprint">import java.util.Hashtable;
+<div>
+<div>
+<pre class="source">import java.util.Hashtable;
 
 import javax.management.openmbean.TabularData;
 import javax.management.MBeanServerConnection;
@@ -691,54 +595,46 @@ public class GetGCStats {
         return markDoneOnOthers;
     }
 }
-</pre></div></div></div>
+</pre></div></div>
+</div>
 <div class="section">
 <h5><a name="Unregistration"></a>Unregistration</h5>
 <p>If a repository no longer shares the DataStore then it needs to be unregistered
from the shared DataStore by following the steps:</p>
-
 <ul>
-  
+
 <li>Identify the repositoryId for the repository using the steps above.</li>
-  
 <li>Remove the corresponding registered repository file (<tt>repository-[repositoryId]</tt>)
from the DataStore
-  
 <ul>
-    
+
 <li>FileDataStore - Remove the file from the data store root directory.</li>
-    
 <li>S3DataStore - Remove the file from the <tt>META</tt> folder of the
S3 bucket.</li>
-    
 <li>AzureDataStore - Remove the file from the <tt>META</tt> folder of the
Azure container.</li>
-  </ul></li>
-  
+</ul>
+</li>
 <li>Remove other files corresponding to the particular repositoryId e.g. <tt>markedTimestamp-[repositoryId]</tt>
or <tt>references-[repositoryId]</tt>.</li>
 </ul>
-<p><a name="consistency-check"></a> </p></div></div>
+<p><a name="consistency-check"></a></p></div></div>
 <div class="section">
 <h4><a name="Consistency_Check"></a>Consistency Check</h4>
 <p>The data store consistency check will report any data store binaries that are missing
but are still referenced. The consistency check can be triggered by:</p>
-
 <ul>
-  
+
 <li><tt>MarkSweepGarbageCollector#checkConsistency</tt></li>
-  
 <li>If the MBeans are registered in the MBeanServer then the following can also be
used:
-  
 <ul>
-    
+
 <li><tt>BlobGCMbean#checkConsistency</tt></li>
-  </ul></li>
 </ul>
-<p>After the consistency check is complete, a message will show the number of binaries
reported as missing. If the number is greater than 0, check the logs configured for <tt>org.apache.jackrabbit.oak.plugins.blob
-.MarkSweepGarbageCollector</tt> for more details on the missing binaries. </p>
+</li>
+</ul>
+<p>After the consistency check is complete, a message will show the number of binaries
reported as missing. If the number is greater than 0, check the logs configured for
<tt>org.apache.jackrabbit.oak.plugins.blob.MarkSweepGarbageCollector</tt> for more details on the missing binaries.</p>
 <p>Below is an example of how the missing binaries are reported in the logs:</p>
-
 <blockquote>
-<p>11:32:39.673 INFO [main] MarkSweepGarbageCollector.java:600 Consistency check found
<a class="externalLink" href="http://serverfault.com/questions/52861/how-does-dropbox-version-upload-large-files">1</a>
missing blobs 11:32:39.673 WARN [main] MarkSweepGarbageCollector.java:602 Consistency check
failure in the the blob store : DataStore backed BlobStore [org.apache.jackrabbit.oak.plugins.blob.datastore.OakFileDataStore],
check missing candidates in file /tmp/gcworkdir-1467352959243/gccand-1467352959243 </p>
-</blockquote>
 
+<p>11:32:39.673 INFO [main] MarkSweepGarbageCollector.java:600 Consistency check found
1 missing blobs 11:32:39.673 WARN [main] MarkSweepGarbageCollector.java:602 Consistency check
failure in the the blob store : DataStore backed BlobStore [org.apache.jackrabbit.oak.plugins.blob.datastore.OakFileDataStore],
check missing candidates in file /tmp/gcworkdir-1467352959243/gccand-1467352959243</p>
+</blockquote>
 <ul>
-  
+
 <li><tt>datastorecheck</tt> command of oak-run can also be used to execute
a consistency check on the datastore. The details on how to execute the command and the different
parameters can be checked in the readme for the oak-run module.</li>
 </ul></div></div></div>
         </div>

Modified: jackrabbit/site/live/oak/docs/query/flags.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/flags.html?rev=1835390&r1=1835389&r2=1835390&view=diff
==============================================================================
--- jackrabbit/site/live/oak/docs/query/flags.html (original)
+++ jackrabbit/site/live/oak/docs/query/flags.html Mon Jul  9 08:53:17 2018
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2018-05-24 
+ | Generated by Apache Maven Doxia Site Renderer 1.8.1 at 2018-07-09 
  | Rendered using Apache Maven Fluido Skin 1.6
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20180524" />
+    <meta name="Date-Revision-yyyymmdd" content="20180709" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Jackrabbit Oak &#x2013; Flags</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
@@ -136,7 +136,7 @@
 
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2018-05-24<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2018-07-09<span class="divider">|</span>
 </li>
           <li id="projectVersion">Version: 1.10-SNAPSHOT</li>
         </ul>
@@ -240,13 +240,14 @@
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-  --><div class="section">
+  -->
+<div class="section">
 <h2><a name="Flags"></a>Flags</h2>
 <p>List of available flags to enable/disable options in the query engine</p>
 <div class="section">
 <div class="section">
 <h4><a name="oak.queryFullTextComparisonWithoutIndex"></a>oak.queryFullTextComparisonWithoutIndex</h4>
-<p><tt>@since 1.2.0</tt> </p>
+<p><tt>@since 1.2.0</tt></p>
 <p>Default is <tt>false</tt>. If provided on the command line like <tt>-Doak.queryFullTextComparisonWithoutIndex=true</tt>
it will allow the query engine to parse full text conditions even if no full-text indexes
are defined.</p></div>
 <div class="section">
 <h4><a name="oak.query.sql2optimisation"></a>oak.query.sql2optimisation</h4>


