jackrabbit-commits mailing list archives

From chet...@apache.org
Subject svn commit: r1802902 - /jackrabbit/site/live/oak/docs/query/pre-extract-text.html
Date Tue, 25 Jul 2017 08:55:26 GMT
Author: chetanm
Date: Tue Jul 25 08:55:26 2017
New Revision: 1802902

URL: http://svn.apache.org/viewvc?rev=1802902&view=rev
OAK-6370 - Improve documentation for text pre-extraction


Modified: jackrabbit/site/live/oak/docs/query/pre-extract-text.html
URL: http://svn.apache.org/viewvc/jackrabbit/site/live/oak/docs/query/pre-extract-text.html?rev=1802902&r1=1802901&r2=1802902&view=diff
--- jackrabbit/site/live/oak/docs/query/pre-extract-text.html (original)
+++ jackrabbit/site/live/oak/docs/query/pre-extract-text.html Tue Jul 25 08:55:26 2017
@@ -1,15 +1,15 @@
 <!DOCTYPE html>
- | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-18 
+ | Generated by Apache Maven Doxia Site Renderer 1.7.4 at 2017-07-25 
  | Rendered using Apache Maven Fluido Skin 1.6
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20170718" />
+    <meta name="Date-Revision-yyyymmdd" content="20170725" />
     <meta http-equiv="Content-Language" content="en" />
-    <title>Jackrabbit Oak &#x2013; Pre-Extracting Text from Binaries</title>
+    <title>Jackrabbit Oak &#x2013; <a name="pre-extract-text"></a>Pre-Extracting
Text from Binaries</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.6.min.css" />
     <link rel="stylesheet" href="../css/site.css" />
     <link rel="stylesheet" href="../css/print.css" media="print" />
@@ -131,7 +131,7 @@
       <div id="breadcrumbs">
         <ul class="breadcrumb">
-        <li id="publishDate">Last Published: 2017-07-18<span class="divider">|</span>
+        <li id="publishDate">Last Published: 2017-07-25<span class="divider">|</span>
           <li id="projectVersion">Version: 1.8-SNAPSHOT</li>
@@ -229,7 +229,35 @@
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-  --><h1>Pre-Extracting Text from Binaries</h1>
+  --><h1><a name="pre-extract-text"></a>Pre-Extracting Text from Binaries</h1>
+<li><a href="#pre-extract-text">Pre-Extracting Text from Binaries</a>
+<li><a href="#a-oak-run-command">A - Oak Run Pre-Extraction Command</a>
+<li><a href="#a-setup">Step 1 - oak-run Setup</a></li>
+<li><a href="#a-generate-csv">Step 2 - Generate the csv file</a></li>
+<li><a href="#a-perform-text-extraction">Step 3 - Perform the text extraction</a></li>
+    </ul></li>
+<li><a href="#b-pre-extracted-text-provider">B - PreExtractedTextProvider</a>
+<li><a href="#b-oak-app">Oak application</a></li>
+<li><a href="#b-oak-run">Oak Run Indexing</a></li>
+    </ul></li>
+  </ul></li>
 <p><tt>@since Oak 1.0.18, 1.2.3</tt></p>
 <p>Lucene indexing is performed in a single threaded mode. Extracting text from binaries
is an expensive operation and slows down the indexing rate considerably. For incremental indexing
this mostly works fine, but when reindexing or creating an index for the first time
after a migration it increases the indexing time considerably. To speed up such cases, Oak
supports pre-extracting text from binaries so that no text has to be extracted at indexing time. This
feature consists of 2 broad steps:</p>
@@ -241,21 +269,22 @@
 <p>For more details on this feature refer to <a class="externalLink" href="https://issues.apache.org/jira/browse/OAK-2892">OAK-2892</a></p>
 <div class="section">
-<h2><a name="A_-_Oak_Run_Pre-Extraction_Command"></a>A - Oak Run Pre-Extraction
+<h2><a name="A_-_Oak_Run_Pre-Extraction_Command"></a><a name="a-oak-run-command"></a>A
- Oak Run Pre-Extraction Command</h2>
 <p>The oak-run tool provides a <tt>tika</tt> command which supports traversing
the repository and then extracting text from the binary properties.</p>
 <div class="section">
-<h3><a name="Step_1_-_oak-run_Setup"></a>Step 1 - oak-run Setup</h3>
+<h3><a name="Step_1_-_oak-run_Setup"></a><a name="a-setup"></a>Step
1 - oak-run Setup</h3>
 <p>Download the following jars:</p>
-<li>oak-run 1.7.4</li>
+<li>oak-run 1.7.4 <a class="externalLink" href="https://repo1.maven.org/maven2/org/apache/jackrabbit/oak-run/1.7.4/oak-run-1.7.4.jar">link</a></li>
 <p>Refer to <a href="../features/oak-run-nodestore-connection-options.html">oak-run
setup</a> for details about connecting to different types of NodeStore. The examples below
assume a setup consisting of SegmentNodeStore and FileDataStore. Depending on your setup, use the
appropriate connection options.</p>
 <p>You can use the current oak-run version to perform text extraction for older Oak setups,
i.e. it's fine to use oak-run from the 1.7.x branch to connect to Oak repositories from version
1.0.x or later. The oak-run tooling connects to the repository in read only mode and is hence
safe to use with older versions.</p>
-<p>The generated extracted text dir can then be used with older setup.</p></div>
+<p>The generated extracted text dir can then be used with older setup.</p>
+<p>Of the following steps, Step 2 (generation of the csv file) scans the whole repository,
so it should be run when the system is not in active use. Step 3 only requires access
to the BlobStore and can therefore be run while the Oak application is in use.</p></div>
 <div class="section">
-<h3><a name="Step_2_-_Generate_the_csv_file"></a>Step 2 - Generate the
csv file</h3>
+<h3><a name="Step_2_-_Generate_the_csv_file"></a><a name="a-generate-csv"></a>Step
2 - Generate the csv file</h3>
 <p>As the first step you need to generate a csv file containing details
about the binary properties. This file is generated with the <tt>tika</tt>
command from oak-run. In this step oak-run connects to the repository in read only mode.</p>
 <p>To generate the csv file use the <tt>--generate</tt> action</p>
@@ -280,7 +309,7 @@
 <p>By default it scans the whole repository. To restrict the scan to content under
a certain path, specify that path via the <tt>--path</tt> option.</p></div>
 <div class="section">
-<h3><a name="Step_3_-_Perform_the_text_extraction"></a>Step 3 - Perform
the text extraction</h3>
+<h3><a name="Step_3_-_Perform_the_text_extraction"></a><a name="a-perform-text-extraction"></a>Step
3 - Perform the text extraction</h3>
 <p>Once the csv file is generated we need to perform the text extraction. To do that,
download the <a class="externalLink" href="https://tika.apache.org/download.html">tika-app</a>
jar from Tika downloads. You should be able to use version 1.15 with the Oak 1.7.4 jar.</p>
 <p>To perform the text extraction use the <tt>--extract</tt> action</p>
@@ -304,19 +333,16 @@
 <p>Further, the <tt>extract</tt> phase only needs access to the <tt>BlobStore</tt>
and does not require access to the NodeStore. So it can be run from a different machine (possibly
a more powerful one, to allow use of multiple cores) to speed up text extraction. One can also split
the csv into multiple chunks, process them on different machines and then merge the stores
later. Just ensure that at merge time the blobs*.txt files are also merged.</p>
 <p>Note that we need to launch the command with <tt>-cp</tt> instead of
<tt>-jar</tt> as we need to include classes outside of the oak-run jar, like tika-app.
Also ensure that oak-run comes before tika-app in the classpath. This is required due to some old classes
being packaged in tika-app.</p></div></div>
 <div class="section">
-<h2><a name="B_-_PreExtractedTextProvider"></a>B - PreExtractedTextProvider</h2>
+<h2><a name="B_-_PreExtractedTextProvider"></a><a name="b-pre-extracted-text-provider"></a>B
- PreExtractedTextProvider</h2>
 <p>In this step we configure Oak to make use of the pre-extracted text for
indexing. Depending on how indexing is performed, configure the <tt>PreExtractedTextProvider</tt>
either in OSGi or in the oak-run index command.</p>
 <div class="section">
-<h3><a name="Oak_application"></a>Oak application</h3>
+<h3><a name="Oak_application"></a><a name="b-oak-app"></a>Oak
application</h3>
 <p><tt>@since Oak 1.0.18, 1.2.3</tt></p>
 <p>For this, look for the OSGi config for <tt>Apache Jackrabbit Oak DataStore PreExtractedTextProvider</tt>.</p>
-<div class="source">
-<div class="source"><pre class="prettyprint">![OSGi Configuration](pre-extracted-text-osgi.png)
+<p><img src="pre-extracted-text-osgi.png" alt="OSGi Configuration" /> </p>
 <p>Once <tt>PreExtractedTextProvider</tt> is configured, the Lucene indexer
uses it upon reindexing to check whether text needs to be extracted or not. Check <tt>TextExtractionStatsMBean</tt>
for various statistics around text extraction and also to validate that <tt>PreExtractedTextProvider</tt>
is being used.</p></div>
 <div class="section">
-<h3><a name="Oak_Run_Indexing"></a>Oak Run Indexing</h3>
+<h3><a name="Oak_Run_Indexing"></a><a name="b-oak-run"></a>Oak
Run Indexing</h3>
 <p>Configure the directory storing the pre-extracted text via the <tt>--pre-extracted-text-dir</tt>
option of the <tt>index</tt> command. See <a href="oak-run-indexing.html">oak
run indexing</a>.</p></div></div>
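
The Step 3 text in the diff above mentions splitting the csv into multiple chunks, processing them on different machines, and merging the blobs*.txt files afterwards. The following is a minimal Python sketch of those two bookkeeping steps. It is not part of oak-run, and it rests on assumptions that should be checked against a real setup: the csv has no header row and no record spans more than one line, and each blobs*.txt file is a plain list with one entry per line. The function names `split_csv` and `merge_blob_lists` and the `chunk-N.csv` naming are hypothetical.

```python
from pathlib import Path


def split_csv(csv_path, out_dir, chunks):
    """Distribute the lines of csv_path round-robin over `chunks` files.

    Assumption: no header row, one record per line.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"chunk-{i}.csv" for i in range(chunks)]
    handles = [p.open("w", encoding="utf-8", newline="") for p in paths]
    try:
        with open(csv_path, encoding="utf-8", newline="") as src:
            for n, line in enumerate(src):
                # Lines are copied verbatim, including their line endings.
                handles[n % chunks].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths


def merge_blob_lists(store_dirs, target_dir):
    """Merge same-named blobs*.txt files from several store directories.

    Assumption: each file is a list with one entry per line, so merging
    means taking the union of lines while keeping first-seen order.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for store in store_dirs:
        for blob_file in sorted(Path(store).glob("blobs*.txt")):
            lines = merged.setdefault(blob_file.name, {})
            for line in blob_file.read_text(encoding="utf-8").splitlines():
                if line:
                    lines.setdefault(line, None)  # dict keys act as an ordered set
    for name, lines in merged.items():
        content = "".join(entry + "\n" for entry in lines)
        (target_dir / name).write_text(content, encoding="utf-8")
    return sorted(merged)
```

Note that `merge_blob_lists` only handles the blobs*.txt files. The extracted text files themselves still have to be copied into the merged store directory separately.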
