tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup
Date Sat, 27 May 2017 16:24:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027480#comment-16027480
] 

ASF GitHub Bot commented on TIKA-2298:
--------------------------------------

asmehra95 opened a new pull request #182: Creation of TIKA-2298 contributed by asmehra95-
Import of vgg16 via Deeplearning4j into tika-dl
URL: https://github.com/apache/tika/pull/182
 
 
   <b>Note:</b> This is a modified form of #159, which I raised earlier.
   I have imported the VGG16 model into the tika-dl module using Deeplearning4j.
   This recogniser is used in much the same way as TensorFlowRESTrecogniser, but it doesn't
require any external setup, such as running the REST service that TensorFlowRESTrecogniser needs.
   You can read more about TensorFlowRESTrecogniser at https://wiki.apache.org/tika/TikaAndVision
   
   To use DL4JVGG16Net, set
   the class param to org.apache.tika.dl.imagerec.DL4JVGG16Net and
   modelType to VGG16.
   A sample configuration is given below for reference.
   
   ```
   <?xml version="1.0" encoding="UTF-8"?>
   <properties>
       <parsers>
           <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
               <mime>image/jpeg</mime>
               <params>
                   <param name="topN" type="int">2</param>
                   <param name="minConfidence" type="double">0.015</param>
                   <param name="class" type="string">org.apache.tika.dl.imagerec.DL4JVGG16Net</param>
                   <param name="modelType" type="string">VGG16</param>
                   <param name="serialize" type="string">yes</param>
               </params>
           </parser>
       </parsers>
   </properties>
   ```
   Save the configuration at your preferred location.
   A default one is provided at `tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml`
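
Before launching Tika, it can help to confirm that the file really contains the params listed above. The following is a minimal sketch that uses only the JDK's XML parser, not Tika itself; the class name `RecogniserConfigCheck` and the method `readParams` are hypothetical names introduced for illustration, and the inline XML simply mirrors the sample configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: reads <param> elements from a config file.
final class RecogniserConfigCheck {

    /** Collects name -> value for every <param> element found in the stream. */
    public static Map<String, String> readParams(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            NodeList nodes = doc.getElementsByTagName("param");
            Map<String, String> params = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element param = (Element) nodes.item(i);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
            return params;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Inline copy of the sample configuration, so the sketch runs without any files.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.recognition.ObjectRecognitionParser\">"
                + "<mime>image/jpeg</mime><params>"
                + "<param name=\"topN\" type=\"int\">2</param>"
                + "<param name=\"minConfidence\" type=\"double\">0.015</param>"
                + "<param name=\"class\" type=\"string\">org.apache.tika.dl.imagerec.DL4JVGG16Net</param>"
                + "<param name=\"modelType\" type=\"string\">VGG16</param>"
                + "<param name=\"serialize\" type=\"string\">yes</param>"
                + "</params></parser></parsers></properties>";
        Map<String, String> params =
                readParams(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(params);
    }
}
```

To check a saved file instead, pass a `FileInputStream` for its path to `readParams`.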
   
   To run it with the default configuration, build the project, change to the project's root
directory, and run the command below. It uses the Windows classpath separator `;`; on Linux
or macOS, use `:` instead.
   
   ```
   java -Xmx3G -cp ./tika-dl/target/tika-dl-1.15-SNAPSHOT-jar-with-dependencies.jar;tika-app/target/tika-app-1.15-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml tika-dl/src/test/resources/org/apache/tika/dl/imagerec/lion.jpg
   ```
   -Xmx3G is required because the VGG16 model needs a lot of memory to run.
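
Because the classpath separator differs between platforms, one way to avoid hard-coding it is to assemble the command with `java.io.File.pathSeparator`. This is only a sketch: the class `TikaDl4jCommand` and its methods are hypothetical, and the jar names, main class, config path, and image path are copied from the command above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: builds the command line shown above.
final class TikaDl4jCommand {

    /** Joins classpath entries with the given separator (";" on Windows, ":" elsewhere). */
    public static String classpath(String separator, String... jars) {
        return String.join(separator, jars);
    }

    /** Assembles the java invocation for the given separator, config file, and image. */
    public static List<String> build(String separator, String config, String image) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx3G"); // heap size taken from the message
        cmd.add("-cp");
        cmd.add(classpath(separator,
                "./tika-dl/target/tika-dl-1.15-SNAPSHOT-jar-with-dependencies.jar",
                "tika-app/target/tika-app-1.15-SNAPSHOT.jar"));
        cmd.add("org.apache.tika.cli.TikaCLI");
        cmd.add("--config=" + config);
        cmd.add(image);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(File.pathSeparator,
                "tika-dl/src/test/resources/org/apache/tika/dl/imagerec/dl4j-vgg16-config.xml",
                "tika-dl/src/test/resources/org/apache/tika/dl/imagerec/lion.jpg");
        // Print the command; to launch it, pass the list to new ProcessBuilder(cmd).inheritIO().start().
        System.out.println(String.join(" ", cmd));
    }
}
```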
   Observations:
   When loading the serialized model from disk:
   It requires only around 1200 MB of RAM to run.
   
   When the model is loaded from h5 files using helper functions:
   It requires 2500 MB of RAM to run the model (needed only once if serialization is
set to yes).
   
   When the recogniser runs, DL4J helper functions automatically download the model file
to .dl4j/trainedModels.
   To speed up later runs, once the model has been loaded from the original h5 files, it
is serialized and saved on disk at .dl4j/trainedModels/tikaPreprocessed, which significantly
reduces resource usage (especially memory consumption) for future loads.
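
The load-once-then-reuse behaviour described above follows a common caching pattern. The sketch below illustrates that pattern with plain Java serialization as a stand-in; it is not the code in this pull request and does not use DL4J's own model serialization. The class `SerializedModelCache`, the method `loadOrCreate`, and the file name `vgg16.bin` are hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical illustration of a "serialize after first load" cache; not the PR's code.
final class SerializedModelCache {

    /**
     * Returns the cached object if the cache file exists. Otherwise runs the
     * expensive loader once, writes the result to the cache file, and returns it.
     */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T loadOrCreate(Path cacheFile, Supplier<T> expensiveLoader) {
        try {
            if (Files.exists(cacheFile)) {
                // Fast path: reuse the copy saved by an earlier run.
                try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(cacheFile))) {
                    return (T) in.readObject();
                }
            }
            // Slow path: load from the original source, then save for future runs.
            T model = expensiveLoader.get();
            Path parent = cacheFile.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
                out.writeObject(model);
            }
            return model;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Cached model class not found", e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tikaPreprocessed");
        Path cache = dir.resolve("vgg16.bin"); // hypothetical cache file name
        // A float array stands in for the model weights.
        float[] first = loadOrCreate(cache, () -> new float[] {0.1f, 0.2f});
        // The second call must read the cache, so this loader should never run.
        float[] second = loadOrCreate(cache, () -> {
            throw new IllegalStateException("loader should not run again");
        });
        System.out.println(Arrays.equals(first, second));
    }
}
```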
   Issue Link:
   https://issues.apache.org/jira/browse/TIKA-2298
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> To improve object recognition parser so that it may work without external RESTful service setup
> -----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2298
>                 URL: https://issues.apache.org/jira/browse/TIKA-2298
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Avtar Singh
>              Labels: ObjectRecognitionParser
>             Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks.  All the popular neural networks were in
> C++ or python.  Since there was nothing that runs within JVM, we tried
> several ways to glue them to Tika (like CLI, JNI, gRPC, REST).
> However, this game is changing slowly now. Deeplearning4j, the most famous
> neural network library for JVM, now supports importing models that are
> pre-trained in python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (like installing native libraries or
> starting REST services). Reasons: it is easier to distribute and it cuts
> I/O time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
