lucene-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2116) TikaEntityProcessor does not find parser by default
Date Tue, 04 Jan 2011 02:03:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977072#action_12977072 ]

Chris A. Mattmann commented on SOLR-2116:
-----------------------------------------

Hey Lance,

bq. Speaking of Tika, have you ever seen a tikaconfig file? I can't find one anywhere on the
web or in the Tika source

In later versions of Tika (since 0.7, I think) we've gone to an all Service Provider Interface
(SPI) mechanism for Parser configuration and resource loading, obviating the need for a
tika-config.xml file:

https://issues.apache.org/jira/browse/TIKA-317
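
As a rough illustration of SPI-style discovery in general, here is a minimal sketch using Java's standard java.util.ServiceLoader. It is not Tika's code, and it is not a claim about the root cause of SOLR-2116. DocParser, ExplicitPdfParser, discover, and resolve are hypothetical names made up for this example. The sketch shows that providers are only found if a META-INF/services entry is visible on the classpath, and that an explicitly named implementation can serve as a fallback when none is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class SpiLookupSketch {

    /** Hypothetical stand-in for a parser interface; this is NOT Tika's Parser. */
    public interface DocParser {
        String name();
    }

    /** Hypothetical explicit choice, analogous to naming a parser class in a config. */
    public static class ExplicitPdfParser implements DocParser {
        @Override
        public String name() {
            return "explicit-pdf";
        }
    }

    /** Collects every DocParser registered via META-INF/services on the classpath. */
    public static List<DocParser> discover() {
        List<DocParser> found = new ArrayList<>();
        for (DocParser parser : ServiceLoader.load(DocParser.class)) {
            found.add(parser);
        }
        return found;
    }

    /** Returns the first discovered provider, or the explicit fallback if none was found. */
    public static DocParser resolve(DocParser explicitFallback) {
        List<DocParser> found = discover();
        return found.isEmpty() ? explicitFallback : found.get(0);
    }

    public static void main(String[] args) {
        System.out.println("discovered: " + discover().size());
        System.out.println("resolved: " + resolve(new ExplicitPdfParser()).name());
    }
}
```

Assuming no provider file for DocParser is registered on the classpath, discover() returns an empty list and resolve() falls back to the explicitly supplied instance.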

However, you still have the option of specifying and using one. See:

http://svn.apache.org/repos/asf/tika/tags/0.8/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java

You can find an example of the XML-based Tika config here:

http://svn.apache.org/repos/asf/tika/tags/0.6/tika-core/src/main/resources/org/apache/tika/

Part of the reason is also the ParseContext, which was introduced as another configuration
mechanism. See:

https://issues.apache.org/jira/browse/TIKA-275
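
To give a feel for what per-call configuration through a context object can look like, here is a minimal sketch of a type-keyed context. MiniContext and PdfOptions are hypothetical stand-ins written for this example. They are not Tika classes, and the real ParseContext API should be checked in the Tika javadoc.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextSketch {

    /** Hypothetical stand-in for a type-keyed context; this is NOT Tika's ParseContext. */
    public static class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        /** Stores a value under its type, replacing any earlier value for that type. */
        public <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the value stored for the type, or the given default if none was set. */
        public <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical option object that a caller might pass along with one parse call. */
    public static class PdfOptions {
        public final boolean extractAnnotations;

        public PdfOptions(boolean extractAnnotations) {
            this.extractAnnotations = extractAnnotations;
        }
    }

    /** Reads the option from the context, defaulting to false when the caller set nothing. */
    public static boolean effectiveExtractAnnotations(MiniContext context) {
        return context.get(PdfOptions.class, new PdfOptions(false)).extractAnnotations;
    }
}
```

The point of the design is that the caller configures each call in code by placing typed objects into the context, so no external XML file is needed for those settings.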

Cheers,
Chris




> TikaEntityProcessor does not find parser by default
> ---------------------------------------------------
>
>                 Key: SOLR-2116
>                 URL: https://issues.apache.org/jira/browse/SOLR-2116
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1, 4.0
>            Reporter: Lance Norskog
>         Attachments: pdflist-data-config.xml, pdflist.xml, SOLR-2116.patch
>
>
> The TikaEntityProcessor does not find the correct document parser by default.
> This is in a two-level DIH config file. I have attached pdflist-data-config.xml and pdflist.xml,
> the XML file supplying the list. To test this, you will need the current 3.x branch or 4.0 trunk.
> # Set up a Tika-enabled Solr
> # Copy any PDF file to /tmp/testfile.pdf
> # Copy pdflist-data-config.xml to your solr/conf
> # Add this snippet to your solrconfig.xml
> {code:xml}
> <requestHandler name="/pdflist"
>                 class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">pdflist-data-config.xml</str>
>   </lst>
> </requestHandler>
> {code}
> [http://localhost:8983/solr/pdflist?command=full-import] will make one document with
> the id and text fields populated. If you remove this line:
> {code}
>  parser="org.apache.tika.parser.pdf.PDFParser"
> {code}
> from the TikaEntityProcessor entity, the parser will not be found and you will get a
> document with the "id" field and nothing else.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.