lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <>
Subject [jira] Commented: (SOLR-1902) Tika no longer properly extracts content in Solr
Date Tue, 27 Jul 2010 22:09:17 GMT


Tommaso Teofili commented on SOLR-1902:

Hi all, I had the same issue David has, so I applied the patch (modifying the files one by one)
to a fresh Solr 1.4.1 checkout. With it, most of my PDFs are now indexed with their text
extracted (using the "example" Solr instance).
Within the apache-solr-1.4.1 release I replaced all the files inside apache-solr-1.4.1/dist
with the ones generated (inside the dist directory) by running 'ant dist' on the patched 1.4.1
source code. I also replaced the release war inside example/webapps with the generated (patched)
war (this last step was mandatory to avoid the NoSuchMethodError reported above). Then I ran
'java -jar start.jar' from the example dir and everything worked.
Note that I used the latest version of pdfbox, jempbox and fontbox (1.2.1).
I can attach the patch against the 1.4.1 code that I used.
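
The steps above can be sketched roughly as the following script. It is only an illustration under stated assumptions: the SRC_DIR, REL_DIR and WAR_NAME values and the solr.war target name are placeholders I picked for the example, so adjust them to your own layout. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Dry-run sketch of the build-and-replace steps described above.
# ASSUMPTIONS (not from the original comment): the directory names,
# the generated war file name, and the deployed war name (solr.war).
SRC_DIR="${SRC_DIR:-solr-1.4.1-patched}"        # hypothetical patched source checkout
REL_DIR="${REL_DIR:-apache-solr-1.4.1}"         # unpacked binary release
WAR_NAME="${WAR_NAME:-apache-solr-1.4.1.war}"   # hypothetical name of the built war
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

deploy_patched_solr() {
  # 1. Build the distribution artifacts from the patched source tree.
  run ant -f "$SRC_DIR/build.xml" dist

  # 2. Replace the files in the release's dist directory with the built ones.
  run cp -R "$SRC_DIR/dist/." "$REL_DIR/dist/"

  # 3. Replace the war used by the example instance with the patched war.
  run cp "$SRC_DIR/dist/$WAR_NAME" "$REL_DIR/example/webapps/solr.war"

  # 4. Start the example instance from the example directory.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ (cd $REL_DIR/example && java -jar start.jar)"
  else
    (cd "$REL_DIR/example" && java -jar start.jar)
  fi
}

deploy_patched_solr
```

Step 3 is the one that matters for the NoSuchMethodError: replacing only the dist files leaves the old war in example/webapps.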

> Tika no longer properly extracts content in Solr
> ------------------------------------------------
>                 Key: SOLR-1902
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 4.0
> See
> It appears that since the upgrade to Tika 0.7, Tika is now selecting an EmptyParser when
> uploading docs, which then outputs an empty XHTML representation.  Still, it's strange that
> the tests pass.

