lucene-solr-user mailing list archives

From Jayendra Patil <jayendra.patil....@gmail.com>
Subject Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Date Mon, 11 Apr 2011 04:02:58 GMT
The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches (for Solr
Cell and the DataImportHandler):

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.
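
A minimal sketch for checking whether a patched build really extracts archive contents, assuming Python 3 and a Solr instance at a base URL such as http://localhost:8080/solr. It rebuilds a small test archive like the solr1.zip described in the quoted thread, posts it as a raw stream with extractOnly=true so nothing is indexed, and reports which entries have no text in the response. The helper names and the text of doc2.txt are made up for illustration; only doc1.txt's "The quick brown fox" comes from the thread.

```python
import io
import urllib.parse
import urllib.request
import zipfile


def build_test_zip(files):
    """Return the bytes of a ZIP archive containing the given {name: text} files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def extract_url(base_url, **params):
    """Build an /update/extract URL with URL-encoded query parameters."""
    return base_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def missing_contents(extracted_text, files):
    """Return the names of files whose text does not appear in the extracted output."""
    return [name for name, text in files.items() if text not in extracted_text]


def extract_only(base_url, zip_bytes):
    """POST the archive as a raw stream and return the response body.

    Needs a running Solr; extractOnly=true asks for the extracted text
    to be returned instead of indexed.
    """
    url = extract_url(base_url, extractOnly="true", extractFormat="text")
    req = urllib.request.Request(
        url, data=zip_bytes, headers={"Content-Type": "application/zip"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

To use it, call extract_only("http://localhost:8080/solr", build_test_zip(files)) and pass the result to missing_contents with the same files dict. An empty list means the text of every entry came back. A list naming every file matches the symptom described below, where only the entry names are returned.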

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phanzel@nearinfinity.com> wrote:
> Hi Gary,
>
> I have been experiencing the same problem... Unable to extract content from
> archive file formats.  I just tried again with a clean install of Solr 3.1.0
> (using Tika 0.8) and continue to experience the same results.  Did you have
> any success with this problem with Solr 1.4.1 or 3.1.0 ?
>
> I'm using this curl command to send data to Solr.
> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
>   -H "application/octet-stream" -F "myfile=@data.zip"
>
> No problem extracting single rich text documents, but archive files only
> result in the file names within the archive being indexed. Am I missing
> something else in my configuration? Solr doesn't seem to be unpacking the
> archive files. Based on the email chain associated with your first message,
> some people have been able to get this functionality to work as desired.
>
> On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <gt@inovem.com> wrote:
>
>> Can anyone shed any light on this, and whether it could be a config issue?
>>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
>>
>> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
>> the ExtractingRequestHandler, I get the following log entry (formatted for
>> ease of reading) :
>>
>> SolrInputDocument[
>>    {
>>    ignored_meta=ignored_meta(1.0)={
>>        [stream_source_info, file, stream_content_type, application/octet-stream,
>>         stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip]
>>        },
>>    ignored_=ignored_(1.0)={
>>        [package-entry, package-entry]
>>        },
>>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>>    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
>>    ignored_stream_size=ignored_stream_size(1.0)={260},
>>    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
>>    ignored_content_type=ignored_content_type(1.0)={application/zip},
>>    docid=docid(1.0)={74},
>>    type=type(1.0)={5},
>>    text=text(1.0)={                  doc2.txt    doc1.txt    }
>>    }
>> ]
>>
>> So, the data coming back from Tika when parsing a ZIP file does not include
>> the file contents, only the names of the files contained therein.  I've
>> tried forcing stream.type=application/zip in the CURL string, but that makes
>> no difference.  If I specify an invalid stream.type then I get an exception
>> response, so I know it's being used.
>>
>> When I send one of those txt files individually to the
>> ExtractingRequestHandler, I get:
>>
>> SolrInputDocument[
>>    {
>>    ignored_meta=ignored_meta(1.0)={
>>        [stream_source_info, file, stream_content_type, text/plain,
>>         stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
>>        },
>>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>>    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
>>    ignored_stream_size=ignored_stream_size(1.0)={30},
>>    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
>>    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
>>    docid=docid(1.0)={74},
>>    type=type(1.0)={5},
>>    text=text(1.0)={                The quick brown fox  }
>>    }
>> ]
>>
>> and we see the file contents in the "text" field.
>>
>> I'm using the following requestHandler definition in solrconfig.xml:
>>
>> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
>> <requestHandler name="/update/extract"
>> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
>> startup="lazy">
>> <lst name="defaults">
>> <!-- All the main content goes into "text"... if you need to return
>>           the extracted text or do highlighting, use a stored field. -->
>> <str name="fmap.content">text</str>
>> <str name="lowernames">true</str>
>> <str name="uprefix">ignored_</str>
>>
>> <!-- capture link hrefs but ignore div attributes -->
>> <str name="captureAttr">true</str>
>> <str name="fmap.a">links</str>
>> <str name="fmap.div">ignored_</str>
>> </lst>
>> </requestHandler>
>>
>> Is there any further debug or diagnostic I can get out of Tika to help me
>> work out why it's only returning the file names and not the file contents
>> when parsing a ZIP file?
>>
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>>
>> On 25/01/2011 16:48, Jayendra Patil wrote:
>>
>>> Hi Gary,
>>>
>>> The latest Solr Trunk was able to extract and index the contents of the
>>> zip file using the ExtractingRequestHandler.
>>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
>>> worked pretty well.
>>>
>>> Tested again with a sample URL and it works fine:
>>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
>>>
>>> You would probably need to check the Tika jars and
>>> the apache-solr-cell-4.0-dev.jar used for rich document indexing.
>>>
>>> Regards,
>>> Jayendra
>>>
>>>
>>
>
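
The oldest quoted message suggests checking which Tika and Solr Cell jars are actually in use. A rough sketch of that check, assuming Python 3 and that the jars follow the usual name-version.jar pattern (for example tika-core-0.8.jar or apache-solr-cell-4.0-dev.jar). The function name and the directory you pass in are hypothetical; where the jars live depends on the install.

```python
import re
from pathlib import Path

# Matches e.g. tika-core-0.8.jar, tika-parsers-0.8.jar, apache-solr-cell-4.0-dev.jar
JAR_PATTERN = re.compile(r"^(tika-[a-z]+|apache-solr-cell)-(.+)\.jar$")


def extraction_jar_versions(lib_dir):
    """Map each Tika / Solr Cell jar found under lib_dir to its version string."""
    versions = {}
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        match = JAR_PATTERN.match(jar.name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions
```

Running this against the directory the servlet container loads Solr's libraries from shows quickly whether a snapshot or release Tika is on the classpath. If two versions of the same jar are present, only the last one found is reported, so duplicates need a manual look.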
