lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: Solr 6.4. Can't index MS Visio vsdx files
Date Tue, 04 Jul 2017 13:25:51 GMT
On 11/04/2017 20:48, Allison, Timothy B. wrote:
> It depends.  We've been trying to make parsers more, erm, flexible, but there are some
> problems from which we cannot recover.
>
> Tl;dr there isn't a short answer.  :(
>
> My sense is that DIH/ExtractingRequestHandler is intended to get people up and running
> with Solr easily but it is not really a great idea for production.  See Erick's gem:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/

+1. Tika extraction should happen *outside* Solr in production. A 
colleague even wrote a simple wrapper for Tika to help build this sort 
of thing: https://github.com/mattflax/dropwizard-tika-server

Charlie
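
As a rough illustration of extracting outside Solr, here is a minimal Python sketch. It assumes a Tika server listening on its default port 9998 and a Solr core named "docs" with "id" and "content" fields; the core name, the field names and the helper names are hypothetical, not taken from this thread. A file that fails to parse is recorded and skipped rather than stopping the run.

```python
import json
import urllib.request

# Assumptions: a Tika server on its default port and a Solr core called "docs".
TIKA_URL = "http://localhost:9998/tika"
SOLR_URL = "http://localhost:8983/solr/docs/update?commit=true"


def extract_text(path, timeout=60):
    """Send one file to the Tika server and return the extracted plain text."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        TIKA_URL, data=body, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def post_to_solr(docs, timeout=60):
    """Post a list of documents to Solr's JSON update handler."""
    req = urllib.request.Request(
        SOLR_URL,
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def index_files(paths, extract=extract_text, post=post_to_solr):
    """Extract each file; skip and record failures; index whatever succeeded."""
    docs, failed = [], []
    for path in paths:
        try:
            # "id" and "content" are hypothetical schema fields.
            docs.append({"id": path, "content": extract(path)})
        except Exception as exc:  # a bad file must not stop the batch
            failed.append((path, repr(exc)))
    if docs:
        post(docs)
    return len(docs), failed
```

The extract and post callables are parameters so that the skip-and-continue logic can be exercised without a running Tika server or Solr instance.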


>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause the ingesting process
> to crash.  At most, it should fail at the file level and not cause greater havoc.  In practice,
> if you're processing millions of files from the wild, you'll run into bad behavior and need
> to defend against permanent hangs, OOM errors and memory leaks.
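
One common way to defend against hangs and out-of-memory errors is to run each extraction in a separate child process with a timeout and a capped heap, so that a bad file takes down only the child. The Python sketch below is a minimal version of that idea; the jar location, the heap size and the function names are assumptions for illustration.

```python
import subprocess


def run_isolated(cmd, timeout_s):
    """Run one extraction command in a child process.

    Returns (status, output) where status is "ok", "failed" or "timeout".
    On timeout, subprocess.run kills the child before raising.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout", ""
    if proc.returncode != 0:
        return "failed", proc.stderr.decode("utf-8", errors="replace")
    return "ok", proc.stdout.decode("utf-8", errors="replace")


def tika_command(path, jar="tika-app.jar", max_heap="512m"):
    """Build a tika-app command line; the jar path and heap cap are assumptions."""
    return ["java", "-Xmx" + max_heap, "-jar", jar, "--text", path]
```

A caller would do something like run_isolated(tika_command("drawing.vsdx"), 120) per file and log anything that does not come back as "ok". Starting a JVM per file is slow, so this trades throughput for isolation.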
>
> Also, at the least, if there's an exception with an embedded file, Tika should catch
> it and keep going with the rest of the file.  If this doesn't happen let us know!  We are
> aware that some types of embedded file stream problems were causing parse failures on the
> entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't let them percolate up
> through the parent file (they're reported in the metadata though).
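
Since caught embedded-file exceptions end up in the metadata, an indexing client can check for them after a parse that otherwise succeeded. The Python sketch below assumes the metadata is available as a dictionary and that the relevant keys start with "X-TIKA:EXCEPTION:"; that prefix is an assumption worth checking against the Tika version in use, and the function names are hypothetical.

```python
# Assumption: Tika records caught exceptions under keys with this prefix.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def embedded_problems(metadata):
    """Return the subset of metadata entries that record a caught exception."""
    return {k: v for k, v in metadata.items() if k.startswith(EXCEPTION_PREFIX)}


def summarize(path, metadata):
    """Build a one-line log message describing how the parse went."""
    problems = embedded_problems(metadata)
    if not problems:
        return f"{path}: parsed cleanly"
    keys = ", ".join(sorted(problems))
    return f"{path}: parsed with {len(problems)} recorded exception(s): {keys}"
```

Logging these per file makes it possible to index the text that was recovered while still keeping track of which documents were only partially parsed.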
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I thought we used to catch
> those in docx and log them.  I haven't been able to track this down, though.  I can look more
> if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name 'PolylineTo'",
> this problem might go away if we implemented a pure SAX parser for vsdx.  We just did this
> for docx and pptx (coming in 1.15) and these are more robust to variation because they aren't
> requiring a match with the ooxml schema.  I haven't looked much at vsdx, but that _might_
> help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require contributions.
> However, I agree that POI shouldn't throw a Runtime exception.  Perhaps open an issue in
> POI, or maybe we should catch this special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team _might_ be able
> to modify the parser to ignore a stream if there's an exception, but that's often a sign that
> something needs to be fixed with the parser.  In short, the solution will come from POI.
>
> Best,
>
>              Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gytmkc@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Is there any possibility of ignoring parsing errors and continuing indexing?
> Right now Solr/Tika stops parsing the whole document if it hits any exception.
>
> On Apr 11, 2017 19:51, "Allison, Timothy B." <tallison@mitre.org> wrote:
>
>> You might want to drop a note to the dev or user's list on Apache POI.
>>
>> I'm not extremely familiar with the vsd(x) portion of our code base.
>>
>> The first item ("PolylineTo") may be caused by a mismatch between your
>> doc and the ooxml spec.
>>
>> The second item appears to be an unsupported feature.
>>
>> The third item may be an area for improvement within our codebase...I
>> can't tell just from the stacktrace.
>>
>> You'll probably get more helpful answers over on POI.  Sorry, I can't
>> help with this...
>>
>> Best,
>>
>>            Tim
>>
>> P.S.
>>>  3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
>>
>> You shouldn't need both. ooxml-schemas-1.3.jar should be a superset
>> of poi-ooxml-schemas-3.15.jar
>>
>>
>>
>
>
>


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
