lucene-dev mailing list archives

From Germán Cáseres (JIRA) <j...@apache.org>
Subject [jira] [Created] (SOLR-7916) ExtractingDocumentLoader does not initialize context with Parser.class key and DelegatingParser needs that key.
Date Wed, 12 Aug 2015 15:44:45 GMT
Germán Cáseres created SOLR-7916:
------------------------------------

             Summary: ExtractingDocumentLoader does not initialize context with Parser.class
key and DelegatingParser needs that key.
                 Key: SOLR-7916
                 URL: https://issues.apache.org/jira/browse/SOLR-7916
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 5.1
            Reporter: Germán Cáseres


Tika's PDFParser works well with Solr, except when you need to extract metadata from images
embedded in a PDF.

When PDFParser finds an embedded image, it runs a DelegatingParser on that image.
The problem is that DelegatingParser expects the ParseContext to contain an entry for the
Parser.class key. If that key is not present, it falls back to EmptyParser and the inline image
metadata is not extracted.
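
To illustrate the lookup described above, here is a minimal stand-alone sketch. It does not use Tika: SketchParser, SketchEmptyParser, SketchImageParser, SketchContext and DelegationSketch are hypothetical stand-ins that only model the "read the parser key from the context, otherwise fall back to an empty parser" step, so the real classes may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins, NOT the real Tika classes: they only model the
// "look up the parser key in the context, else use an empty parser" step.
interface SketchParser {
    String parse(String input);
}

// Stand-in for an empty parser: accepts anything, extracts nothing.
final class SketchEmptyParser implements SketchParser {
    static final SketchEmptyParser INSTANCE = new SketchEmptyParser();

    public String parse(String input) {
        return "";
    }
}

// Stand-in for a parser that actually produces output for an image.
final class SketchImageParser implements SketchParser {
    public String parse(String input) {
        return "metadata(" + input + ")";
    }
}

// Stand-in for a parse context: a type-keyed map with a defaulting getter.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

final class DelegationSketch {
    // Models the delegation step: use the parser registered under
    // SketchParser.class, or the empty parser when none is registered.
    static String parseEmbedded(String image, SketchContext context) {
        SketchParser delegate =
                context.get(SketchParser.class, SketchEmptyParser.INSTANCE);
        return delegate.parse(image);
    }

    public static void main(String[] args) {
        // Context without the key: the empty parser is used, nothing is extracted.
        SketchContext bare = new SketchContext();
        System.out.println("[" + parseEmbedded("img1", bare) + "]");

        // Context seeded with a parser: the embedded image is actually parsed.
        SketchContext seeded = new SketchContext();
        seeded.set(SketchParser.class, new SketchImageParser());
        System.out.println("[" + parseEmbedded("img1", seeded) + "]");
    }
}
```

In this model the caller that creates the context decides whether embedded content gets parsed at all, which is why the missing key silently yields no image metadata rather than an error.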

I extracted metadata using standalone Tika with Tesseract OCR and it works fine (the text
from the PDF and from the OCRed inline images is extracted), but when I do the same from Solr,
only the text from the PDF is extracted.

I've configured PDFParser.properties correctly, with "extractInlineImages true".

Also, I tried overriding the PDFParser with a custom one and adding the following line:

{code}
context.set(Parser.class, new AutoDetectParser());
{code}

That worked, but I don't think it is right to modify the Tika PDFParser, since it works
fine when run outside Solr.

Maybe the context should be initialized properly in the Solr class ExtractingDocumentLoader.

Sorry for my bad English. I hope this information is useful; please tell me if I'm doing
something wrong.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


