tika-dev mailing list archives

From Andrew Skiba <and...@tikalk.com>
Subject Re: Patch: self-contained HTML using Data URI
Date Thu, 10 Jul 2014 15:13:20 GMT
Hi Nick,

It took some time, but I glued it all together, so it now works without
modifying the Tika sources, using only a custom handler, extractor and parser.
It works with WordExtractor, although it looks like a dirty hack. As I
could not override the behavior of WordExtractor, in the handler I ignore
<img> elements whose src is "embedded:xxx", and let through only images
whose src is a data URI.
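
A minimal sketch of the handler-side filtering described above, using only the
JDK's SAX classes. The class name DataUriImgFilter and the helper toDataUri are
hypothetical and are not part of Tika or of the actual patch. The sketch assumes
the parser emits the image as an <img> element with a plain "src" attribute, and
it drops only the "embedded:" placeholders while forwarding everything else.

```java
import java.util.Base64;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical SAX filter (not a Tika class): suppresses
 * <img src="embedded:..."> events and forwards all other events unchanged.
 */
class DataUriImgFilter extends XMLFilterImpl {
    // True while the <img> that was just opened has been suppressed,
    // so that its matching end tag is suppressed as well.
    private boolean skippingImg = false;

    DataUriImgFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    private static boolean isImg(String localName, String qName) {
        return "img".equals(localName) || "img".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (isImg(localName, qName)) {
            String src = atts.getValue("src");
            if (src != null && src.startsWith("embedded:")) {
                skippingImg = true;   // drop the placeholder image
                return;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skippingImg && isImg(localName, qName)) {
            skippingImg = false;      // drop the end tag of the suppressed <img>
            return;
        }
        super.endElement(uri, localName, qName);
    }

    /** Builds a base64 data URI for the given MIME type and raw bytes. */
    static String toDataUri(String mimeType, byte[] bytes) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(bytes);
    }
}
```

In this sketch the custom extractor would be the component that emits the second
<img>, with a src produced by something like toDataUri, which the filter then
passes through untouched.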

The problem is that it does not work at all with OOXMLParser, PDFParser, and
probably others. In the code of these parsers I could not find any recursive
handling of embedded images similar to the call to
handleEmbeddedResource in WordExtractor.handlePictureCharacterRun.

So my questions are:

1. Do my handler, parser and extractor do what you meant?
2. Did I miss the call to ParsingEmbeddedDocumentExtractor in OOXMLParser?
I found the img-generating code in XWPFWordExtractorDecorator, but it sits
deep in a call tree of private functions, and XWPFWordExtractorDecorator is
pretty much hardwired to OOXMLParser via OOXMLExtractorFactory, so I did
not see an easy way to inject my code.

Thank you very much.


On Wed, Jun 25, 2014 at 12:39 PM, Nick Burch <apache@gagravarr.org> wrote:

> On Wed, 25 Jun 2014, Andrew Skiba wrote:
>> Let me check I understand you right. WordExtractor will continue to create
>> <img src="embedded:filename.jpg"/>
> Yes, as will (should..) the other parsers which find embedded resources
>  and call the ImageParser once for every file name.
> No. It'll call your code, as you'll have registered your code as the
> EmbeddedDocumentExtractor to call for embedded resources like images.
> (If there isn't one, then a ParsingEmbeddedDocumentExtractor is used,
> which calls the default parser, which is how it ends up in ImageParser if
> you're recursing)
> Nick
