tika-dev mailing list archives

From Andrew Skiba <and...@tikalk.com>
Subject Re: Patch: self-contained HTML using Data URI
Date Mon, 14 Jul 2014 15:10:01 GMT
Nobody has replied for 4 days. I see the context of the message was lost; it is
about https://issues.apache.org/jira/browse/TIKA-1344

On Thu, Jul 10, 2014 at 6:13 PM, Andrew Skiba <andrew@tikalk.com> wrote:

> Hi Nick,
>
> Took some time, but I glued it all together, so now it works without
> modifying Tika sources, only by using a custom handler, extractor and parser.
> It works with WordExtractor, although it looks like a dirty hack. As I
> could not override the behavior of WordExtractor, in the handler I ignore
> <img> elements if the src is "embedded:xxx", and let through only images
> whose src is a data URI.
>
> The problem is that it does not work at all with OOXMLParser, PDFParser, and
> probably others. In the code of these parsers I could not find recursive
> handling of the embedded images, similar to the call to
> handleEmbeddedResource in WordExtractor.handlePictureCharacterRun.
>
> So my questions are:
>
> 1. Do my handler, parser and extractor do what you meant?
> 2. Did I miss the call to ParsingEmbeddedDocumentExtractor in OOXMLParser?
> I found img-generating code in XWPFWordExtractorDecorator, but the code is
> deep in a call tree of private functions, and XWPFWordExtractorDecorator is
> pretty much hardwired to OOXMLParser via OOXMLExtractorFactory, so I did
> not see an easy way to inject my code.
>
> Thank you very much.
>
> Andrew.
>
>
> On Wed, Jun 25, 2014 at 12:39 PM, Nick Burch <apache@gagravarr.org> wrote:
>
>> On Wed, 25 Jun 2014, Andrew Skiba wrote:
>>
>>> Let me check I understand you right. WordExtractor will continue to
>>> create <img src="embedded:filename.jpg"/>
>>
>> Yes, as will (should..) the other parsers which find embedded resources
>>
>>> and call the ImageParser once for every file name.
>>
>> No. It'll call your code, as you'll have registered your code as the
>> EmbeddedDocumentExtractor to call for embedded resources like images.
>>
>> (If there isn't one, then a ParsingEmbeddedDocumentExtractor is used,
>> which calls the default parser, which is how it ends up in ImageParser if
>> you're recursing)
>>
>> Nick
>>
>
>
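
Nick's reply says the registered EmbeddedDocumentExtractor is handed each
embedded resource. A custom one could turn the image bytes into a data URI
at that point. The sketch below shows only that encoding step, using the JDK
alone; the class and method names (DataUriSketch, toDataUri) are
hypothetical, and the Tika-specific wiring is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataUriSketch {

    /**
     * Builds a base64 data URI (RFC 2397 style) from raw bytes.
     * The MIME type is supplied by the caller; this sketch does not detect it.
     */
    public static String toDataUri(byte[] data, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Stand-in bytes instead of a real image, just to show the shape.
        byte[] fake = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toDataUri(fake, "image/png"));
        // prints: data:image/png;base64,aGVsbG8=
    }
}
```

The resulting string could then be written as the src attribute of an <img>
element, which is what makes the generated HTML self-contained.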

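The handler workaround described in the Jul 10 message (ignore <img> elements
whose src starts with "embedded:" and pass the rest on) can be sketched as a
plain SAX filter. This is not Andrew's actual handler: the class name
EmbeddedImgFilter and its helper are hypothetical, it uses only the JDK's
org.xml.sax classes, and unlike the handler described in the message it does
not restrict images to data URIs only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class EmbeddedImgFilter extends XMLFilterImpl {

    // For each open element, remembers whether its start tag was suppressed,
    // so the matching end tag can be suppressed too.
    private final Deque<Boolean> dropped = new ArrayDeque<>();

    public EmbeddedImgFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    /** True for an img element whose src uses the "embedded:" scheme. */
    public static boolean isEmbeddedImg(String localName, String qName, Attributes atts) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String src = atts.getValue("src");
        return "img".equals(name) && src != null && src.startsWith("embedded:");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        boolean drop = isEmbeddedImg(localName, qName, atts);
        dropped.push(drop);
        if (!drop) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        boolean drop = !dropped.isEmpty() && dropped.pop();
        if (!drop) {
            super.endElement(uri, localName, qName);
        }
    }

    // Small helper for the demo: an attribute list holding only src.
    private static Attributes src(String value) {
        AttributesImpl a = new AttributesImpl();
        a.addAttribute("", "src", "src", "CDATA", value);
        return a;
    }

    public static void main(String[] args) throws SAXException {
        List<String> kept = new ArrayList<>();
        EmbeddedImgFilter filter = new EmbeddedImgFilter(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                kept.add(atts.getValue("src"));
            }
        });
        filter.startElement("", "img", "img", src("embedded:image1.jpg"));
        filter.endElement("", "img", "img");
        filter.startElement("", "img", "img", src("data:image/png;base64,aGVsbG8="));
        filter.endElement("", "img", "img");
        System.out.println(kept); // only the data URI image reaches the downstream handler
    }
}
```

Wrapping the downstream handler this way leaves the parser untouched, which
matches the constraint in the message that WordExtractor's behavior could not
be overridden.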