tika-dev mailing list archives

From Andrew Skiba <and...@tikalk.com>
Subject Re: Patch: self-contained HTML using Data URI
Date Wed, 25 Jun 2014 09:02:55 GMT
Nick, thanks for the reply.

Let me check that I understand you correctly. WordExtractor will continue to
create <img src="embedded:filename.jpg"/> and call the ImageParser once for
every file name. But the ImageParser will save the image contents somewhere
(in the metadata?), and when XHTMLContentHandler.startElement is called on
"img" by the existing WordParser, it will replace the image src with the data
URI.

Is that what you meant? If yes, how do I change the configuration so that
the customized content handler and ImageParser are used instead of the
current ones?
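
To make the handler half of this concrete, here is a minimal sketch using
only the JDK's SAX classes and java.util.Base64. DataUriImgHandler, addImage
and resolve are hypothetical names, not part of Tika, and the sketch assumes
the image bytes have already been registered by the time the img element
comes through (whether every parser guarantees that ordering is not
confirmed here). If no image is registered under the name, the src is left
unchanged.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical decorator: rewrites img src="embedded:NAME" into a data URI. */
class DataUriImgHandler extends XMLFilterImpl {
    private static final String PREFIX = "embedded:";

    // Image name -> ready-made data URI, filled in by whatever captures the images.
    private final Map<String, String> dataUris = new HashMap<>();

    DataUriImgHandler(ContentHandler downstream) {
        setContentHandler(downstream); // all events are forwarded to this handler
    }

    /** Registers captured image bytes under the name used in the embedded: reference. */
    void addImage(String name, String mimeType, byte[] bytes) {
        String encoded = Base64.getEncoder().encodeToString(bytes);
        dataUris.put(name, "data:" + mimeType + ";base64," + encoded);
    }

    /** Returns the data URI for an embedded: reference, or src unchanged if unknown. */
    String resolve(String src) {
        if (src == null || !src.startsWith(PREFIX)) {
            return src;
        }
        String dataUri = dataUris.get(src.substring(PREFIX.length()));
        return dataUri != null ? dataUri : src;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        Attributes out = atts;
        if ("img".equals(localName) || "img".equals(qName)) {
            int i = atts.getIndex("src");            // lookup by qualified name
            if (i < 0) {
                i = atts.getIndex("", "src");        // namespace-aware fallback
            }
            if (i >= 0) {
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(i, resolve(atts.getValue(i)));
                out = copy;
            }
        }
        super.startElement(uri, localName, qName, out);
    }
}
```

The decorator wraps whatever ContentHandler is passed to the parser, so it
does not depend on which parser emitted the img element.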

Thank you very much.

Andrew.


On Tue, Jun 24, 2014 at 7:45 PM, Nick Burch <apache@gagravarr.org> wrote:

> On Tue, 24 Jun 2014, Andrew Skiba wrote:
>
>> I started with org.apache.tika.parser.microsoft.WordExtractor and
>> immediately saw that it already makes a recursive call to the
>> org.apache.tika.parser.image.ImageParser. But ImageParser currently only
>> enriches metadata, and does not create <img> element itself. This is done
>> in the WordExtractor and respective handlers for types other than MS Word.
>>
>
> I would've thought it would only trigger ImageParser if you set the
> AutoDetectParser on the parse context, did you?
>
> My idea was that you'd have a content handler + recursing parser class
> pair: the handler would re-write the img tag when it came through, and the
> recursing parser would capture the image when it is triggered, to get the
> image data suitable for the re-write. (This is largely what the Alfresco
> class I suggested you look at does.)
>
> You shouldn't be changing anything in the Word Parser itself, you want to
> be writing something that applies equally to all parsers.
>
> (It might be that you find that one parser is being non-standard about how
> it reports embedded images, in which case you'll need to fix that to follow
> the others, but ideally you shouldn't be touching the built in parsers
> beyond that)
>
> Nick
>
