tika-dev mailing list archives

From Nick Burch <apa...@gagravarr.org>
Subject Re: Patch: self-contained HTML using Data URI
Date Tue, 24 Jun 2014 16:45:49 GMT
On Tue, 24 Jun 2014, Andrew Skiba wrote:
> I started with org.apache.tika.parser.microsoft.WordExtractor and 
> immediately saw that it already makes a recursive call to the 
> org.apache.tika.parser.image.ImageParser. But ImageParser currently only 
> enriches metadata, and does not create <img> element itself. This is 
> done in the WordExtractor and respective handlers for types other, than 
> MS Word.

I would've thought it would only trigger ImageParser if you set the 
AutoDetectParser on the parse context. Did you?

My idea was that you'd have a content handler + recursing parser class 
pair: the handler would rewrite the img tag when it came through, and the 
recursing parser would capture each embedded image when it is triggered, 
so the image data is available for the rewrite. (This is largely what the 
Alfresco class that I suggested you look at does)
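
A rough sketch of the handler half of that pair, to make the idea concrete. 
It uses only the SAX classes from the JDK. The names CapturedImages and 
DataUriImgHandler are hypothetical, and the "embedded:" prefix on the src 
attribute is an assumption about how the parsers report embedded images. 
It also assumes the recursing parser has already stored the image by the 
time the img tag comes through; if not, the tag is passed on unchanged.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical store that the recursing parser would fill in as it
// meets each embedded image (name -> MIME type and raw bytes).
class CapturedImages {
    private final Map<String, String> types = new HashMap<>();
    private final Map<String, byte[]> data = new HashMap<>();

    void put(String name, String mimeType, byte[] bytes) {
        types.put(name, mimeType);
        data.put(name, bytes);
    }

    // Returns a data URI for the named image, or null if it was not captured.
    String toDataUri(String name) {
        byte[] bytes = data.get(name);
        if (bytes == null) {
            return null;
        }
        return "data:" + types.get(name) + ";base64,"
                + Base64.getEncoder().encodeToString(bytes);
    }
}

// Hypothetical handler: forwards all SAX events to the downstream handler,
// rewriting <img src="embedded:NAME"> to a data URI on the way through.
class DataUriImgHandler extends XMLFilterImpl {
    private static final String PREFIX = "embedded:"; // assumed src convention
    private final CapturedImages images;

    DataUriImgHandler(ContentHandler downstream, CapturedImages images) {
        setContentHandler(downstream);
        this.images = images;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            int i = atts.getIndex("src");
            String src = (i >= 0) ? atts.getValue(i) : null;
            if (src != null && src.startsWith(PREFIX)) {
                String dataUri = images.toDataUri(src.substring(PREFIX.length()));
                if (dataUri != null) {
                    // Copy the attributes so the caller's object is not modified.
                    AttributesImpl rewritten = new AttributesImpl(atts);
                    rewritten.setValue(i, dataUri);
                    atts = rewritten;
                }
            }
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```

Nothing in this is specific to Word: the handler just wraps whatever 
content handler you'd normally pass to the parser, so it applies to any 
parser that reports its images the same way.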

You shouldn't be changing anything in the Word Parser itself, you want to 
be writing something that applies equally to all parsers.

(It might be that you find that one parser is being non-standard about how 
it reports embedded images, in which case you'll need to fix that to 
follow the others, but ideally you shouldn't be touching the built-in 
parsers beyond that)

Nick
