tika-dev mailing list archives

From Andrew Skiba <and...@tikalk.com>
Subject Re: Patch: self-contained HTML using Data URI
Date Tue, 24 Jun 2014 16:19:59 GMT
Hello again.

I created an issue, https://issues.apache.org/jira/browse/TIKA-1344, for this
patch and was advised to implement it in a content handler. So I
studied the idea behind RecursiveMetadata and started looking at how to move
my change into a handler, as Nick advised.

I started with org.apache.tika.parser.microsoft.WordExtractor and
immediately saw that it already makes a recursive call to
org.apache.tika.parser.image.ImageParser. But ImageParser currently
only enriches the metadata and does not create the <img> element itself. That
is done in WordExtractor and in the respective handlers for types other than MS

So my question is: do I have to move the creation of <img> to ImageParser
and remove it from WordExtractor?

Thank you.

On Wed, Jun 18, 2014 at 5:16 PM, Andrew Skiba <andrew@tikalk.com> wrote:

> Hi,
> In the current code, the images from Word documents are referenced by
> "embedded:xxx" links in the generated HTML. This causes browsers to
> display an "x" icon instead of the image.
> The proposed patch encodes the images as Data URIs if the
> -Dtika.parsers.urlimages system property is set.
> http://en.wikipedia.org/wiki/Data_URI_scheme
> So the default behavior is the same, but users of the library can
> optionally generate self-contained HTML with correct images.
> Thank you,
> Andrew.
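
For readers unfamiliar with the Data URI scheme mentioned in the quoted message, the idea can be sketched roughly as below. This is not the code from the TIKA-1344 patch: the class name DataUriExample and the methods toDataUri and imgSrc are hypothetical and exist only for this illustration. It also assumes that the mere presence of the tika.parsers.urlimages property turns the feature on, as the quoted message suggests.

```java
import java.util.Base64;

// Hypothetical helper, not part of Tika; an illustration of the Data URI idea only.
final class DataUriExample {

    // Name of the system property mentioned in the message above.
    static final String PROPERTY = "tika.parsers.urlimages";

    // Builds "data:<mime type>;base64,<payload>" from raw image bytes.
    static String toDataUri(String mimeType, byte[] data) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    // Chooses the value of the src attribute of an <img> element.
    // Assumption: the property only needs to be present (any value) to
    // enable inlining; otherwise the "embedded:" link is kept as before.
    static String imgSrc(String embeddedName, String mimeType, byte[] data) {
        if (System.getProperty(PROPERTY) != null) {
            return toDataUri(mimeType, data);
        }
        return "embedded:" + embeddedName;
    }
}
```

With the property unset, imgSrc("image1.png", ...) returns the "embedded:image1.png" link, so the default output is unchanged. With the property set, the image bytes travel inside the HTML itself, at the cost of roughly a third more size from the Base64 encoding.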
