tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: HTML to PDF conversion
Date Wed, 16 Oct 2019 16:07:32 GMT
+1 to Ken’s earlier point about maintenance. Note Tika wouldn’t even build
in Germany, and we only discovered that because of inviting Tilman. :D We
have a huge amount of maintenance already...

Check out the incubating Daffodil project, which aims to convert files to XML,
validate them, and then serialize them back to the original format.

I do see a use for transform(), and if we could use XHTML as an
intermediary, then...maybe, but my inclination is with Ken.

On Wed, Oct 16, 2019 at 11:50 AM Ken Krugler <kkrugler@apache.org> wrote:

> I can see the attraction of one API to convert XHTML to various formats.
>
> Though very quickly that simple API would become complex, as each target
> format has its own conversion options.
>
> And if successful, we’d pull in even more 3rd party jars to handle that
> conversion.
>
> Wonder if there’s a need for a new project called “Akit”, which focuses on
> XHTML -> various formats :)
>
> — Ken
>
> > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin <sberyozkin@gmail.com> wrote:
> >
> > Ken, thanks for the feedback, I meant to reply to your comments,
> >
> > I suppose I really meant Tika offering a uniform API to create some simple
> > structured PDF/etc. files.
> > ContentCreator creator = ContentCreator.get("PDF");
> > creator.addTitle("Introduction to Tika");
> > creator.addText("");
> > creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
> > creator.addAttachment(someImage);
> > creator.complete();
> >
> > It would be consistent with the Tika approach on the read side.
> >
> > Cheers, Sergey
> > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler <kkrugler@apache.org> wrote:
> >
> >> If you’re suggesting ways to make it easier to use something like
> >> YaHPConverter with Tika, definitely yes.
> >>
> >> If you’re talking about integrating this functionality…my personal view
> >> is no.
> >>
> >> I think Tika should focus on extracting content from documents, versus
> >> format transformations.
> >>
> >> Tika is an attractive location for functionality like this, since it sits
> >> in the middle of a lot of data processing pipelines, but I worry about a
> >> bloated code base, with corresponding challenges in maintenance and
> >> support.
> >>
> >> Regards,
> >>
> >> — Ken
> >>
> >>
> >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin <sberyozkin@gmail.com> wrote:
> >>>
> >>> Hi All
> >>>
> >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
> >>> colleagues pointed to
> >>>
> >>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> >>>
> >>> Does it make sense for Tika to offer something related to text-to-PDF
> >>> conversion (for a start, something on top of that transformer), and then
> >>> maybe even for other formats?
> >>>
> >>> Sergey
> >>
> >> --------------------------
> >> Ken Krugler
> >> http://www.scaleunlimited.com
> >> custom big data solutions & training
> >> Hadoop, Cascading, Cassandra & Solr
> >>
> >>
>

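For readers following the thread, the uniform API Sergey describes above, combined with the idea of XHTML as an intermediary, could look roughly like the sketch below. This is only an illustration under stated assumptions: ContentCreator is a hypothetical interface taken from the quoted example and is not part of Tika, XhtmlContentCreator is an invented name, and the ContentCreator.get("PDF") factory and addAttachment() calls are left out. A PDF or other backend would then consume the XHTML string that complete() returns.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface modelled on the calls quoted in the thread; not a Tika API.
interface ContentCreator {
    void addTitle(String title);
    void addText(String text);
    // Each map entry is one column: header -> cell values, in insertion order.
    void addTable(String name, Map<String, List<String>> columns);
    String complete();
}

// Hypothetical implementation that writes the XHTML intermediary as a string.
final class XhtmlContentCreator implements ContentCreator {
    private final StringBuilder body = new StringBuilder();
    private String title = "";

    // Escape the characters that are significant in XML text and attributes.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public void addTitle(String t) {
        title = t;
        body.append("<h1>").append(escape(t)).append("</h1>\n");
    }

    @Override
    public void addText(String text) {
        body.append("<p>").append(escape(text)).append("</p>\n");
    }

    @Override
    public void addTable(String name, Map<String, List<String>> columns) {
        body.append("<table id=\"").append(escape(name)).append("\">\n<tr>");
        int rows = 0;
        for (Map.Entry<String, List<String>> column : columns.entrySet()) {
            body.append("<th>").append(escape(column.getKey())).append("</th>");
            rows = Math.max(rows, column.getValue().size());
        }
        body.append("</tr>\n");
        for (int r = 0; r < rows; r++) {
            body.append("<tr>");
            for (List<String> column : columns.values()) {
                // Shorter columns are padded with empty cells.
                String cell = r < column.size() ? column.get(r) : "";
                body.append("<td>").append(escape(cell)).append("</td>");
            }
            body.append("</tr>\n");
        }
        body.append("</table>\n");
    }

    @Override
    public String complete() {
        return "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>"
                + escape(title) + "</title></head>\n<body>\n" + body + "</body></html>\n";
    }
}

class ContentCreatorSketch {
    public static void main(String[] args) {
        ContentCreator creator = new XhtmlContentCreator();
        creator.addTitle("Introduction to Tika");
        creator.addText("Some body text.");
        Map<String, List<String>> columns = new LinkedHashMap<>();
        columns.put("Format", List.of("PDF", "HTML"));
        columns.put("Extension", List.of(".pdf", ".html"));
        creator.addTable("tablename", columns);
        System.out.println(creator.complete());
    }
}
```

Even this toy version hints at Ken's concern: the interface has no place yet for per-format options such as page size or fonts, and every new target format would bring its own options and dependencies.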