tika-dev mailing list archives

From Sergey Beryozkin <sberyoz...@gmail.com>
Subject Re: HTML to PDF conversion
Date Thu, 17 Oct 2019 09:30:43 GMT
Hi Tim, All
Sure, I agree that Tika is not really about transformation, etc. That is
just not what I was suggesting, even though I started with a link to an HTML
to PDF transformer. Let me clarify one more time and then I'll be happy to
move on. To put it in practical terms:
- create a tika-format-creator (or similarly named) module
- introduce a simple generic API (similar to the prototype API earlier in
the thread) for creating simple format-specific docs, and document that it
is going to stay experimental for a while
- this API is not about transformation; it is for Tika users to create
docs directly
- provide just two implementations of this API for a start, one for PDF and
another for ODT. In time it may grow a bit to support a few more of the most
used formats, with no goal of supporting hundreds of formats. (This is why I
don't understand the maintenance concern :-) )
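
To make the points above concrete, here is a minimal sketch of what such a
creator API could look like. Everything in it is hypothetical: ContentCreator
and its methods follow the prototype quoted further down in this thread, none
of these types exist in Tika, and the byte[] return type of complete() is an
assumption. A plain-text implementation stands in for the proposed PDF and ODT
ones so that the sketch needs no third-party jars.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical generic API for creating simple documents (not part of Tika).
interface ContentCreator {
    void addTitle(String title);
    void addText(String text);
    void addTable(String name, Map<String, List<String>> columns);

    // Assumption: completing the document returns its serialized bytes.
    byte[] complete();

    // Looks up an implementation by format name. Only the stand-in "TEXT"
    // format is wired in here; PDF and ODT would be registered the same way.
    static ContentCreator get(String format) {
        if ("TEXT".equalsIgnoreCase(format)) {
            return new PlainTextCreator();
        }
        throw new IllegalArgumentException("Unsupported format: " + format);
    }
}

// Stand-in implementation that renders the document as plain text.
final class PlainTextCreator implements ContentCreator {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void addTitle(String title) {
        out.append("# ").append(title).append("\n\n");
    }

    @Override
    public void addText(String text) {
        out.append(text).append("\n\n");
    }

    @Override
    public void addTable(String name, Map<String, List<String>> columns) {
        out.append("Table: ").append(name).append('\n');
        // One line per column: the header followed by its cell values.
        columns.forEach((header, cells) ->
                out.append(header).append(": ")
                   .append(String.join(", ", cells)).append('\n'));
        out.append('\n');
    }

    @Override
    public byte[] complete() {
        return out.toString().getBytes(StandardCharsets.UTF_8);
    }
}

final class FormatCreatorSketch {
    public static void main(String[] args) {
        ContentCreator creator = ContentCreator.get("TEXT");
        creator.addTitle("Introduction to Tika");
        creator.addText("Tika reads many formats through one API.");
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("Format", Arrays.asList("PDF", "ODT"));
        creator.addTable("targets", table);
        System.out.print(new String(creator.complete(), StandardCharsets.UTF_8));
    }
}
```

The caller only ever sees ContentCreator, so adding a format would mean adding
one implementation class and one lookup branch, without touching user code.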

In the end, users would be able to use a Tika-specific API to read docs and,
for some of the most used formats, to create them.
Tika's appeal is in having a uniform API for reading N formats, so that
users don't have to write code that switches between N format-specific parser
APIs. But users who work with Tika and also need to create docs in some
formats still have to go beyond Tika, ending up with semi-generic code
after all. That was the idea I tried to convey earlier in
the thread...

Thanks all, Sergey


On Wed, Oct 16, 2019 at 5:07 PM Tim Allison <tallison@apache.org> wrote:

> +1 to Ken’s earlier point about maintenance. Note Tika wouldn’t even build
> in Germany, and we only discovered that because of inviting Tilman. :D We
> have a huge amount of maintenance already...
>
> Check out the incubating Daffodil project that aims to convert files to xml,
> validate them and then serialize back to original format.
>
> I do see a use for transform() and if we could use xhtml as an
> intermediary, then...maybe, but my inclination is with Ken.
>
> On Wed, Oct 16, 2019 at 11:50 AM Ken Krugler <kkrugler@apache.org> wrote:
>
> > I can see the attraction of one API to convert XHTML to various formats.
> >
> > Though very quickly that simple API would become complex, as each target
> > format has its own conversion options.
> >
> > And if successful, we’d pull in even more 3rd party jars to handle that
> > conversion.
> >
> > Wonder if there’s a need for a new project called “Akit”, which focuses on
> > XHTML -> various formats :)
> >
> > — Ken
> >
> > > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin <sberyozkin@gmail.com> wrote:
> > >
> > > Ken, thanks for the feedback, I meant to reply to your comments,
> > >
> > > I suppose I really meant Tika offering a uniform API to create some simple
> > > structured PDF/etc files.
> > > ContentCreator creator = ContentCreator.get("PDF");
> > > creator.addTitle("Introduction to Tika");
> > > creator.addText("");
> > > creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
> > > creator.addAttachment(someImage);
> > > creator.complete();
> > >
> > > It would be consistent with the Tika approach on the read side.
> > >
> > > Cheers, Sergey
> > > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler <kkrugler@apache.org> wrote:
> > >
> > >> If you’re suggesting ways to make it easier to use something like
> > >> YaHPConverter with Tika, definitely yes.
> > >>
> > >> If you’re talking about integrating this functionality…my personal view is
> > >> no.
> > >>
> > >> I think Tika should focus on extracting content from documents, versus
> > >> format transformations.
> > >>
> > >> Tika is an attractive location for functionality like this, since it sits
> > >> in the middle of a lot of data processing pipelines, but I worry about a
> > >> bloated code base, with corresponding challenges in maintenance and support.
> > >>
> > >> Regards,
> > >>
> > >> — Ken
> > >>
> > >>
> > >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin <sberyozkin@gmail.com> wrote:
> > >>>
> > >>> Hi All
> > >>>
> > >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
> > >>> colleagues pointed to
> > >>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> > >>>
> > >>> Does it make sense for Tika to offer something related to the text to PDF
> > >>> (for a start, something on top of that transformer), and then may be even
> > >>> for other formats ?
> > >>>
> > >>> Sergey
> > >>
> > >> --------------------------
> > >> Ken Krugler
> > >> http://www.scaleunlimited.com
> > >> custom big data solutions & training
> > >> Hadoop, Cascading, Cassandra & Solr
> > >>
> > >>
> >
> > --------------------------
> > Ken Krugler
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
> >
>
