tika-dev mailing list archives

From Sergey Beryozkin <sberyoz...@gmail.com>
Subject Re: HTML to PDF conversion
Date Wed, 16 Oct 2019 16:03:13 GMT
Such an API would of course be limited in that only fairly simple
format-specific content could be created, but many PDFs I've seen are very
simple, so I can imagine having, e.g., a TikaPDFCreator implementation of the
ContentCreator interface which would just do some simple delegation to
PDFBox.
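
To make the shape of that idea concrete, here is a rough sketch. Every name in it is hypothetical and nothing like it exists in Tika today. To keep it runnable with the JDK alone, a plain-text creator stands in for the PDF one, and unlike the ContentCreator.get("PDF") call quoted further down, the factory here also takes the output stream to write to. A TikaPDFCreator would implement the same interface and hand each call over to PDFBox.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical format-neutral write API; none of these types exist in Tika.
interface ContentCreator {
    void addTitle(String title);
    void addText(String text);
    void addTable(String name, Map<String, List<String>> columns);
    void complete();

    // Picks an implementation by format name (assumed factory signature).
    static ContentCreator get(String format, OutputStream out) {
        switch (format.toUpperCase(Locale.ROOT)) {
            case "TXT":
                return new PlainTextCreator(out);
            // case "PDF": a TikaPDFCreator delegating to PDFBox would go here.
            default:
                throw new IllegalArgumentException("Unsupported format: " + format);
        }
    }
}

// Stand-in implementation that writes plain text using only the JDK.
final class PlainTextCreator implements ContentCreator {
    private final PrintWriter writer;

    PlainTextCreator(OutputStream out) {
        this.writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    @Override
    public void addTitle(String title) {
        // Plain text has no heading style, so upper-case the title instead.
        writer.println(title.toUpperCase(Locale.ROOT));
        writer.println();
    }

    @Override
    public void addText(String text) {
        writer.println(text);
    }

    @Override
    public void addTable(String name, Map<String, List<String>> columns) {
        writer.println("[" + name + "]");
        // One line per column: the header, then its cell values.
        columns.forEach((header, cells) ->
                writer.println(header + ": " + String.join(", ", cells)));
    }

    @Override
    public void complete() {
        // checkError() flushes the writer and reports any earlier I/O failure.
        if (writer.checkError()) {
            throw new UncheckedIOException(new IOException("write failed"));
        }
    }
}
```

A caller would obtain a creator with ContentCreator.get("TXT", out), call addTitle/addText/addTable, and finish with complete(). Switching the output format would then only mean passing a different format name, which is the point of keeping the format-specific options out of the interface.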

But anyway, plenty of tools exist for it...

Cheers, Sergey

On Wed, Oct 16, 2019 at 4:59 PM Sergey Beryozkin <sberyozkin@gmail.com>
wrote:

> It was not what I was suggesting. My only proposal was about having a
> simple API (without an attempt to cover all the various format-specific
> options at the API level) which would let Tika users quickly create
> format-specific content without having to deal with the format-specific
> libraries, exactly consistent with what it does on the read side.
> I appreciate it can require some effort, and I'm by no means pushing for it.
>
> Sergey
>
> On Wed, Oct 16, 2019 at 4:50 PM Ken Krugler <kkrugler@apache.org> wrote:
>
>> I can see the attraction of one API to convert XHTML to various formats.
>>
>> Though very quickly that simple API would become complex, as each target
>> format has its own conversion options.
>>
>> And if successful, we’d pull in even more 3rd party jars to handle that
>> conversion.
>>
>> Wonder if there’s a need for a new project called “Akit”, which focuses
>> on XHTML -> various formats :)
>>
>> — Ken
>>
>> > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin <sberyozkin@gmail.com> wrote:
>> >
>> > Ken, thanks for the feedback, I meant to reply to your comments,
>> >
>> > I suppose I really meant Tika offering a uniform API to create some
>> > simple structured PDF/etc files.
>> > ContentCreator creator = ContentCreator.get("PDF");
>> > creator.addTitle("Introduction to Tika");
>> > creator.addText("");
>> > creator.addTable("tablename", new LinkedHashMap<String, List<String>>());
>> > creator.addAttachment(someImage);
>> > creator.complete();
>> >
>> > It would be consistent with the Tika approach on the read side.
>> >
>> > Cheers, Sergey
>> > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler <kkrugler@apache.org> wrote:
>> >
>> >> If you’re suggesting ways to make it easier to use something like
>> >> YaHPConverter with Tika, definitely yes.
>> >>
>> >> If you’re talking about integrating this functionality…my personal
>> >> view is no.
>> >>
>> >> I think Tika should focus on extracting content from documents, versus
>> >> format transformations.
>> >>
>> >> Tika is an attractive location for functionality like this, since it
>> >> sits in the middle of a lot of data processing pipelines, but I worry
>> >> about a bloated code base, with corresponding challenges in maintenance
>> >> and support.
>> >>
>> >> Regards,
>> >>
>> >> — Ken
>> >>
>> >>
>> >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin <sberyozkin@gmail.com> wrote:
>> >>>
>> >>> Hi All
>> >>>
>> >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
>> >>> colleagues pointed to
>> >>>
>> >>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>> >>>
>> >>> Does it make sense for Tika to offer something related to text-to-PDF
>> >>> conversion (for a start, something on top of that transformer), and then
>> >>> maybe even for other formats?
>> >>>
>> >>> Sergey
>> >>
>> >> --------------------------
>> >> Ken Krugler
>> >> http://www.scaleunlimited.com
>> >> custom big data solutions & training
>> >> Hadoop, Cascading, Cassandra & Solr
>> >>
>> >>
