tika-dev mailing list archives

From Konstantin Gribov <gros...@gmail.com>
Subject Re: [DISCUSS] Enable specific ContentHandler for tika-server
Date Fri, 06 Oct 2017 17:08:45 GMT
My +1 to this idea.

IMHO, the second option is more flexible. I also like Nick's suggestion of
using a default package for handlers and interpreting a dot-separated string
as an FQCN. Solr does something similar, and it's very convenient to use
(though they use the prefix `solr.` for classes in their predefined packages,
and any other name is interpreted as an FQCN).
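
To make the naming rule concrete, here is a minimal sketch of how such a
lookup could work. Everything in it is an assumption for illustration: the
`HandlerNames` class is hypothetical and not part of Tika, and
`org.apache.tika.sax` is only my guess at a sensible default package.

```java
import java.util.Objects;

/** Hypothetical helper for illustration only; not an existing Tika class. */
final class HandlerNames {
    // Assumed default package; the thread has not settled on one.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    private HandlerNames() {}

    /**
     * Resolves a user-supplied handler name to a class name.
     * A name containing a dot is taken as an FQCN and returned unchanged;
     * a bare name is looked up in the default package.
     */
    static String resolve(String name) {
        String trimmed = Objects.requireNonNull(name, "name").trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty handler name");
        }
        return trimmed.indexOf('.') >= 0
                ? trimmed
                : DEFAULT_PACKAGE + "." + trimmed;
    }
}
```

With this rule, `PhoneExtractingContentHandler` and a fully qualified name
from a user's own package can both be passed the same way. A real
implementation would probably also want to check the resolved class against
an allow-list before loading it.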

I'd add that you could let the user pass several comma-separated handlers
to build a content-handler stack if they want to.
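
A rough sketch of what I mean, under stated assumptions: `HandlerStack` is a
hypothetical class, and the convention that the first name listed becomes the
outermost handler is my own choice, not something decided in this thread. The
wrapping step is generic here so the example stays self-contained; in
tika-server the element type would be a SAX `ContentHandler`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical helper for illustration only; not an existing Tika class. */
final class HandlerStack {
    private HandlerStack() {}

    /** Splits a value such as "A, B,C" into [A, B, C], skipping empty entries. */
    static List<String> parseNames(String headerValue) {
        List<String> names = new ArrayList<>();
        if (headerValue == null) {
            return names;
        }
        for (String part : headerValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /**
     * Wraps the base handler with each decorator in turn.
     * Decorators are applied from last to first, so the first one
     * in the list ends up outermost and sees events first.
     */
    static <H> H stack(H base, List<UnaryOperator<H>> decorators) {
        H current = base;
        for (int i = decorators.size() - 1; i >= 0; i--) {
            current = decorators.get(i).apply(current);
        }
        return current;
    }
}
```

Each parsed name would be turned into one decorator (for example, by
reflectively calling a constructor that takes the delegate handler), and
`stack` would then chain them around the handler that writes the response.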

I would disagree with Sergey about serialized lambdas, for two reasons:
- they're useful only for Java clients;
- they could introduce very nasty bugs leading to RCE-class vulnerabilities,
so they're very controversial from a security PoV.

On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <totaropeppe@gmail.com>
wrote:

> Hi folks,
>
> if I am not wrong, currently you cannot configure a specific ContentHandler
> while using tika-server. I mean that you can configure your own parser [0]
> but you cannot control which ContentHandler the parser leverages to extract
> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
> StandardsExtractingContentHandler, etc).
> If it is correct, it would be nice to enable the use of specific
> ContentHandlers within tika-server and I would like to discuss how to solve
> this issue generally.
>
> I propose two solutions:
>
>    1. augment the TikaConfig class so that a specific ContentHandler can be
>    used in tika-config.xml;
>    2. determine the ContentHandler to use for parsing through HTTP headers,
>    for example:
>    curl -T filename.pdf http://localhost:9998/meta \
>      --header "X-Content-Handler: PhoneExtractingContentHandler"
>    This would also affect the TikaResource.java class.
>
> I look forward to having your feedback. I strongly believe that every user
> who wants to use Tika as a service through tika-server and needs to extract
> content and metadata like phone numbers, standard references, etc would be
> very happy.
>
> Thanks a lot,
> Giuseppe
>
-- 

Best regards,
Konstantin Gribov
