tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2542) Support in tika-server for getting plain text and metadata at the same time
Date Sun, 07 Jan 2018 20:54:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315471#comment-16315471 ]

Nick Burch commented on TIKA-2542:

IIRC most web proxies and servlet runtimes cap HTTP headers at around 8 KB to 16 KB in total,
so for even moderately complicated documents the Metadata will be too big to fit in the headers!
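
As a rough illustration of the size concern, the sketch below estimates how many bytes a metadata map would occupy if every entry were sent as its own HTTP header. The 8 KB limit, the function names, and the sample metadata are assumptions made for this example; actual limits depend on the proxy or servlet container in use.

```python
from typing import Dict

# Assumed cap for illustration only; real limits vary by server and proxy.
ASSUMED_HEADER_LIMIT = 8 * 1024


def header_bytes(metadata: Dict[str, str]) -> int:
    """Approximate the on-the-wire size of metadata sent as HTTP headers.

    Each entry is counted as one "Name: value" line terminated by CRLF.
    """
    return sum(len(f"{name}: {value}\r\n".encode("utf-8"))
               for name, value in metadata.items())


def fits_in_headers(metadata: Dict[str, str],
                    limit: int = ASSUMED_HEADER_LIMIT) -> bool:
    """Return True if the metadata would fit under the assumed header limit."""
    return header_bytes(metadata) <= limit


if __name__ == "__main__":
    # Hypothetical metadata: 100 entries with 100-character values.
    big = {f"X-Meta-{i}": "v" * 100 for i in range(100)}
    print(header_bytes(big), fits_in_headers(big))
```

This ignores the request line and any headers the server adds itself, so the real headroom is smaller than the estimate suggests.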

Maybe the answer would be to return something like a MIME multipart response (similar to emails),
so we can send both the Metadata and the XHTML / plain text safely in the same response without
hitting server/proxy header limits?
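
A minimal sketch of the multipart idea, using Python's standard `email` package as a stand-in for whatever the server would actually emit. The part layout (JSON metadata first, plain text second) and the function names are assumptions for illustration, not tika-server's API.

```python
import json
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_multipart(metadata, text):
    """Pack metadata and extracted text into one multipart/mixed body."""
    msg = MIMEMultipart("mixed")
    # Part 1 (assumed layout): metadata serialized as application/json.
    msg.attach(MIMEApplication(json.dumps(metadata).encode("utf-8"), "json"))
    # Part 2 (assumed layout): the extracted text as text/plain, UTF-8.
    msg.attach(MIMEText(text, "plain", "utf-8"))
    return msg.as_bytes()


def parse_multipart(raw):
    """Recover (metadata, text) from a body produced by build_multipart."""
    msg = message_from_bytes(raw)
    meta_part, text_part = msg.get_payload()
    # decode=True undoes the transfer encoding (base64 here) and yields bytes.
    metadata = json.loads(meta_part.get_payload(decode=True))
    text = text_part.get_payload(decode=True).decode("utf-8")
    return metadata, text


if __name__ == "__main__":
    raw = build_multipart({"Content-Type": "application/pdf"}, "Hello, Tika!")
    print(parse_multipart(raw))
```

Because both parts travel in the response body, neither is subject to the header size limits mentioned above.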

> Support in tika-server for getting plain text and metadata at the same time
> ---------------------------------------------------------------------------
>                 Key: TIKA-2542
>                 URL: https://issues.apache.org/jira/browse/TIKA-2542
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, server
>    Affects Versions: 1.17
>            Reporter: Manolo Caracuel
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.18
>   Original Estimate: 48h
>  Remaining Estimate: 48h
> It would be good to have a way to get a file's extracted plain text and also get the
> detected metadata. Currently you can only get the metadata if the request has an Accept
> header of text/xml or text/html, but then the text in the body is not plain text, as it
> contains HTML elements as well.
> I propose that when requesting /tika/plain with an Accept header of text/xml, an XHTML
> document is returned with the metadata in the head's meta elements and the plain text in the body.
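
The shape of the document proposed in the quoted issue can be sketched roughly as follows. This only illustrates the layout (meta elements in the head, plain text in the body); the helper names are hypothetical and this is not the code from the pull request.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_xhtml(metadata, text):
    """Build an XHTML string: metadata as <meta> elements, text in <body>."""
    ET.register_namespace("", XHTML_NS)
    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    for name, value in metadata.items():
        ET.SubElement(head, f"{{{XHTML_NS}}}meta",
                      {"name": name, "content": value})
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    body.text = text  # ElementTree escapes <, > and & on serialization
    return ET.tostring(html, encoding="unicode")


def read_xhtml(doc):
    """Recover (metadata, text) from a document built by build_xhtml."""
    ns = {"x": XHTML_NS}
    root = ET.fromstring(doc)
    metadata = {m.get("name"): m.get("content")
                for m in root.findall("x:head/x:meta", ns)}
    text = root.find("x:body", ns).text or ""
    return metadata, text


if __name__ == "__main__":
    doc = build_xhtml({"Content-Type": "application/pdf"}, "a < b")
    print(doc)
    print(read_xhtml(doc))
```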

This message was sent by Atlassian JIRA
