tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2542) Support in tika-server for getting plain text and metadata at the same time
Date Sat, 06 Jan 2018 02:03:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314293#comment-16314293 ]

ASF GitHub Bot commented on TIKA-2542:
--------------------------------------

mcaracuel opened a new pull request #216: Implementation of TIKA-2542 by mcaracuel
URL: https://github.com/apache/tika/pull/216
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Support in tika-server for getting plain text and metadata at the same time
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2542
>                 URL: https://issues.apache.org/jira/browse/TIKA-2542
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, server
>    Affects Versions: 1.17
>            Reporter: Manolo Caracuel
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.18
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good to have a way to get a file's extracted plain text together with the
> detected metadata. Currently you can only get the metadata if the request has an Accept
> header of text/xml or text/html, but then the body is not plain text because it also
> contains HTML elements.
> I propose that when /tika/plain is requested with an Accept header of text/xml, an XHTML
> document is returned with the metadata in the head's meta elements and the plain text in the body.
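
To illustrate the proposal above, here is a minimal sketch of how a client might split such a response into metadata and plain text. It assumes the response is well-formed XHTML in the standard XHTML namespace, with metadata carried as `name`/`content` attributes on `meta` elements. The sample document and the helper name `split_metadata_and_text` are hypothetical and are not taken from the pull request or from real tika-server output.

```python
import xml.etree.ElementTree as ET

# Assumption: the response uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical response body showing the proposed shape:
# metadata in <head> as <meta> elements, plain text in <body>.
SAMPLE_RESPONSE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="Content-Type" content="application/pdf"/>
    <meta name="Author" content="Jane Doe"/>
    <title>Sample</title>
  </head>
  <body>First line of plain text.
Second line of plain text.</body>
</html>
"""


def split_metadata_and_text(xhtml):
    """Return (metadata dict, plain text) from an XHTML response string."""
    root = ET.fromstring(xhtml)

    # Collect every <meta name="..." content="..."/> into a dictionary.
    metadata = {
        meta.get("name"): meta.get("content")
        for meta in root.iter(XHTML_NS + "meta")
        if meta.get("name") is not None
    }

    # Concatenate all text nodes under <body>; under the proposal the body
    # holds only plain text, so this is normally a single text node.
    body = root.find(XHTML_NS + "body")
    text = "".join(body.itertext()) if body is not None else ""
    return metadata, text


if __name__ == "__main__":
    meta, text = split_metadata_and_text(SAMPLE_RESPONSE)
    print(meta)
    print(text)
```

Note that a dictionary keeps only the last value for a repeated metadata name; a client that expects multi-valued metadata would need to collect the values into lists instead.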



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
