tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs
Date Mon, 12 Aug 2019 17:20:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905402#comment-16905402 ]

Tim Allison commented on TIKA-2910:

No need to wait for me on this one...

The boilerpipe handler isn't relevant here.  What is relevant is that we've hardcoded the
HTMLParser to handle XML in tika-server, but we don't do that in tika-app.
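
To illustrate the general point (the same XML bytes can yield different text depending on whether an HTML or an XML parser handles them), here is a minimal sketch using only the Python standard library. It is an analogy, not Tika code: the sample document and the helper names `html_text` and `xml_text` are made up for illustration, and the mechanism inside Tika's HTML parser may well differ from what Python's `html.parser` does here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: an XML document that happens to use an element
# name ("style") that carries special meaning in HTML.
SAMPLE = "<doc><section><style><p>deep text</p></style></section></doc>"


class TextCollector(HTMLParser):
    """Collects every character-data chunk the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(source):
    """Extract text by treating the input as HTML."""
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    return "".join(collector.chunks)


def xml_text(source):
    """Extract text by treating the input as XML."""
    root = ET.fromstring(source)
    return "".join(root.itertext())


if __name__ == "__main__":
    # The HTML parser applies HTML rules to <style> and reports its
    # content as raw text, markup included; the XML parser does not.
    print("as HTML:", repr(html_text(SAMPLE)))
    print("as XML: ", repr(xml_text(SAMPLE)))
```

The two functions disagree on this input because the HTML parser applies HTML element semantics to names that are just ordinary elements in the XML vocabulary. That is the same category of mismatch as routing XML through an HTML parser in one entry point but not in the other.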

On TIKA-2551, we fixed that in the master branch (Tika 2.0), but we didn't fix it in {{branch_1x}}
because it would be a change in behavior.

If fellow devs are willing to make this breaking change in the 1.x branch, we can do that
for 1.23.  Any objections to making this change in 1.x?

> Text extraction using Tika command line and Tika server differs
> ---------------------------------------------------------------
>                 Key: TIKA-2910
>                 URL: https://issues.apache.org/jira/browse/TIKA-2910
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.21
>            Reporter: Walter
>            Priority: Major
>              Labels: newbie
>         Attachments: CorpusP_25471990.xml
> When extracting TXT from the very same XML file using either Tika command line utility or the Tika in server mode, the results differ.
> It looks as if PCDATA in deeper nested XML structures are just ignored and only an empty line is returned.
> I assume both use the same base code. Are there any default settings that may differ or can be set?
