tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2646) Tika parse["content"] returns jumbled text across cells of a table in a pdf
Date Sat, 26 May 2018 19:08:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491792#comment-16491792 ]

Chris A. Mattmann commented on TIKA-2646:
-----------------------------------------

[~adidier] see comment above from [~lfcnassif]

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2646
>                 URL: https://issues.apache.org/jira/browse/TIKA-2646
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.18
>         Environment: MacOS Sierra 10.12.6
>            Reporter: Annie Didier
>            Priority: Trivial
>              Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes mixed and the words get concatenated together. For example:
>  
> ||HOURS||DUR (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>  
> In other, more serious cases, the text within a cell becomes scrambled with text from another cell, such as:
> ||HOURS||DUR (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / TESTING|E - RIG OUT TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>  
> TESTERS
>  
> 33 P -
>  
> FLOWBACK /
>  
> TESTING
> Note that the value of the second column has moved to the first column, and the "-" within the first column is misordered. The last two columns have switched places.
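
The misordering described above can be checked mechanically. The following is a minimal sketch in plain Python, not part of Tika or tika-python: the helper names cell_positions and out_of_order are hypothetical, and the row and extracted text are copied from the report. It assumes each cell should appear intact, and in table order, in the extracted text.

```python
def cell_positions(expected_cells, extracted_text):
    """Index where each expected cell first appears in the extracted text, or -1."""
    # Collapse all runs of whitespace so line breaks inside a cell do not matter.
    flat = " ".join(extracted_text.split())
    return [flat.find(" ".join(cell.split())) for cell in expected_cells]


def out_of_order(expected_cells, extracted_text):
    """List (cell, problem) pairs for cells that are missing, split, or misplaced."""
    problems = []
    last = -1
    for cell, pos in zip(expected_cells, cell_positions(expected_cells, extracted_text)):
        if pos == -1:
            # The cell text does not appear contiguously anywhere.
            problems.append((cell, "missing or split"))
        elif pos < last:
            # The cell appears earlier than a cell that precedes it in the table.
            problems.append((cell, "out of order"))
        else:
            last = pos
    return problems


# Row and extracted text as given in the report (hypothetical variable names).
row = ["00:00 - 17:00", "17.00", "FLOWBK",
       "33 P - FLOWBACK / TESTING", "E - RIG OUT TESTERS"]
extracted = ("17.00-00:00 17:00 FLOWBK E - RIG OUT\n \nTESTERS\n \n"
             "33 P -\n \nFLOWBACK /\n \nTESTING")

for cell, problem in out_of_order(row, extracted):
    print(f"{cell!r}: {problem}")
```

On the reported row this flags the first column as split and the last column as out of order, which matches the note above. It only compares text, so it says nothing about why the parser emitted the cells in that order.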



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
