tika-dev mailing list archives

From "Annie Didier (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2646) Tika parse["content"] returns jumbled text across cells of a table in a pdf
Date Mon, 21 May 2018 15:53:00 GMT
Annie Didier created TIKA-2646:
----------------------------------

             Summary: Tika parse["content"] returns jumbled text across cells of a table in a pdf
                 Key: TIKA-2646
                 URL: https://issues.apache.org/jira/browse/TIKA-2646
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.18
         Environment: MacOS Sierra 10.12.6
            Reporter: Annie Didier


When text from a table is extracted, sometimes the order of the cells becomes mixed and the
words get concatenated together. For example:

 
||HOURS||DUR (hr)||PHASE||CODE||SUB||DESCRIPTION||

becomes: Hours Dur Code Sub DescriptionPhase

 

In other, more serious cases, the text within a cell becomes scrambled with text from another
cell. For example:
||HOURS||DUR (hr)||PHASE||CODE||SUB||
|00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / TESTING|E - RIG OUT TESTERS|

the second row becomes:

17.00-00:00 17:00 FLOWBK E - RIG OUT
TESTERS
33 P -
FLOWBACK /
TESTING

Note that the value of the second column has moved to the first column, and the "-" within
the first column is misordered. The last two columns have switched places.
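
The misordering described above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names `normalize` and `cells_in_order` are hypothetical, and the extracted string is assumed to be whatever parse["content"] returned for the PDF. It only tests whether the expected cell values appear left to right in the extracted text, ignoring line breaks and repeated spaces.

```python
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, newlines, NBSP) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def cells_in_order(extracted: str, cells: Sequence[str]) -> bool:
    """Return True if each cell value occurs in `extracted`, in the given order.

    Hypothetical helper for illustration: it searches for each cell after the
    end of the previous match, so swapped or scrambled cells make it return False.
    """
    haystack = normalize(extracted)
    pos = 0
    for cell in cells:
        needle = normalize(cell)
        idx = haystack.find(needle, pos)
        if idx == -1:
            return False
        pos = idx + len(needle)
    return True


# Expected cell values of the second row, taken from the table in the report.
expected_row = [
    "00:00 - 17:00",
    "17.00",
    "FLOWBK",
    "33 P - FLOWBACK / TESTING",
    "E - RIG OUT TESTERS",
]

# The jumbled text as quoted in the report.
reported = "17.00-00:00 17:00 FLOWBK E - RIG OUT\nTESTERS\n33 P -\nFLOWBACK /\nTESTING"

# What a correctly ordered extraction of the same row would look like (assumption).
well_ordered = "00:00 - 17:00 17.00 FLOWBK 33 P - FLOWBACK / TESTING E - RIG OUT TESTERS"

print(cells_in_order(reported, expected_row))
print(cells_in_order(well_ordered, expected_row))
```

A check of this kind could serve as a regression test for a given PDF; it does not say anything about why the parser emits the cells in that order.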



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
