[ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491792#comment-16491792 ]

Chris A. Mattmann commented on TIKA-2646:
-----------------------------------------

[~adidier] see comment above from [~lfcnassif]

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2646
>                 URL: https://issues.apache.org/jira/browse/TIKA-2646
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.18
>         Environment: MacOS Sierra 10.12.6
>            Reporter: Annie Didier
>            Priority: Trivial
>              Labels: performance
>
> When text from a table is extracted, the order of the cells is sometimes mixed up and words are concatenated. For example:
>
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>
> In other, more serious cases, the text within a cell becomes scrambled with text from another cell. For example:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK /
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>
> TESTERS
>
> 33 P -
>
> FLOWBACK /
>
> TESTING
> Note that the value of the second column has moved to the first column, and the "-" within the first column is misordered. The last two columns have switched places.
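
A minimal sketch of one way reordering like the header example could arise, assuming (this is not confirmed anywhere in the issue) that the PDF stores table text as positioned fragments in an order that differs from the visual layout. The `Fragment` type, the coordinates, and both helper functions are hypothetical illustrations and are not Tika APIs. The sketch only contrasts emitting fragments in stored order with sorting them top-to-bottom, then left-to-right.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical positioned text fragment (not a Tika type)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def in_stored_order(fragments: List[Fragment]) -> str:
    """Join fragments exactly in the order they were stored."""
    return " ".join(f.text for f in fragments)


def sorted_by_position(fragments: List[Fragment], row_tolerance: float = 2.0) -> str:
    """Group fragments into rows by y, then order each row by x."""
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # A fragment joins the current row if its y is close to the row's first fragment.
        if rows and abs(rows[-1][0].y - frag.y) <= row_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return " ".join(f.text for row in rows for f in sorted(row, key=lambda f: f.x))


# Made-up coordinates for the header row from the report; PHASE is stored last
# even though it sits visually between DUR and CODE.
header = [
    Fragment("HOURS", 10, 100),
    Fragment("DUR", 60, 100),
    Fragment("CODE", 160, 100),
    Fragment("SUB", 210, 100),
    Fragment("DESCRIPTION", 260, 100),
    Fragment("PHASE", 110, 100),
]

print(in_stored_order(header))     # PHASE appears at the end
print(sorted_by_position(header))  # PHASE appears between DUR and CODE
```

Whether stored order is actually the cause for the reporter's PDF is not established here. Cells that wrap over several lines, as in the second example, would also need column boundaries to be reconstructed, which this sketch does not attempt.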