[ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486646#comment-16486646 ]
Luis Filipe Nassif commented on TIKA-2646:
------------------------------------------
Tika does not maintain table structures, but have you tried enabling the sortByPosition
parameter in the Tika config or PDFParserConfig?
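
Since the report uses parse["content"] from the tika-python client, here is a minimal sketch of one way to try this from Python. It assumes that the installed tika-python version accepts a headers argument on parser.from_file, and that tika-server maps an "X-Tika-PDFsortByPosition" request header onto PDFParserConfig. Both are assumptions worth checking against your versions. The helper names are hypothetical.

```python
def sort_by_position_headers(enabled=True):
    """Build request headers asking tika-server to sort PDF text by position."""
    # Assumption: tika-server applies "X-Tika-PDF<param>" headers to PDFParserConfig.
    return {"X-Tika-PDFsortByPosition": "true" if enabled else "false"}


def extract_content(pdf_path):
    """Parse a PDF through tika-python with sortByPosition requested."""
    # Imported lazily so the header helper can be used without tika installed.
    from tika import parser

    # Assumption: this tika-python version supports the `headers` keyword.
    parsed = parser.from_file(pdf_path, headers=sort_by_position_headers())
    return parsed["content"]
```

Calling extract_content on the affected PDF and comparing the result with the default output should show whether position sorting helps for these tables. It may still not reproduce the table structure.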
> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
> Key: TIKA-2646
> URL: https://issues.apache.org/jira/browse/TIKA-2646
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.18
> Environment: MacOS Sierra 10.12.6
> Reporter: Annie Didier
> Priority: Trivial
> Labels: performance
>
> When text is extracted from a table, the order of the cells is sometimes mixed up and
> words get concatenated. For example:
>
> ||HOURS||DUR (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>
> In other, more serious cases, the text within a cell becomes scrambled with text from
> another cell, such as:
> ||HOURS||DUR (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / TESTING|E - RIG OUT TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>
> TESTERS
>
> 33 P -
>
> FLOWBACK /
>
> TESTING
> Note that the value of the second column has moved to the first column, and the "-" within
the first column is misordered. The last two columns have switched places.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)