tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-2646) Tika parse["content"] returns jumbled text across cells of a table in a pdf
Date Tue, 22 May 2018 12:28:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2646.
-------------------------------
    Resolution: Won't Fix

[~adidier] thank you for opening this issue and sharing this with us.  PDFs don't store table
structures per se (as MSWord/PPT do); rather, they store text positioned by coordinates on a
page.  Tables have to be inferred/reconstructed from those coordinates.  Neither Apache Tika nor
Apache PDFBox currently infers or reconstructs tables.
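
To illustrate what "inferring a table from coordinates" involves, here is a minimal sketch in Python. It is not how Tika, PDFBox, or tabula work internally; the `Fragment` type, the `group_into_rows` helper, the tolerance value, and all coordinates are hypothetical and chosen only for illustration. It assumes each piece of text arrives with an (x, y) position, groups pieces with a similar y into rows, and orders each row from left to right.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text plus the position of its origin (hypothetical type)."""
    text: str
    x: float
    y: float


def group_into_rows(fragments, y_tolerance=2.0):
    """Group fragments with similar y into rows, then sort each row by x."""
    rows = []  # each entry is [reference_y, [fragments in that row]]
    # PDF user space has y growing upward, so a larger y is higher on the page.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0] - frag.y) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append([frag.y, [frag]])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for _, row in rows]


# Fragments listed in an arbitrary order, as they might appear in a content stream.
# The coordinates are invented for this example.
fragments = [
    Fragment("17.00", 120, 700.5),
    Fragment("00:00 - 17:00", 40, 700),
    Fragment("PHASE", 200, 720.4),
    Fragment("HOURS", 40, 720),
    Fragment("FLOWBK", 200, 699.5),
    Fragment("DUR (hr)", 120, 720),
]

for row in group_into_rows(fragments):
    print(" | ".join(row))
```

A heuristic this simple breaks down as soon as a cell wraps onto several lines, because the wrapped lines sit at different y values and look like separate rows. That is part of why table reconstruction is left to a dedicated tool.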

You might want to look into https://github.com/tabulapdf/tabula-java (which uses PDFBox) to
extract tables.

If you'd like to reopen this issue and request that we integrate tabula into Tika, please
do so.  I'm not sure I'd have the time to do it any time soon, but someone else may.

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2646
>                 URL: https://issues.apache.org/jira/browse/TIKA-2646
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.18
>         Environment: MacOS Sierra 10.12.6
>            Reporter: Annie Didier
>            Priority: Trivial
>              Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes mixed and
> the words get concatenated together. For example:
>  
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>  
> In other, more serious cases, the text within a cell becomes scrambled with text from
> another cell, such as:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / 
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>  
> TESTERS
>  
> 33 P -
>  
> FLOWBACK /
>  
> TESTING
> Note that the value of the second column has moved to the first column, and the "-" within
> the first column is misordered. The last two columns have switched places.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
