tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly
Date Thu, 10 May 2018 16:46:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470703#comment-16470703 ]

Nick Burch commented on TIKA-2479:

Having hit a similar thing with TIKA-2641, I'm tempted to make the XLS and XLSX parsers output
missing left/mid cells up to a limit, but ignore missing rows and missing right-hand cells.
That would prevent very sparse spreadsheets from suddenly generating far more text output
than they currently do, whilst giving us the correct table layout for files with just the
odd missing cell.
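
To make the proposal concrete, here is a rough sketch of the padding rule. It is not the attached patch and not Tika's parser code: the class SparseRowSketch, its methods and the maxGap parameter are hypothetical names, and it assumes the limit applies to each gap separately, which the plan above does not spell out. Cell text is not XML-escaped, to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustrative only: hypothetical names, not Tika's API. */
public class SparseRowSketch {

    /**
     * Renders one row. The map goes from 0-based column index to cell text
     * and holds only the cells actually present in the file.
     * maxGap is an assumed per-gap cap on the number of empty cells emitted.
     */
    public static String renderRow(SortedMap<Integer, String> cells, int maxGap) {
        StringBuilder sb = new StringBuilder("<tr>");
        int nextCol = 0; // first column that has not been written yet
        for (Map.Entry<Integer, String> cell : cells.entrySet()) {
            // Pad missing left/mid cells before this one, but never more than maxGap.
            int missing = cell.getKey() - nextCol;
            for (int i = 0; i < Math.min(missing, maxGap); i++) {
                sb.append("<td/>");
            }
            sb.append("<td>").append(cell.getValue()).append("</td>");
            nextCol = cell.getKey() + 1;
        }
        // Nothing is emitted after the last populated cell: missing right cells are ignored.
        return sb.append("</tr>").toString();
    }

    /** Renders only the rows that exist and have content; missing rows are skipped. */
    public static List<String> renderSheet(
            SortedMap<Integer, SortedMap<Integer, String>> rows, int maxGap) {
        List<String> out = new ArrayList<>();
        for (SortedMap<Integer, String> row : rows.values()) {
            if (!row.isEmpty()) {
                out.add(renderRow(row, maxGap));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> row = new TreeMap<>();
        row.put(0, "a");
        row.put(2, "c"); // column 1 is missing from the file
        System.out.println(renderRow(row, 3));
    }
}
```

With a row holding only columns 0 and 2, the gap at column 1 becomes a single <td/>, so the table layout is preserved. A row whose only cell sits far to the right gets at most maxGap empty cells in front of it, which is what keeps very sparse sheets from ballooning.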

I don't want to suddenly make the output from sparse files huge, and I'd rather not add too
many config options for people to need to play around with, but equally we want to try to
avoid surprises for users.

Anyone have any thoughts / suggestions / objections to that plan, before I apply a slightly
modified form of the attached pull request + matching changes for XLS?

> Handle empty cells in tables uniformly
> --------------------------------------
>                 Key: TIKA-2479
>                 URL: https://issues.apache.org/jira/browse/TIKA-2479
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: patch.diff
> It looks like we output a <td/> for empty cells in xls, and tables in doc, docx
> and pptx.  However, we don't retain empty cells in xlsx or tables in ppt.  We should make
> this handling uniform.

This message was sent by Atlassian JIRA
