tika-dev mailing list archives

From "Geoff Baskwill (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly
Date Fri, 11 May 2018 01:20:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471358#comment-16471358 ]

Geoff Baskwill commented on TIKA-2479:

Hi [~gagravarr] ... what I found when rendering HTML tables that didn't have the right
number of cells was that they looked really bad (and were semantically incorrect) without the
missing rows and right-hand cells. I suppose a post-processing step could go through and fill in
the missing columns for people who knew about this behaviour, but we can't fix the missing rows
in post-processing, as the knowledge that the rows are missing is lost by then.
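
To make the post-processing idea concrete, here is a minimal sketch of the column-filling step, assuming the table has already been read into a list of rows of cell strings. The class and method names (RowPadder, padRows) are made up for illustration and are not part of Tika. It also assumes the missing cells are always the trailing ones on the right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: pads short rows so every row has
// as many cells as the widest row in the table.
public class RowPadder {

    public static List<List<String>> padRows(List<List<String>> rows) {
        // First pass: find the widest row.
        int width = 0;
        for (List<String> row : rows) {
            width = Math.max(width, row.size());
        }

        // Second pass: copy each row and append empty cells on the right.
        List<List<String>> padded = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> copy = new ArrayList<>(row);
            while (copy.size() < width) {
                copy.add(""); // stands in for an empty <td/>
            }
            padded.add(copy);
        }
        return padded;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList("a", "b", "c"));
        rows.add(Arrays.asList("d"));
        System.out.println(padRows(rows));
    }
}
```

Note what the sketch cannot do: a row that was dropped entirely never appears in the input list, so nothing here can tell that it should be re-inserted. That is the part that has to be handled at extraction time.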

I would agree that it would be preferable not to add config options, but perhaps there's no
other way to balance between "I'd like to get an HTML table that accurately represents the
sheet content so I can properly extract meaning from it" and "I'd like to have something close
to the old behaviour and amount of output when my sheet has sparse data"?

The motivation for trying to get an accurate representation comes from an accessibility project
I was working on with sheets that had merged cells (another problem that I didn't manage to
fully solve in the time I had available). Without the correct number of cells, the merging
goes wrong very quickly.


> Handle empty cells in tables uniformly
> --------------------------------------
>                 Key: TIKA-2479
>                 URL: https://issues.apache.org/jira/browse/TIKA-2479
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: patch.diff
> It looks like we output a <td/> for empty cells in xls, and tables in doc, docx
> and pptx.  However, we don't retain empty cells in xlsx or tables in ppt.  We should make
> this handling uniform.

This message was sent by Atlassian JIRA
