tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2641) Unit test for consistency between tabular/columnar formats
Date Thu, 03 May 2018 20:42:00 GMT
Nick Burch created TIKA-2641:
--------------------------------

             Summary: Unit test for consistency between tabular/columnar formats
                 Key: TIKA-2641
                 URL: https://issues.apache.org/jira/browse/TIKA-2641
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.18, 2.0
            Reporter: Nick Burch


We now have a number of parsers that deal with file formats which are either wholly or optionally
"table-based", with consistent data types within a given column. This includes
multi-table formats like sqlite, single-table formats like sas7bdat, and anything-goes-table
formats like csv or xlsx.

We should first try to create a simple-ish, small but rich file for each of these formats,
similar to what we do for archive formats with the {{test-documents}} archives. Then, we should
add unit tests that verify that, as far as the formats permit, you get basically the same XHTML
out for the "same" input. Oh, and fix up any obvious inconsistencies...
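
As a rough illustration of the kind of check such a unit test could make, here is a minimal sketch that uses only the JDK. It assumes the XHTML produced for each test file is already available as a string, and it compares only the table cell text, ignoring wrapper elements and whitespace. The class and method names (TableConsistencySketch, extractTables) are hypothetical and not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not an existing Tika class.
final class TableConsistencySketch {

    /** Returns every table found in the XHTML as rows of trimmed cell text. */
    static List<List<List<String>>> extractTables(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so getLocalName() is populated
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<List<List<String>>> tables = new ArrayList<>();
        NodeList tableNodes = doc.getElementsByTagNameNS("*", "table");
        for (int t = 0; t < tableNodes.getLength(); t++) {
            Element table = (Element) tableNodes.item(t);
            List<List<String>> rows = new ArrayList<>();
            // Note: rows of nested tables would also be picked up here.
            NodeList rowNodes = table.getElementsByTagNameNS("*", "tr");
            for (int r = 0; r < rowNodes.getLength(); r++) {
                List<String> cells = new ArrayList<>();
                NodeList children = rowNodes.item(r).getChildNodes();
                for (int c = 0; c < children.getLength(); c++) {
                    if (!(children.item(c) instanceof Element)) {
                        continue; // skip whitespace text nodes between cells
                    }
                    Element cell = (Element) children.item(c);
                    String name = cell.getLocalName();
                    if ("td".equals(name) || "th".equals(name)) {
                        cells.add(cell.getTextContent().trim());
                    }
                }
                rows.add(cells);
            }
            tables.add(rows);
        }
        return tables;
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the XHTML two different parsers might emit for the "same" data.
        String fromFormatA = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<table><tr><th>id</th><th>name</th></tr>"
                + "<tr><td>1</td><td>alpha</td></tr></table></body></html>";
        String fromFormatB = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><div>"
                + "<table>\n<tr> <th>id</th> <th>name</th> </tr>\n"
                + "<tr> <td> 1 </td> <td>alpha</td> </tr>\n</table></div></body></html>";
        System.out.println(extractTables(fromFormatA).equals(extractTables(fromFormatB)));
    }
}
```

In a real test the two strings would come from running each format's parser over its test file. How strict the comparison should be (headers, table names, number and date formatting) is exactly the question this issue raises, so the equality check above is only a starting point.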



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
