tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2641) Unit test for consistency between tabular/columnar formats
Date Thu, 03 May 2018 20:42:00 GMT
Nick Burch created TIKA-2641:

             Summary: Unit test for consistency between tabular/columnar formats
                 Key: TIKA-2641
                 URL: https://issues.apache.org/jira/browse/TIKA-2641
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.18, 2.0
            Reporter: Nick Burch

We now have a number of parsers that deal with file formats which are either wholly or optionally
"table-based", with consistent data types within a given column. These include multi-table
formats like sqlite, single-table formats like sas7bdat, and anything-goes table formats like
csv or xlsx.

We should first create a small but rich test file for each of these formats, similar to what we
do for archive formats with the {{test-documents}} archives. Then we should add unit tests that
verify that, as far as each format permits, you get essentially the same XHTML out for the
"same" input. We should also fix any obvious inconsistencies that the tests turn up.
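
As a rough illustration of the kind of check such a test could make, here is a minimal sketch. It
assumes the XHTML for each test file has already been obtained as a string (for example by running
the parser with a content handler that serialises to XML) and that table elements are unprefixed.
The class and method names are hypothetical and not part of Tika; only JDK classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika class: reduces parser XHTML to plain table rows
// so that output from different tabular formats can be compared.
final class TableConsistencySketch {

    // Returns one list of trimmed cell texts per <tr> found anywhere in the XHTML.
    // Wrapper markup (div, tbody, th versus td) is deliberately ignored.
    static List<List<String>> extractRows(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                List<String> cells = new ArrayList<>();
                // Only direct children of the row count as its cells.
                for (Node c = trs.item(i).getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c instanceof Element
                            && ("td".equals(c.getNodeName()) || "th".equals(c.getNodeName()))) {
                        cells.add(c.getTextContent().trim());
                    }
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    // True when both XHTML documents contain the same rows and cells in the same order.
    static boolean consistent(String xhtmlA, String xhtmlB) {
        return extractRows(xhtmlA).equals(extractRows(xhtmlB));
    }
}
```

A unit test could then call {{consistent(...)}} on the output for, say, the csv and xlsx versions
of the same test table. Comparing only cell text is a deliberately loose notion of "basically the
same"; a real test might also want to compare sheet or table names and how typed values such as
dates are rendered.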

This message was sent by Atlassian JIRA
