tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2641) Unit test for consistency between tabular/columnar formats
Date Thu, 03 May 2018 21:57:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463118#comment-16463118 ]

Hudson commented on TIKA-2641:

UNSTABLE: Integrated in Jenkins build Tika-trunk #1481 (See [https://builds.apache.org/job/Tika-trunk/1481/])
Stub a unit test for TIKA-2641 (nick: [https://github.com/apache/tika/commit/d4719f63ffb381dbbfc53e667379389cb26593c1])
* (add) tika-parsers/src/test/java/org/apache/tika/parser/TabularFormatsTest.java

> Unit test for consistency between tabular/columnar formats
> ----------------------------------------------------------
>                 Key: TIKA-2641
>                 URL: https://issues.apache.org/jira/browse/TIKA-2641
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.18
>            Reporter: Nick Burch
>            Priority: Minor
> We now have a number of parsers which deal with file formats which are either wholly
> or optionally "table-based" formats with consistency in the data types held in a given column.
> This includes multi-table formats like sqlite, single-table formats like sas7bdat, and anything-goes-table
> formats like csv or xlsx.
> We should firstly try to create a simple-ish, small but rich file for each of these formats,
> similar to what we do for archive formats with the {{test-documents}} archives. Then, we should
> add unit tests that verify that, as much as formats permit, you get basically the same XHTML
> out for the "same" input. Oh, and fix up any obvious inconsistencies...
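
The consistency check described in the issue could be sketched roughly as below. This is only an illustration using the JDK's XML parser, not the contents of TabularFormatsTest.java: the class name, the method names and the two XHTML strings are hypothetical stand-ins for the output the real parsers would produce for equivalent test files. It compares the cell text of each table row and ignores wrapper elements and surrounding whitespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names and sample data are assumptions for illustration only.
final class TabularXhtmlComparison {

    /** Returns the trimmed text of every td/th cell, grouped by tr row. */
    static List<List<String>> extractRows(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace-unaware, so "tr" matches with or without an xmlns declaration.
            factory.setNamespaceAware(false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                List<String> cells = new ArrayList<>();
                NodeList children = trs.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child instanceof Element) {
                        String name = ((Element) child).getTagName();
                        if (name.equals("td") || name.equals("th")) {
                            cells.add(child.getTextContent().trim());
                        }
                    }
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    /** True if both documents contain the same rows of cell text, in the same order. */
    static boolean sameTable(String xhtmlA, String xhtmlB) {
        return extractRows(xhtmlA).equals(extractRows(xhtmlB));
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for parser output, not real Tika output.
        String fromCsv = "<html><body><table>"
                + "<tr><td>id</td><td>name</td></tr>"
                + "<tr><td>1</td><td>apple</td></tr>"
                + "</table></body></html>";
        String fromXlsx = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table><tbody>\n"
                + "<tr><th>id</th><th>name</th></tr>\n"
                + "<tr><td> 1 </td><td>apple</td></tr>"
                + "</tbody></table></body></html>";
        System.out.println(sameTable(fromCsv, fromXlsx)); // prints "true"
    }
}
```

Comparing extracted cell text rather than raw markup keeps the check tolerant of harmless differences (th versus td, an extra tbody, whitespace) while still catching differing values, which seems to match the "as much as formats permit" caveat above.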

This message was sent by Atlassian JIRA
