tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2025) Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results
Date Fri, 22 Jul 2016 13:17:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389470#comment-15389470 ]

Hudson commented on TIKA-2025:
------------------------------

FAILURE: Integrated in tika-2.x-windows #26 (See [https://builds.apache.org/job/tika-2.x-windows/26/])
TIKA-2025 increase number of significant digits extracted in "general" (tallison: rev f4bacf859650abbe438d7e19d6c0abdcd72a5b34)
* tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xlsx
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelDataFormatter.java
* tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xls
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelGeneralFormat.java
* CHANGES.txt


> Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2025
>                 URL: https://issues.apache.org/jira/browse/TIKA-2025
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Aeham Abushwashi
>            Assignee: Tim Allison
>             Fix For: 2.0, 1.14
>
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “340229177292566” is extracted from the attached spreadsheet as 3.40229E+14, which clearly is not the desired output.
> This works as expected in 1.12 and earlier. I suspect POI’s recent use of org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
> I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.
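
The symptom described above can be reproduced with plain JDK formatting, without Tika or POI. The sketch below is only an illustration under stated assumptions: the class name BigNumberFormatDemo and both helper methods are hypothetical, and neither is the code used by Tika, POI, or the committed fix. It shows how limiting a 15-digit value to six significant digits produces the reported 3.40229E+14, while a plain-decimal rendering keeps every digit.

```java
import java.math.BigDecimal;
import java.util.Locale;

// Hypothetical demo class; not part of Tika or POI.
final class BigNumberFormatDemo {

    // Assumption: approximates a "General"-style formatter that keeps only
    // six significant digits (one before the decimal point, five after).
    static String sixSignificantDigits(double value) {
        return String.format(Locale.ROOT, "%.5E", value);
    }

    // Renders the double's exact value without an exponent.
    // Integers below 2^53 are represented exactly by a double,
    // so all digits of a 15-digit number survive.
    static String plainDigits(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // Excel stores numeric cells as doubles; this is the value from the report.
        double cellValue = 340229177292566.0;

        System.out.println(sixSignificantDigits(cellValue)); // prints 3.40229E+14
        System.out.println(plainDigits(cellValue));          // prints 340229177292566
    }
}
```

The second helper only illustrates that the digits are still present in the stored double; how many significant digits the actual fix emits for "General" cells is not stated in this message.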



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
