tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2874) Parsing of 4 mb excel file generates 164 mb worth of words
Date Wed, 15 May 2019 12:45:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840365#comment-16840365 ]

Tim Allison edited comment on TIKA-2874 at 5/15/19 12:44 PM:
-------------------------------------------------------------

Well, that's exciting! :P  

When you decompress the zip file, the first sheet is 100MB.

When I look at the shared strings file, I see: {{count="331360" uniqueCount="143"}} so, yeah,
there's a lot of duplicated data, but this isn't a Tika problem...I don't think.  The issue
is that you can't actually see this easily in Excel.

In short, between the decompression and the "pointers" that cells use to reference the shared
strings file, I'm not surprised that you're getting 150MB.
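
For anyone who wants to repeat the two checks above without unzipping by hand, here is a minimal sketch using only the JDK (no Tika or POI). The class and method names (XlsxInflationCheck, uncompressedSizes, sharedStringCounts) are made up for illustration. It assumes the workbook keeps its shared strings at the conventional path xl/sharedStrings.xml and that the root element carries both the count and uniqueCount attributes, as the attached file does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika: inspects an .xlsx as a plain zip archive.
final class XlsxInflationCheck {

    /** Returns the uncompressed size, in bytes, of every entry in the .xlsx (zip) stream. */
    static Map<String, Long> uncompressedSizes(InputStream xlsx) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                long total = 0;
                int n;
                // read() returns -1 at the end of the current entry, not of the whole archive
                while ((n = zip.read(buffer)) != -1) {
                    total += n;
                }
                sizes.put(entry.getName(), total);
            }
        }
        return sizes;
    }

    /**
     * Returns {count, uniqueCount} from the root element of xl/sharedStrings.xml,
     * or null if the archive has no such entry. Assumes both attributes are present;
     * a NumberFormatException is thrown otherwise.
     */
    static long[] sharedStringCounts(InputStream xlsx) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // do not process DTDs
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("xl/sharedStrings.xml")) {
                    continue;
                }
                XMLStreamReader reader = factory.createXMLStreamReader(zip);
                while (reader.hasNext()) {
                    // The first START_ELEMENT is the root element of the shared strings part
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return new long[] {
                            Long.parseLong(reader.getAttributeValue(null, "count")),
                            Long.parseLong(reader.getAttributeValue(null, "uniqueCount"))
                        };
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        try (InputStream in = Files.newInputStream(file)) {
            uncompressedSizes(in).forEach((name, size) -> System.out.println(size + "\t" + name));
        }
        try (InputStream in = Files.newInputStream(file)) {
            long[] counts = sharedStringCounts(in);
            if (counts != null) {
                System.out.println("count=" + counts[0] + " uniqueCount=" + counts[1]);
            }
        }
    }
}
```

Run against the attachment, the first listing should show the large sheet entry, and the second line should show how many references point at how few unique strings. A large gap between count and uniqueCount means the text is heavily repeated in the workbook itself, before any parser touches it.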


> Parsing of 4 mb excel file generates 164 mb worth of words
> ----------------------------------------------------------
>
>                 Key: TIKA-2874
>                 URL: https://issues.apache.org/jira/browse/TIKA-2874
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>            Reporter: Sébastien Nussbaumer
>            Priority: Major
>         Attachments: excel_that_generates_huge_number_of_words.xlsx, tika-config.xml
>
>
> When I parse the attached 4 MB Excel file, I get 164 MB worth of words. When checking
> the words, I see that some cells are repeated *many hundreds of thousands* of times.
> I tried passing the words through the uniq Linux command-line utility and got a file
> with a much more reasonable size of 16 KB.
> This is the code I use:
> {code:java}
> TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
> Detector detector = config.getDetector();
> Parser autoDetectParser = new AutoDetectParser(config);
> Tika tika = new Tika(detector, autoDetectParser);
> try (LanguageWriter languageWriter = new LanguageWriter(LanguageDetector.getDefaultLanguageDetector().loadModels());
>         OutputStreamWriter outputStreamWriter = new OutputStreamWriter(output, StandardCharsets.UTF_8);
>         CompositeWriter compositeWriter = new CompositeWriter(outputStreamWriter, languageWriter)) {
>     WriteOutContentHandler handler = new WriteOutContentHandler(compositeWriter, indexedChars);
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
> } 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)