tika-dev mailing list archives

From Sébastien Nussbaumer (JIRA) <j...@apache.org>
Subject [jira] [Updated] (TIKA-2874) Parsing of 4 mb excel file generates 164 mb worth of words
Date Wed, 15 May 2019 07:29:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sébastien Nussbaumer updated TIKA-2874:
---------------------------------------
    Description: 
When I parse the attached 4 MB Excel file, I get 164 MB worth of words. Inspecting the
output, I see that some cells are repeated *many hundreds of thousands* of times.

I passed the words through the Linux uniq command-line utility and got a file with
a much more reasonable size of 16 KB.

This is the code I use:

{code:java}
TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);
Tika tika = new Tika(detector, autoDetectParser);
try (LanguageWriter languageWriter = new LanguageWriter(LanguageDetector.getDefaultLanguageDetector().loadModels());
        OutputStreamWriter outputStreamWriter = new OutputStreamWriter(output, StandardCharsets.UTF_8);
        CompositeWriter compositeWriter = new CompositeWriter(outputStreamWriter, languageWriter))
{

    WriteOutContentHandler handler = new WriteOutContentHandler(compositeWriter, indexedChars);
    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
} 
{code}
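
For reference, the uniq check described above can be approximated in Java. This is a minimal sketch and not part of Tika: the class name UniqSketch and its Run helper are hypothetical, and it assumes the extracted text was first written to a UTF-8 file with one word or cell value per line. Like uniq -c, it collapses only consecutive identical lines and reports how often each one was repeated, which helps show which cells are being duplicated.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not a Tika class: mimics `uniq -c` on extracted text.
public class UniqSketch {

    /** One run of identical consecutive lines and how many times it occurred. */
    public static final class Run {
        public final String line;
        public final long count;

        Run(String line, long count) {
            this.line = line;
            this.count = count;
        }
    }

    /** Collapses consecutive duplicate lines into runs, preserving order. */
    public static List<Run> collapse(Iterable<String> lines) {
        List<Run> runs = new ArrayList<>();
        String current = null;
        long count = 0;
        for (String line : lines) {
            if (count > 0 && line.equals(current)) {
                count++; // same line as the previous one: extend the run
            } else {
                if (count > 0) {
                    runs.add(new Run(current, count)); // close the previous run
                }
                current = line;
                count = 1;
            }
        }
        if (count > 0) {
            runs.add(new Run(current, count)); // close the final run
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: UniqSketch <extracted-text-file>");
            return;
        }
        // Assumption: args[0] is the UTF-8 text previously written by the parser.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            for (Run run : collapse(lines::iterator)) {
                System.out.println(run.count + "\t" + run.line);
            }
        }
    }
}
```

Runs with very large counts point at the cell values that the parser emits repeatedly.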


  was:
When I parse the attached 4 MB Excel file, I get 164 MB worth of words. When checking out
the words I see that some cells are repeated *many hundreds of thousands* of times.

I tried passing the words through the uniq Linux command-line utility and got a file with
a much more reasonable 16 KB file.

This is the code I use:

{code:java}
TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);
Tika tika = new Tika(detector, autoDetectParser);
try (LanguageWriter languageWriter = new LanguageWriter(LanguageDetector.getDefaultLanguageDetector().loadModels());
        OutputStreamWriter outputStreamWriter = new OutputStreamWriter(output, StandardCharsets.UTF_8);
        CompositeWriter compositeWriter = new CompositeWriter(outputStreamWriter, languageWriter))
{

    WriteOutContentHandler handler = new WriteOutContentHandler(compositeWriter, indexedChars);
    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
} 
{code}



> Parsing of 4 mb excel file generates 164 mb worth of words
> ----------------------------------------------------------
>
>                 Key: TIKA-2874
>                 URL: https://issues.apache.org/jira/browse/TIKA-2874
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>            Reporter: Sébastien Nussbaumer
>            Priority: Major
>         Attachments: excel_that_generates_huge_number_of_words.xlsx, tika-config.xml
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
