tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser
Date Thu, 26 Jun 2014 16:53:26 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1300:

    Attachment: tika_1_6_ClassicsVsNonSeq.zip

The attached archive shows the results of running Tika 1.6 trunk with PDFBox 1.8.6 on a random
selection of 10,000 govdocs1 PDFs.  We used the default setting (do not extract images).

On one run, we used the default classic parser, and on the other we used the new (and future
classic) NonSequential Parser (NSP).

The two parsers had 11 exceptions in common.  The NSP had 24 exceptions that the classic parser
did not have, and the classic parser had no exceptions that the NSP did not also have.
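
For anyone who wants to repeat this kind of comparison on another corpus, the shared/unique breakdown is just set arithmetic over the file names that threw. A minimal sketch, assuming you already have one list of failing files per run; the function name and the file IDs below are made up for illustration and are not taken from the attachment:

```python
def compare_exceptions(classic_failed, nsp_failed):
    """Split failing file IDs into shared, classic-only and NSP-only sets."""
    classic_failed = set(classic_failed)
    nsp_failed = set(nsp_failed)
    return {
        "shared": classic_failed & nsp_failed,
        "classic_only": classic_failed - nsp_failed,
        "nsp_only": nsp_failed - classic_failed,
    }


# Hypothetical file IDs, for illustration only.
classic = ["000001.pdf", "000002.pdf"]
nsp = ["000001.pdf", "000002.pdf", "000003.pdf"]

result = compare_exceptions(classic, nsp)
for name, files in result.items():
    print(name, sorted(files))
```

An empty `classic_only` set corresponds to the situation reported above, where every file that failed with the classic parser also failed with the NSP.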

The contents of the extracted text (at least by unigram token counts), the number of attachments
and the number of metadata features were nearly identical.  Only two files differed in their
token counts, and only very slightly.
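
A rough sketch of what a unigram token count comparison can look like, in case it helps with reproducing the check. The tokenizer here (lowercase, runs of word characters) is an assumption and may not match the one used to produce the attached results; `unigram_counts` and `token_diff` are illustrative names:

```python
import re
from collections import Counter


def unigram_counts(text):
    """Lowercase the text and count runs of word characters as tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def token_diff(text_a, text_b):
    """Return tokens whose counts differ, mapped to (count_a, count_b)."""
    a = unigram_counts(text_a)
    b = unigram_counts(text_b)
    return {t: (a[t], b[t]) for t in a.keys() | b.keys() if a[t] != b[t]}


# Toy strings standing in for the text extracted by each parser.
classic_text = "Annual report 2009 annual summary"
nsp_text = "Annual report 2009 summary"

print(token_diff(classic_text, nsp_text))
```

An empty result means the two extractions agree at the unigram level, although it says nothing about token order.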

The difference in speed was not operationally noticeable:
median per file: 96 millis for classic
median per file: 93 millis for NSP
average per file: 264 millis for classic
average per file: 269 millis for NSP
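
The averages being nearly three times the medians suggests that a small number of slow files dominate the mean, which is why both statistics are worth reporting. A small sketch of that effect using Python's `statistics` module; the timings are invented for illustration and are not from the attachment:

```python
from statistics import mean, median


def summarize(millis):
    """Return the median and mean of per-file parse times in milliseconds."""
    return {"median": median(millis), "mean": mean(millis)}


# Invented timings: four typical files and one very slow outlier.
timings = [80, 90, 100, 110, 2000]

summary = summarize(timings)
print(summary)
```

With a single outlier the mean moves far from the median, so a median-only or mean-only comparison between parsers could be misleading on a skewed corpus.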

Given that there were more exceptions with the NSP (admittedly a very small number), I'm hesitant
to change the default parser within Tika to NSP...unless there are benefits that I'm not taking
into consideration.

This corpus clearly has limitations.

Any thoughts or other benchmarks we should consider?

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
> On TIKA-1298, [~tilman] recommended switching Tika's default to the NonSequentialParser.
> We added a parameter to use the NonSequentialParser in TIKA-1201, and there's some good discussion
> there about the benefits.
> Is the community in favor of switching the default now?

This message was sent by Atlassian JIRA