tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-354) ProfilingHandler should take a length-limiting parameter
Date Sun, 24 Jan 2010 16:54:17 GMT

    [ https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804259#action_12804259 ]

Ken Krugler commented on TIKA-354:
----------------------------------

I'm working on speeding up language identification, since it consumes close to 90% of the
document parsing time in my large web crawls.

I have some changes that make it about 2.2x faster on the test files, and a larger change
(data sampling) that should significantly reduce processing time for larger documents.
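
To make the data sampling idea concrete, here is a minimal sketch in Python of one possible
scheme: take a few evenly spaced blocks from a long text and profile only those. The message
does not say how the sampling is actually done, so the function name `sample_text`, the
`max_chars` and `blocks` parameters, and the even spacing are all assumptions for
illustration, not Tika's implementation.

```python
def sample_text(text, max_chars=8192, blocks=8):
    """Return a list of text blocks to profile (hypothetical helper, not Tika's API).

    Short texts are returned whole. Longer texts are reduced to `blocks`
    evenly spaced, non-overlapping slices totalling at most `max_chars`.
    """
    if blocks < 2:
        raise ValueError("blocks must be at least 2")
    if len(text) <= max_chars:
        return [text]
    block_len = max_chars // blocks
    # Spacing between block starts, chosen so the last block ends inside the text.
    stride = (len(text) - block_len) // (blocks - 1)
    return [text[i * stride:i * stride + block_len] for i in range(blocks)]
```

Returning the blocks as a list, rather than one joined string, keeps n-grams from being
formed across the seams between blocks.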

One problem is that the confidence threshold (used to decide certainty) needs to be lowered a
bit when text is sampled, at least for the unit tests to pass. But based on email conversations
with Ted Dunning, using an absolute threshold doesn't work very well in principle, and fails
badly for shorter documents. I've been looking at Ted's paper on a more sophisticated
approach, and will open a separate issue to track that.


> ProfilingHandler should take a length-limiting parameter
> --------------------------------------------------------
>
>                 Key: TIKA-354
>                 URL: https://issues.apache.org/jira/browse/TIKA-354
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.5
>            Reporter: Vivek Magotra
>            Assignee: Ken Krugler
>
> ProfilingHandler currently parses the entire document (thereby analyzing n-grams for the
> entire doc).
> ProfilingHandler should take a length-limiting parameter that allows a user to specify the
> amount of data that should get analyzed.
> In fact, by default that limit should be set to something like 8K.
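
The request above could be sketched roughly as follows. This is a Python illustration of the
idea only, not Tika's ProfilingHandler: the class name `LimitedProfiler`, its `characters`
method, the use of trigrams, and counting the limit in characters (the issue does not say
whether 8K means bytes or characters) are all assumptions.

```python
from collections import Counter

DEFAULT_LIMIT = 8 * 1024  # the "something like 8K" default; unit assumed to be characters


class LimitedProfiler:
    """Hypothetical n-gram profiler that stops counting after `limit` characters."""

    def __init__(self, limit=DEFAULT_LIMIT, n=3):
        self.limit = limit
        self.n = n
        self.counts = Counter()
        self.seen = 0      # characters consumed so far
        self._tail = ""    # last n-1 characters, so n-grams can span chunks

    def characters(self, chunk):
        """Feed one chunk of text; anything past the limit is ignored."""
        remaining = self.limit - self.seen
        if remaining <= 0:
            return
        chunk = chunk[:remaining]
        self.seen += len(chunk)
        text = self._tail + chunk
        for i in range(len(text) - self.n + 1):
            self.counts[text[i:i + self.n]] += 1
        self._tail = text[-(self.n - 1):] if self.n > 1 else ""
```

For example, with `limit=5`, feeding "abc" and then "defgh" counts only the trigrams of
"abcde"; everything after the fifth character is skipped.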

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.