nutch-dev mailing list archives

From "Paul Baclace (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
Date Fri, 06 Jan 2006 20:00:16 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ] 

Paul Baclace commented on NUTCH-153:
------------------------------------


> mime.type.magic?

The particular run that had problems was using mime.type.magic=true.  It turns out that the
magic string "%!PS-Adobe" was preceded by some spaces, so it was not recognized.

The intent of this bug report is that no matter why some content is passed to TextParser, there
should not be parasitic cases that take too long to process.  (Parsing one file for hours is
equivalent to being fatal.)  There are per-file space limits on parsing (the first N bytes), but
the only time limit is at the Task level (an hour of inactivity), and that one is fatal on the
third (default) attempt.

It makes sense to have non-fatal per-file time limits on parsers when regular expressions
(OutlinkExtractor) are used, since some regular expressions are prone to parasitic cases that
take a long time instead of blowing up the stack.
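
Something along these lines could work as a sketch (the names here are placeholders, not the
actual Nutch parse plugin API): run the extraction in a worker thread and give up on that one
file if it does not finish in time, without failing the whole task.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: a non-fatal per-file time limit around the text/outlink extraction.
// Placeholder names; not the actual Nutch parser code.
public class TimedParse {

  private static final ExecutorService POOL = Executors.newSingleThreadExecutor();

  public static String parseWithTimeout(final String text, long timeoutSeconds)
      throws Exception {
    Future<String> future = POOL.submit(new Callable<String>() {
      public String call() {
        return extractOutlinks(text);  // stands in for the regexp-heavy work
      }
    });
    try {
      return future.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // Non-fatal: give up on this file and move on instead of killing the task.
      // Note that cancel(true) only interrupts the thread; a plain java.util.regex
      // match will not notice the interrupt unless the CharSequence it runs over
      // checks for it.
      future.cancel(true);
      return "";
    }
  }

  private static String extractOutlinks(String text) {
    // placeholder for the OutlinkExtractor-style regular expression work
    return text;
  }
}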

> strings command line like parser [filter]

This is a related and good idea, but a different beast.  The idea is to improve recall by
grabbing marginal shreds of tokens out of files with unknown formats.  For this to be effective
and not annoying, it needs a threshold for the minimal percentage of content found, or minimal
density, before accepting any tokens from a particular file, in order to reject binary files
that just happen to hit upon reasonable-looking strings.

(Reasonableness depends on charset/language, as pointed out by KuroSaka TeruHiko, but minimal
ASCII, a.k.a. romaji, would be the most effective worldwide.)

It should also have a way to set the weight of the tokens found that takes into account the
density of reasonable tokens.  That is, a similarly sized f.txt would rank higher than a
mystery-format f.huh with the same number of token matches plus 70% binary.
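
As a rough sketch of both points (hypothetical names and thresholds, not an existing Nutch
filter): pull out runs of printable ASCII the way strings(1) does, refuse the file entirely if
the printable density is too low, and otherwise use that density as the weight.

import java.util.ArrayList;
import java.util.List;

// Sketch: strings(1)-like token extraction with a density threshold and a
// density-based weight. Hypothetical class; thresholds chosen arbitrarily.
public class StringsLikeExtractor {

  private static final int MIN_RUN = 4;           // ignore very short printable runs
  private static final double MIN_DENSITY = 0.30; // reject mostly-binary files outright

  public static class Result {
    public final List<String> tokens;
    public final float weight;                    // density-based score in [0, 1]
    Result(List<String> tokens, float weight) {
      this.tokens = tokens;
      this.weight = weight;
    }
  }

  public static Result extract(byte[] content) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder run = new StringBuilder();
    int printable = 0;
    for (int i = 0; i < content.length; i++) {
      char c = (char) (content[i] & 0xff);
      if (c >= 0x20 && c < 0x7f) {                // printable ASCII
        printable++;
        run.append(c);
      } else {
        if (run.length() >= MIN_RUN) tokens.add(run.toString());
        run.setLength(0);
      }
    }
    if (run.length() >= MIN_RUN) tokens.add(run.toString());

    double density = content.length == 0 ? 0.0 : (double) printable / content.length;
    if (density < MIN_DENSITY) {
      // Too much binary: accept no tokens, so random byte runs do not pollute the index.
      return new Result(new ArrayList<String>(), 0.0f);
    }
    // Weight by density: a plain f.txt outranks a 70%-binary f.huh with the same matches.
    return new Result(tokens, (float) density);
  }
}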



> TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
> ---------------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-153
>          URL: http://issues.apache.org/jira/browse/NUTCH-153
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: all
>     Reporter: Paul Baclace
>  Attachments: TextParser.java.patch
>
> If TextParser is given postscript, it can take hours and then fail.  This can be avoided with careful configuration, but if the server MIME type is wrong and the basename of the URL has no "file extension", then this parser will take a long time and fail every time.
> Analysis: The real problem is OutlinkExtractor.java as reported with bug NUTCH-150, but the problem cannot be entirely addressed with that patch, since the first call to the regular expression match() can take a long time despite quantifier limits.
> Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of the file.
> Actual experience has shown that for safety and fail-safe reasons, it is worth protecting against GIGO directly in TextParser for this case, even though the suggested fix is not a general solution.  (A general solution would be a timeout on match().)
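
(For reference, a minimal sketch of the suggested guard; this is an illustration, not the
attached TextParser.java.patch.)

// Sketch: reject content whose first 40 characters contain the PostScript magic,
// before handing it to the text/outlink extraction. Hypothetical class name.
public class PostscriptGuard {

  private static final String PS_MAGIC = "%!PS-Adobe";
  private static final int SNIFF_LENGTH = 40;

  public static boolean isPostscript(byte[] content) {
    int len = Math.min(content.length, SNIFF_LENGTH);
    String head = new String(content, 0, len);
    return head.indexOf(PS_MAGIC) >= 0;
  }
}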


