nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
Date Mon, 22 Sep 2008 15:02:44 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633281#action_12633281 ]

Andrzej Bialecki  commented on NUTCH-153:
-----------------------------------------

Timeout support has been added to OutlinkExtractor. It's difficult to set a single limit
on the maximum processing time, because for some formats processing can legitimately take
a long time.
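
For illustration only, here is a minimal sketch of one common way to put a time limit on a
java.util.regex match: wrap the input in a CharSequence that checks a deadline on every
character access. This is not the actual OutlinkExtractor change; the names TimedRegex,
DeadlineCharSequence, MatchTimeoutException and findWithTimeout are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimedRegex {

    /** Hypothetical exception signalling that a match ran past its deadline. */
    static class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String message) {
            super(message);
        }
    }

    /** Wraps the input and aborts once the deadline has passed. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt(), so a
            // runaway backtracking search ends up here again and again.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new MatchTimeoutException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() and throws MatchTimeoutException if the time budget is exceeded. */
    static boolean findWithTimeout(Pattern pattern, CharSequence text, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(text, deadline));
        return matcher.find();
    }
}
```

A caller would catch MatchTimeoutException and treat the document as having no outlinks.
The time budget itself remains a judgment call, for the reason given above.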

Additionally, Tika should handle MIME-type detection better than the old Nutch code that it
replaced.

> TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-153
>                 URL: https://issues.apache.org/jira/browse/NUTCH-153
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: all
>            Reporter: Paul Baclace
>             Fix For: 1.0.0
>
>         Attachments: TextParser.java.patch
>
>
> If TextParser is given postscript, it can take hours and then fail.  This can be avoided
> with careful configuration, but if the server MIME type is wrong and the basename of the URL
> has no "file extension", then this parser will take a long time and fail every time.
> Analysis: The real problem is OutlinkExtractor.java, as reported in bug NUTCH-150, but
> the problem cannot be entirely addressed with that patch, since the first call to the
> regular expression match() can take a long time despite quantifier limits.
> Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of the file.
> Actual experience has shown that, for safety and fail-safe reasons, it is worth protecting
> against GIGO directly in TextParser for this case, even though the suggested fix is not a
> general solution.  (A general solution would be a timeout on match().)
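
The suggested fix above could look roughly like the following sketch. It is not the
attached TextParser.java.patch; the class and method names (PostScriptGuard,
looksLikePostScript) are hypothetical, and it assumes the marker must lie entirely within
the first 40 bytes of the content.

```java
import java.nio.charset.StandardCharsets;

public class PostScriptGuard {

    // Marker and window size taken from the suggested fix in the issue text.
    private static final String PS_MARKER = "%!PS-Adobe";
    private static final int SCAN_LIMIT = 40;

    /**
     * Returns true if the PostScript marker appears within the first
     * SCAN_LIMIT bytes of the content (assumption: the whole marker must
     * fit inside that window).
     */
    static boolean looksLikePostScript(byte[] content) {
        if (content == null) {
            return false;
        }
        int length = Math.min(content.length, SCAN_LIMIT);
        // ISO-8859-1 maps each byte to exactly one char, so decoding
        // cannot fail on binary input.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MARKER);
    }
}
```

A plain-text parser could call looksLikePostScript(...) on the raw content before running
any regular expression and reject the document early when it returns true.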

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

