nutch-dev mailing list archives

From "Alexis (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-965) Parsing takes up 100% CPU
Date Tue, 08 Feb 2011 17:47:02 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-965:
-------------------------

    Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the content buffer to
see if they match.

If this HTTP header is available and the sizes do not match, the file was truncated, so skip
the parsing step. This keeps the parser from getting stuck in an infinite loop that takes up
all the CPU resources.
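
The check described above can be sketched roughly as follows. This is not the code in parserJob.patch; the class and method names (TruncationCheck, declaredLength, isTruncated, skipMessage) are hypothetical and no Nutch classes are used. It also assumes that "truncated" means the declared length is larger than the number of bytes actually fetched, and that a missing or malformed header means the content is parsed as usual.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of Nutch: illustrates the Content-Length comparison only.
final class TruncationCheck {

    /** Parses a Content-Length header value; empty if missing, malformed or negative. */
    static OptionalLong declaredLength(String contentLengthHeader) {
        if (contentLengthHeader == null) {
            return OptionalLong.empty();
        }
        try {
            long n = Long.parseLong(contentLengthHeader.trim());
            return n >= 0 ? OptionalLong.of(n) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /**
     * Returns true only when the header is usable and declares more bytes than
     * were fetched. Assumption: without a usable header we cannot tell, so the
     * content is not treated as truncated.
     */
    static boolean isTruncated(String contentLengthHeader, int fetchedBytes) {
        OptionalLong declared = declaredLength(contentLengthHeader);
        return declared.isPresent() && declared.getAsLong() > fetchedBytes;
    }

    /** Builds a message in the same shape as the "After" log lines in this message. */
    static String skipMessage(String url, long declaredBytes, int fetchedBytes) {
        return url + " skipped. Content of size " + declaredBytes
                + " was truncated to " + fetchedBytes;
    }
}
```

With the values from the logs, TruncationCheck.isTruncated("4527822", 63980) is true, so a mapper following this sketch would log the skip message and not hand the content to the parser.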


Before, in the logs, we would see:

{noformat}2011-02-07 14:03:34,693 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat}

After:

{noformat}2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat}




> Parsing takes up 100% CPU
> -------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>         Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted data, for
> example when big binary files are truncated at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
