nutch-dev mailing list archives

From "Asitang Mishra (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
Date Tue, 07 Apr 2015 01:38:12 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asitang Mishra updated NUTCH-1854:
----------------------------------
    Attachment: NUTCH-1854ver1.patch

Added patch NUTCH-1854ver1.patch. The patch changes the ParseSegment class so that the parse
step no longer tries to parse a segment that has already been parsed. The problem is therefore
not caught much later as an exception; instead, parsing is skipped early (before a parse job is
even created), with a message saying that parsing was skipped because the segment has already
been parsed.
Please share your views.

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger to the
> crawl script which would check for crawl_parse for a given segment and then skip parsing if
> this is present.
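
The crawl-script trigger suggested in the description could look roughly like the
sketch below. It is only an illustration, not the contents of the attached patch or
of the real bin/crawl script: the function names segment_already_parsed and
parse_segment_if_needed are hypothetical, and the check assumes the segment lives on
the local filesystem (on HDFS something like `hadoop fs -test -d` would be needed
instead of `[ -d ... ]`).

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds (exit status 0) when the given segment
# directory already contains a crawl_parse subdirectory.
# Assumption: the segment is on the local filesystem.
segment_already_parsed() {
  local segment="$1"
  [ -d "$segment/crawl_parse" ]
}

# Hypothetical wrapper around the parse step: skip it, with a clear
# message, when the segment has already been parsed.
parse_segment_if_needed() {
  local segment="$1"
  if segment_already_parsed "$segment"; then
    echo "Skipping parse: $segment already contains crawl_parse"
    return 0
  fi
  echo "Parsing segment: $segment"
  # A real crawl script would start the parse job here,
  # for example: bin/nutch parse "$segment"
}
```

With a guard of this kind, a parsing fetcher run would print the skip message
instead of ending in "Segment already parsed!".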



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
