nutch-dev mailing list archives

From "Asitang Mishra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
Date Mon, 06 Apr 2015 19:11:12 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481649#comment-14481649
] 

Asitang Mishra commented on NUTCH-1854:
---------------------------------------

What should the default behavior be when we run the crawl script (./bin/crawl) with
fetcher.parse set to true?
1. It parses once during the fetch and puts the parsed content in the segment db, then goes
ahead and re-parses during the parse cycle.
2. It parses once during the fetch and puts the parsed content in the segment db, then skips
the parse cycle and exits politely.

I have also tried a third approach, in which I parse during the fetch step but nothing is
written to the DB. (This basically solves my problem of developing a runtime UI graph.)

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger to the
> crawl script which would check for crawl_parse for a given segment and then skip parsing if
> this is present.
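
The trigger proposed in the description could look roughly like the sketch below. It is not the patch attached to this issue. It assumes the segment lives on the local filesystem (a distributed run would need a Hadoop filesystem check in place of `[ -d ... ]`), and the function names and the NUTCH_BIN variable are hypothetical helpers introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: skip the parse step when a segment already has crawl_parse.
# Assumption: segments are on the local filesystem.

# Succeeds (exit status 0) if the segment already contains parse output.
segment_is_parsed() {
  local segment="$1"
  [ -d "$segment/crawl_parse" ]
}

# Parses the segment unless crawl_parse is already present.
# NUTCH_BIN is a hypothetical override; it defaults to bin/nutch.
parse_if_needed() {
  local segment="$1"
  if segment_is_parsed "$segment"; then
    echo "Skipping parse: $segment already contains crawl_parse"
  else
    echo "Parsing $segment"
    "${NUTCH_BIN:-bin/nutch}" parse "$segment"
  fi
}
```

Inside the crawl script, the existing parse invocation would then be replaced by a call such as `parse_if_needed "$CRAWL_PATH/segments/$SEGMENT"`, assuming those are the variables that hold the crawl directory and the current segment name.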



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
