nutch-dev mailing list archives

From "lufeng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
Date Wed, 08 Apr 2015 14:27:12 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299 ]

lufeng commented on NUTCH-1854:
-------------------------------

If we set "fetcher.store.content=false" and "fetcher.parse=false", the "bin/nutch parse"
command throws an exception when it checks that the input content directory exists. I think
the reason we have this parameter is that sometimes we set "fetcher.parse" to true and do not
want to store the content, because of a slow disk or limited disk space. So I think we can
remove the "fetcher.store.content" parameter and, when "fetcher.parse=true", simply not
store the page content.

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger to the
> crawl script which would check for crawl_parse for a given segment and then skip parsing if
> this is present.
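
The check proposed in the issue description might look roughly like the sketch below. This is only an illustration, not the attached patch: the function names segment_already_parsed and parse_segment_if_needed are hypothetical, and the test assumes the segment lives on the local filesystem (on HDFS something like "hadoop fs -test -d" would be needed instead).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they are NOT part of the actual bin/crawl script.

# Succeeds (exit status 0) if the segment directory already contains
# crawl_parse, i.e. the fetcher has already parsed this segment.
# Assumption: local filesystem; on HDFS use `hadoop fs -test -d` instead.
segment_already_parsed() {
  local segment_dir="$1"
  [ -d "${segment_dir}/crawl_parse" ]
}

# Guard around the parse step: skip it when crawl_parse is present.
parse_segment_if_needed() {
  local segment_dir="$1"
  if segment_already_parsed "$segment_dir"; then
    echo "Skipping parse: ${segment_dir} already contains crawl_parse"
  else
    echo "Parsing segment ${segment_dir}"
    # The real script would run the parser at this point, e.g.:
    # bin/nutch parse "$segment_dir"
  fi
}
```

With a guard of this kind, the crawl loop would log why parsing was skipped rather than failing with "Segment already parsed!".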



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
