nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete
Date Wed, 01 Apr 2015 22:42:53 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391647#comment-14391647 ]

Sebastian Nagel commented on NUTCH-1771:
----------------------------------------

Hi [~chongli], the patch looks clean and extensible, just great. Thanks! What about moving
the code to a new class in o.a.n.segments? It will be useful (in a more generic form) for
other tools as well. The log message in case of a skipped segment could be a warning.

Instead of deleting invalid segments, it's possible to ignore them. That's the case if bin/crawl
is repeatedly scheduled to run an incremental/continuous crawl. If some job fails, bin/crawl
exits. A potentially incomplete/corrupted segment is never looked at again, so there's no
problem for later runs of bin/crawl. That's because only CrawlDb (and LinkDb/WebGraph) are
used for persistence in this work-flow; content persists only in Solr/ElasticSearch. It would
even be possible to delete a segment immediately at the end of each cycle. If segments are
kept and used later (reparsed, reindexed, mined for data, etc.), it's necessary to delete
or skip invalid ones. And yes, a tool which automatically detects invalid segments would
definitely be useful!

Making tools more robust by ignoring some segments does no harm. It's the easier way: making
the work-flow detect and delete invalid segments is a bigger effort. Btw., updatedb and web
graph already silently skip segments not containing required subdirs. LinkDb/invertlinks exits
with an exception, same as IndexingJob. SegmentMerger is special: it performs only a partial
merge, excluding a subdir from all segments if this subdir is missing in a single segment.
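
The "skip segments not containing required subdirs" behaviour can be sketched roughly as
below. This is only an illustration, not the attached patch and not existing Nutch code: the
class name SegmentChecker and its methods are hypothetical, it uses java.nio.file on a local
directory instead of the Hadoop FileSystem API, and the list of subdirs required for indexing
(crawl_fetch, crawl_parse, parse_data, parse_text) is an assumption. It only checks that the
subdirs exist, so a segment with truncated or otherwise corrupted files inside them would
still pass.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: filters out segments lacking the subdirs needed for indexing. */
public class SegmentChecker {

  // Assumption: these are the segment subdirs the indexer reads.
  static final String[] REQUIRED_SUBDIRS = {
      "crawl_fetch", "crawl_parse", "parse_data", "parse_text"};

  /** Returns true if the segment directory contains every required subdir. */
  static boolean isIndexable(Path segment) {
    if (!Files.isDirectory(segment)) {
      return false;
    }
    for (String sub : REQUIRED_SUBDIRS) {
      if (!Files.isDirectory(segment.resolve(sub))) {
        return false;
      }
    }
    return true;
  }

  /** Lists the indexable segments below segmentsDir, warning about skipped ones. */
  static List<Path> indexableSegments(Path segmentsDir) throws IOException {
    List<Path> valid = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(segmentsDir)) {
      for (Path segment : stream) {
        if (isIndexable(segment)) {
          valid.add(segment);
        } else {
          // Skipped segments are reported as warnings rather than causing a failure.
          System.err.println("WARN: skipping invalid segment " + segment);
        }
      }
    }
    return valid;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: SegmentChecker <segments dir>");
      return;
    }
    for (Path segment : indexableSegments(Paths.get(args[0]))) {
      System.out.println(segment);
    }
  }
}
```

Run against a local segments directory, this prints the segments that pass the check and
warns on stderr about the rest. A generic version would take the list of required subdirs as
a parameter, since different tools need different parts of a segment.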


> Solrindex fails if a segment is corrupted or incomplete
> -------------------------------------------------------
>
>                 Key: NUTCH-1771
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1771
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.8, 1.10
>            Reporter: Diaa
>            Priority: Minor
>             Fix For: 1.11
>
>
> When using solrindex to index multiple segments via -dir segment,
> the indexing fails if one or more segments are corrupted/incomplete (generated but not fetched for example)
> The failure is simply java.io exception.
> Deleting the segment fixes the issue.
> The expected behavior should be one of the following:
> * skipping the segment and proceeding with others (while logging)
> * stopping the indexing and logging the failed segment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
