nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2183) Improvement to SegmentChecker for skipping non-segments present in segments directory
Date Thu, 10 Dec 2015 00:14:11 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049698#comment-15049698 ]

Lewis John McGibbney commented on NUTCH-2183:
---------------------------------------------

I would like to commit this today if possible, as it is working well for me on trunk.
Thanks in advance to anyone who can review.

> Improvement to SegmentChecker for skipping non-segments present in segments directory
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2183
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2183
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, segment
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2183.patch
>
>
> The scenario is that you have a bunch of Nutch data which has been gathered over some
> period of time. Some of the data structures are present, some are not. In the segments
> directory, for example, there are .zip files (don't ask why), and in other directories
> there are .tar.gz files, etc.
> This patch improves the SegmentChecker to skip directories or files present within the
> segments directory whose names are not 14 characters in length, as ALL segment names are.
> It also applies this check to individual segments when used by the IndexingJob. This means
> that we can prevent the Indexer from blowing up if it is run on a single segment (e.g.
> without the -dir option) and encounters some arbitrary directory within segments/ which
> turns out not to be a segment after all.
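
The length check described above could be sketched roughly as follows. This is only an illustration and not the code from NUTCH-2183.patch: the class name SegmentNameFilter and its methods are hypothetical, it uses java.nio.file rather than the Hadoop FileSystem API that Nutch works with, and the extra requirement that an entry be a directory is an assumption that goes beyond the description.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Nutch or of the attached patch. */
final class SegmentNameFilter {

  /** Name length that, per the issue description, all segments share. */
  static final int SEGMENT_NAME_LENGTH = 14;

  private SegmentNameFilter() {}

  /** Returns true if the name has the length expected of a segment name. */
  static boolean looksLikeSegment(String name) {
    return name != null && name.length() == SEGMENT_NAME_LENGTH;
  }

  /**
   * Lists entries of segmentsDir that pass the length check.
   * Assumption: only directories are kept, so stray archives are skipped
   * even if their file names happen to be 14 characters long.
   */
  static List<Path> candidateSegments(Path segmentsDir) throws IOException {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(segmentsDir)) {
      for (Path entry : entries) {
        String name = entry.getFileName().toString();
        if (Files.isDirectory(entry) && looksLikeSegment(name)) {
          result.add(entry);
        }
      }
    }
    return result;
  }
}
```

With a check like this, an entry such as backup.zip or old-data.tar.gz under segments/ would be skipped instead of being handed to the indexer. A length test alone is a coarse filter, so any other 14-character directory name would still pass.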



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
