nutch-dev mailing list archives

From "Gabriele Kahlout (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory
Date Wed, 01 Jun 2011 20:10:51 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-1001:
------------------------------------

    Attachment: multipleSegs-fetch-parse.patch

This patch modifies Fetcher.java and ParseSegment.java so that, before fetching or parsing,
they check the segment name. If the name cannot be parsed as a long, it is expected to be a
segments directory, and its subdirectories are fetched/parsed (not recursively, to keep my
changes minimal).
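The check described above can be sketched roughly as follows. This is not the code from the
patch: the class and method names (SegmentResolver, isSegmentName, resolve) are hypothetical,
and java.nio.file is used only to keep the sketch self-contained, whereas Nutch itself works
through the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch or of the attached patch.
public class SegmentResolver {

    /** True if the name looks like a single segment, i.e. it parses as a long. */
    static boolean isSegmentName(String name) {
        try {
            Long.parseLong(name);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /**
     * Returns the path itself if its last component parses as a long;
     * otherwise treats it as a segments directory and returns its immediate
     * subdirectories (no recursion).
     */
    static List<Path> resolve(Path path) throws IOException {
        List<Path> segments = new ArrayList<>();
        if (isSegmentName(path.getFileName().toString())) {
            segments.add(path);
            return segments;
        }
        try (Stream<Path> children = Files.list(path)) {
            children.filter(Files::isDirectory).forEach(segments::add);
        }
        return segments;
    }
}
```

With this sketch, a path ending in 20110601201051 would be treated as a single segment, while
crawl/segments would be expanded into its subdirectories.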

I've added a -lastestOnly CLI option (better renamed to -lastOnly?) so that only the last
segment in a segments directory is parsed. "Last" is currently determined by array index
(which, thinking about it now, is probably arbitrary, but it could easily be extended to sort
by the long timestamps).
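Sorting by the long timestamps, as suggested above, could look something like the sketch
below. It assumes every segment name parses as a long (as the names produced by the generate
phase do); LatestSegment and latest are hypothetical names, not part of the patch.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Nutch or of the attached patch.
public class LatestSegment {

    /**
     * Picks the segment name with the largest numeric value.
     * Assumes every name parses as a long; returns empty for an empty list.
     */
    static Optional<String> latest(List<String> segmentNames) {
        return segmentNames.stream()
                .max(Comparator.comparingLong((String name) -> Long.parseLong(name)));
    }
}
```

Unlike taking the last array index, this does not depend on the order in which the file
system happens to list the directory.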

A fetch/parse returns 0 if at least one segment was successfully parsed/fetched, and -1 otherwise.
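The return-code rule above amounts to something like the following sketch. SegmentBatch and
runAll are hypothetical names, and the per-segment job is modeled as a simple predicate that
returns true on success; the actual patch may structure this differently.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Nutch or of the attached patch.
public class SegmentBatch {

    /**
     * Runs the job on every segment and returns 0 if at least one run
     * succeeded, -1 otherwise (including when there are no segments).
     */
    static <T> int runAll(List<T> segments, Predicate<T> job) {
        boolean anySucceeded = false;
        for (T segment : segments) {
            try {
                if (job.test(segment)) {
                    anySucceeded = true;
                }
            } catch (RuntimeException e) {
                // A failure in one segment does not stop the remaining ones.
                System.err.println("Segment failed: " + segment + " (" + e + ")");
            }
        }
        return anySucceeded ? 0 : -1;
    }
}
```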

I hope you will consider my suggestion. I've tested it with my own script and it worked
(miraculously, on the first try).



> bin/nutch fetch/parse handle crawl/segments directory
> -----------------------------------------------------
>
>                 Key: NUTCH-1001
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1001
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>         Attachments: multipleSegs-fetch-parse.patch
>
>
> I'm having issues porting scripts across different systems to support the step of extracting
> the latest/only segments resulting from the generate phase.
> Variants include:
> $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
> $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
> $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> And I'm not sure what Windows users would have to do. Some users may also make do with:
> bin/nutch fetch with crawl/segments/2*
> But I don't see a need for the user to extract or worry about the latest/only segment,
> or for this to be a described step in every Nutch tutorial. Moreover, only fetch and parse
> expect a segment, while other commands are fine with the directory of segments.
> Therefore, I think it's beneficial if fetch and parse also handle directories of segments.

> [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
> [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
