nutch-dev mailing list archives

From "behnam nikbakht (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1199) unfetched URLs problem
Date Tue, 08 Nov 2011 08:36:51 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146163#comment-13146163 ]

behnam nikbakht commented on NUTCH-1199:
----------------------------------------

The problem is a huge number of unfetched URLs. For example, we have only 2000 fetched URLs
from a site with 40000 URLs.
With the generate command we cannot regenerate them and assign segments to them, so we use
the freegen command, which creates segments for the unfetched URLs; we then fetch them and
update the crawldb.
Is this a good or a bad solution?
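
To see how many URLs are still unfetched before choosing a workaround, the CSV dump from readdb can be filtered on the status-name column instead of on the whole line. This is only a minimal sketch: it assumes the dump quotes the URL as the first field and the status name as the second quoted field. The sample rows are made up for illustration, and `list_unfetched` is a hypothetical helper name, not part of Nutch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the URL of every row whose status name is db_unfetched.
# Splitting on '"' makes $2 the URL and $4 the status name, whatever the
# column separator is. Assumes URLs contain no double quotes.
list_unfetched() {
    awk -F '"' '$4 == "db_unfetched" { print $2 }' "$1"
}

# Made-up sample rows standing in for a readdb CSV dump.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"http://example.com/a",1,"db_unfetched"
"http://example.com/b",2,"db_fetched"
"http://example.com/db_unfetched.html",2,"db_fetched"
"http://example.com/c",1,"db_unfetched"
EOF

unfetched=$(list_unfetched "$sample")
# Count non-empty lines so that an empty result is reported as 0.
unfetched_count=$(printf '%s\n' "$unfetched" | grep -c .)
echo "$unfetched_count unfetched"
rm -f "$sample"
```

Matching on the status field avoids counting a fetched page whose URL happens to contain the text db_unfetched, which a plain grep over the whole line would include.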
                
> unfetched URLs problem
> ----------------------
>
>                 Key: NUTCH-1199
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1199
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, generator
>            Reporter: behnam nikbakht
>            Priority: Critical
>              Labels: db_unfetched, fetch, freegen, generate, unfetched, updatedb
>
> We wrote a script to fetch the unfetched URLs:
> # First, dump the crawldb with readdb to a text file and extract the unfetched URLs to a text file:
>         bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
>         cat $SITE_DIR/tmp/dump_urls.txt/part-00000 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
>         unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
>         cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' >  $unfetched_urls_file
>         unfetched_count=`cat $unfetched_urls_file|wc -l`
> # Next, we have a list of unfetched URLs in unfetched_urls.txt. We use the freegen command
> # to create segments for these URLs; we cannot use the generate command because these
> # URLs were generated previously.
>        if [[ $unfetched_count -lt $it_size ]]
>        then
>                         echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
>                         ((J++))
>                         bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
>                         s2=`ls -d $crawlseg/2* | tail -1`
>                         bin/nutch fetch $s2
>                         bin/nutch parse $s2
>                         bin/nutch updatedb $crawldb $s2
>                         echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
>                         get_new_links
>                         exit
>        fi
> # If the number of URLs is it_size or more, split them into batches of it_size
>         ij=1
>         while read line
>         do
>                 let "ind = $ij / $it_size"
>                 mkdir -p $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
>                 echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
>                 echo $ind
>                 ((ij++))
>                 let "completed=$ij % $it_size"
>                if [[ $completed -eq 0 ]]
>                then
>                         echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
>                         ((J++))
>                         bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
> # Finally, fetch, parse, and update the new segment
>                         s2=`ls -d $crawlseg/2* | tail -1`
>                         bin/nutch fetch $s2
>                         bin/nutch parse $s2
>                         rm $crawldb/.locked
>                         bin/nutch updatedb $crawldb $s2
>                         echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
>                fi
>         done <$unfetched_urls_file
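
The batching step in the quoted script could also be sketched with `split`, which writes fixed-size chunks and keeps the final partial chunk, so the URLs left over after the last full batch are not skipped. This is a minimal sketch under stated assumptions, not the reporter's method: the 25-URL list and the `batch_` prefix are made up, and the Nutch calls appear only as comments because they need a real crawldb and segments directory.

```shell
#!/usr/bin/env bash
it_size=10                      # batch size, same role as it_size in the quoted script
workdir=$(mktemp -d)

# Made-up list of 25 URLs standing in for unfetched_urls.txt.
for i in $(seq 1 25); do
    echo "http://example.com/page$i"
done > "$workdir/unfetched_urls.txt"

# Write batches of at most it_size lines: batch_aa, batch_ab, ...
split -l "$it_size" "$workdir/unfetched_urls.txt" "$workdir/batch_"

batch_count=0
for batch in "$workdir"/batch_*; do
    batch_count=$((batch_count + 1))
    lines=$(grep -c '' "$batch")
    echo "batch $batch_count: $lines URLs"
    # With a real crawl, each batch would go through the same cycle as in the
    # quoted script, for example:
    #   bin/nutch freegen "$batch" "$crawlseg"
    #   ...then fetch, parse, and updatedb on the newest segment...
done

rm -rf "$workdir"
```

With 25 URLs and it_size=10 this yields three batches, the last one holding the remaining 5 URLs.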


        
