nutch-dev mailing list archives

From Dima Mazmanov <nut...@proservice.ge>
Subject Re[4]: nutch-default.xml configuration
Date Mon, 12 Jun 2006 16:11:52 GMT
Hi, Lourival.

OK. After the first indexing you must merge the segments,
and if you want to reindex your db, you have to delete the segments which
are older than a predefined age, in your case 30 days.
This is my solution; if someone has a better one, please share your
experience!
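
The "delete old segments" step could be sketched roughly as below. This is
only an illustration, not something taken from Nutch itself: the function
names list_old_segments and prune_old_segments are made up for this example,
and it assumes that a segment directory's modification time is a fair proxy
for the segment's age. It also relies on the -mindepth/-maxdepth options of
find, which GNU and BSD find provide.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of Nutch.
# Assumption: the mtime of each segment directory reflects its age.

# list_old_segments SEGMENTS_DIR [MAX_AGE_DAYS]
# Print segment directories whose mtime is more than MAX_AGE_DAYS days old.
list_old_segments() {
  local segments_dir=$1
  local max_age_days=${2:-30}
  find "$segments_dir" -mindepth 1 -maxdepth 1 -type d \
    -mtime +"$max_age_days"
}

# prune_old_segments SEGMENTS_DIR [MAX_AGE_DAYS]
# Delete the same directories that list_old_segments would print.
prune_old_segments() {
  local segments_dir=$1
  local max_age_days=${2:-30}
  find "$segments_dir" -mindepth 1 -maxdepth 1 -type d \
    -mtime +"$max_age_days" -exec rm -r -- {} +
}
```

It is probably safest to run list_old_segments on the segments directory
first and check the output before calling prune_old_segments with the same
arguments, since the deletion cannot be undone.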


> Let me explain the problem. I have this shell script:

> #!/bin/bash
> # A simple script to run a Nutch re-crawl
> if [ -n "$1" ]
> then
>   crawl_dir=$1
> else
>   echo "Usage: recrawl crawl_dir [depth] [adddays]"
>   exit 1
> fi

> if [ -n "$2" ]
> then
>   depth=$2
> else
>   depth=5
> fi

> if [ -n "$3" ]
> then
>   adddays=$3
> else
>   adddays=0
> fi

> webdb_dir=$crawl_dir/db
> segments_dir=$crawl_dir/segments
> index_dir=$crawl_dir/index

> # The generate/fetch/update cycle
> for ((i=1; i <= depth ; i++))
> do
>   bin/nutch generate "$webdb_dir" "$segments_dir" -adddays "$adddays"
>   segment=$(ls -d "$segments_dir"/* | tail -n 1)
>   bin/nutch fetch "$segment"
>   bin/nutch updatedb "$webdb_dir" "$segment"
> done

> # Update segments
> mkdir tmp
> bin/nutch updatesegs $webdb_dir $segments_dir tmp
> rm -R tmp

> # Index segments
> for segment in $(ls -d "$segments_dir"/* | tail -n "$depth")
> do
>   bin/nutch index "$segment"
> done

> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup $segments_dir bogus

> # Merge indexes
> ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

> I got it from this web site:
> <http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html>
> I want to update a web page that was crawled with N links and now has
> M links, where M >> N or M < N. This is a simple example, with a small set
> of files linked from this page, but in a production environment it is very
> important.

> I hope I am being clear. I'm Brazilian and I'm improving my English :).

> Again, Thanks a lot!

> On 6/12/06, Dima Mazmanov <nuther@proservice.ge> wrote:
>>
>> Hi,Lourival.
>>
>> What kind of shell script do you have?
>> You wrote on June 12, 2006, 19:51:06:
>>
>> > OK. So, do you have any solution to do this job automatically? I have a
>> > shell script, but I can't tell yet whether it really works.
>>
>> > Sorry if I'm being redundant. I'm learning about this tool and I have a
>> > lot of questions :).
>>
>> > Thanks!
>>
>> > On 6/12/06, Dima Mazmanov <nuther@proservice.ge> wrote:
>> >>
>> >> Hi,Lourival.
>> >>
>> >>
>> >> You wrote on June 12, 2006, 19:33:15:
>> >>
>> >> > Hi all!
>> >>
>> >> > I have a question about the nutch-default.xml configuration file.
>> >> > There is a parameter, db.default.fetch.interval, that is set to 30 by
>> >> > default. It means that pages from the webdb are recrawled every 30 days:
>> >> > <http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html>
>> >> > I want to know whether "recrawled" here means an automatic recrawl, or
>> >> > whether I have to execute some shell script before this period to make
>> >> > updates to my WebDB possible.
>> >>
>> >> > I really want to know this because so far I have not actually obtained
>> >> > an update.
>> >>
>> >> > Thanks a lot!
>> >>
>> >>
>> >> You have to recrawl db manually.
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Dima                          mailto:nuther@proservice.ge
>> >>
>> >>
>>
>>
>>
>>
>>
>> --
>> Regards,
>> Dima                          mailto:nuther@proservice.ge
>>
>>





-- 
Regards,
 Dima                          mailto:nuther@proservice.ge

