nutch-dev mailing list archives

From Mathijs Homminga <mathijs.hommi...@knowlogy.nl>
Subject Re: Internet crawl: CrawlDb getting big!
Date Wed, 07 May 2008 10:20:01 GMT
I might reconsider how important it is for us to always get the 
best-scoring urls.
Perhaps your situation applies to us as well.

The output of the segment (number of docs generated plus the number of 
outlinks) determines how much the crawldb changes after an updatedb. For 
us, this is far less than the size of the crawldb itself. But I'm not 
sure how we can profit from that.
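
A rough back-of-the-envelope (assuming on the order of 10 outlinks per 
fetched page, which is just a guess): a cycle of 8 million fetched docs 
touches roughly 8M + 80M ≈ 90M crawldb entries, well under 10% of our 
1.5 billion urls. So an updatedb that only rewrote the affected entries 
could, in principle, leave more than 90% of the CrawlDb untouched.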

Mathijs


wuqi wrote:
> ----- Original Message ----- 
> From: "Mathijs Homminga" <mathijs.homminga@knowlogy.nl>
> To: <nutch-dev@lucene.apache.org>
> Sent: Wednesday, May 07, 2008 5:21 PM
> Subject: Re: Internet crawl: CrawlDb getting big!
>
>
>   
>> wuqi wrote:
>>     
>>> I am also trying to improve the Generator efficiency. In the current
>>> Generator, all the URLs in the crawlDB are dumped out and sorted during
>>> the map process, and the reduce process then selects the top N pages (or
>>> at most maxPerHost pages) for you. If the number of pages in the crawlDB
>>> is much bigger than N, do all the pages need to be dumped out during the
>>> map process? We may only need to emit (2~3)*N pages during the map
>>> process, and the reduce then selects N pages from those (2~3)*N pages.
>>> This might improve the Generator efficiency.
>>>
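
As a very rough sketch of the map-side pre-filter described above (this is 
not the actual Generator code; the property name "generate.score.threshold", 
the key/value types and the threshold logic are made up for illustration, 
and a decreasing sort comparator on the score key is assumed to be set on 
the job):

    // Sketch: emit a url into the generate sort only when its score clears a
    // threshold chosen so that roughly (2~3)*N candidates survive the map phase.
    import java.io.IOException;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CandidatePrefilterMapper
        extends Mapper<Text, FloatWritable, FloatWritable, Text> {

      private float threshold;

      @Override
      protected void setup(Context context) {
        // Hypothetical property: a score cut-off estimated from the previous
        // cycle so that only about (2~3)*N urls pass.
        threshold = context.getConfiguration()
            .getFloat("generate.score.threshold", 0.0f);
      }

      @Override
      protected void map(Text url, FloatWritable score, Context context)
          throws IOException, InterruptedException {
        if (score.get() >= threshold) {
          // Key by score so the shuffle sorts only the surviving candidates;
          // the reducer then takes the top N (optionally capped per host).
          context.write(score, url);
        }
      }
    }

The reduce side would still do the final top-N selection, just over a much 
smaller, pre-filtered set.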
>> Yes, the generate process will be faster. But of course less accurate. 
>> And if you're working with generate.max.per.host, then it is likely that 
>> your segment will be less than topN in size.
>>     
>>> I think the crawlDB could be stored in two layers: the first layer is the
>>> host, the second layer is the page URL. This can improve efficiency when
>>> using max pages per host to generate the fetch list.
>>>
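
For illustration, if entries were keyed by host, the per-host cap could be 
applied in the reducer without sorting the whole db by score. A minimal 
sketch (the value format "score\turl" and the types are made up for 
illustration; generate.max.per.host is the existing Nutch property):

    // Sketch: keep at most generate.max.per.host best-scored urls per host.
    import java.io.IOException;
    import java.util.PriorityQueue;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PerHostSelector extends Reducer<Text, Text, Text, Text> {

      private int maxPerHost;

      @Override
      protected void setup(Context context) {
        maxPerHost = context.getConfiguration().getInt("generate.max.per.host", 100);
      }

      @Override
      protected void reduce(Text host, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        // Bounded min-heap on score keeps memory small even for huge hosts.
        PriorityQueue<String[]> best = new PriorityQueue<>(
            (String[] a, String[] b) ->
                Float.compare(Float.parseFloat(a[0]), Float.parseFloat(b[0])));
        for (Text v : values) {
          best.add(v.toString().split("\t", 2));   // value assumed to be "score\turl"
          if (best.size() > maxPerHost) {
            best.poll();                           // drop the lowest-scored entry
          }
        }
        for (String[] e : best) {
          context.write(host, new Text(e[1]));
        }
      }
    }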
>> My first thought is that such an approach makes it hard to select the 
>> best scoring urls.
>>     
> In my understanding, the best-scoring urls might not be so important. For
> example, if you want 10 urls and I select 50 urls for you to choose the top
> 10 from, that is good enough for me.
>
>   
>> Perhaps we could design the process in such a way that some intermediate 
>> results, like the part of the crawldb which is sorted during generation 
>> (this contains all urls eligible for fetching), are saved and reused. Why 
>> sort everything again each time when you know only a fraction of the 
>> urls have been updated?
>>     
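
The idea, roughly: keep the score-sorted candidate list from the previous 
generate and merge in only the freshly sorted delta, instead of re-sorting 
the whole CrawlDb. A minimal sketch of such a merge, ignoring the bookkeeping 
needed to drop stale entries for urls that were updated (names and types are 
illustrative; in practice this would be a merge over sorted files on disk, 
not in-memory lists):

    // Sketch: merge two url lists that are both sorted by decreasing score.
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    final class ScoredUrl {
      final String url;
      final float score;
      ScoredUrl(String url, float score) { this.url = url; this.score = score; }
    }

    final class CandidateMerge {
      /** Both inputs are sorted by decreasing score; so is the result. */
      static List<ScoredUrl> merge(Iterator<ScoredUrl> previous, Iterator<ScoredUrl> delta) {
        List<ScoredUrl> out = new ArrayList<>();
        ScoredUrl a = previous.hasNext() ? previous.next() : null;
        ScoredUrl b = delta.hasNext() ? delta.next() : null;
        while (a != null || b != null) {
          if (b == null || (a != null && a.score >= b.score)) {
            out.add(a);
            a = previous.hasNext() ? previous.next() : null;
          } else {
            out.add(b);
            b = delta.hasNext() ? delta.next() : null;
          }
        }
        return out;
      }
    }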
> The crawlDB might change dramatically after you update it from a fetched
> segment, so a pre-sorted crawlDB might not be useful for the next generate.
>
>   
>> Mathijs
>>
>>     
>>> HBase can greatly improve the updatedb efficiency, because there is no
>>> need to dump all the URLs in the crawldb; it just needs to append a new
>>> column with DB_FETCHED for each URL that was fetched. The other benefit
>>> brought by HBase is that we can easily change the schema of the crawlDB,
>>> for example to add an IP address for each URL... I am not familiar with
>>> how HBase behaves under the interface, so selecting (generating) might be
>>> a problem...
>>>   
>>>
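
Roughly what that single-column update would look like through the HBase 
client API (shown with the present-day API purely for illustration, not the 
2008 one; the table, column family and qualifier names are made up):

    // Sketch: mark one url as fetched by writing a status column,
    // instead of rewriting the whole CrawlDb.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MarkFetched {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("crawldb"))) {
          Put put = new Put(Bytes.toBytes("http://example.com/page"));   // row key = url
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("status"),
                        Bytes.toBytes("DB_FETCHED"));
          // Adding another column (e.g. the IP address) needs no schema migration.
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("ip"),
                        Bytes.toBytes("192.0.2.1"));
          table.put(put);
        }
      }
    }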
>>> ----- Original Message ----- 
>>> From: "Mathijs Homminga" <mathijs.homminga@knowlogy.nl>
>>> To: <nutch-dev@lucene.apache.org>
>>> Sent: Wednesday, May 07, 2008 6:28 AM
>>> Subject: Internet crawl: CrawlDb getting big!
>>>
>>>
>>>   
>>>       
>>>> Hi all,
>>>>
>>>> The time needed to do a generate and an updatedb depends linearly on the 
>>>> size of the CrawlDb.
>>>> Our CrawlDb currently contains about 1.5 billion urls (some fetched, but 
>>>> most of them unfetched).
>>>> We are using Nutch 0.9 on a 15-node cluster. These are the times needed 
>>>> for these jobs:
>>>>
>>>> generate:    8-10 hours
>>>> updatedb:   8-10 hours
>>>>
>>>> Our fetch job takes about 30 hours, in which we fetch and parse about 8 
>>>> million docs (limited by our current bandwidth).
>>>> So, we spend about 40% of our time on CrawlDb administration.
>>>>
>>>> The first problem for us was that we didn't make the best use of our 
>>>> bandwidth (40% of the time no fetching). We solved this by designing a 
>>>> system which looks a bit like the FetchCycleOverlap 
>>>> (http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by Otis.
>>>>
>>>> Another problem is that as the CrawlDb grows, the admin time increases. 
>>>> One way to solve this is by increasing the topN each time so the ratio 
>>>> between admin jobs and the fetch job remains constant. However, we will 
>>>> end up with extremely long cycles and large segments. Some of this we 
>>>> solved by generating multiple segments in one generate job and only 
>>>> performing an updatedb when (almost) all of these segments are fetched.
>>>>
>>>> But still. The number of urls we select (generate), and the number of 
>>>> urls we update (updatedb) is very small compared to the size of the 
>>>> CrawlDb. We were wondering if there is a way such that we don't need to 
>>>> read in the whole CrawlDb each time.
>>>> How about putting the CrawlDb in HBase? Sorting (generate) might become 
>>>> a problem then...
>>>> Is this issue addressed in the Nutch2Architecture?
>>>>
>>>> I'm happily willing to spend some more time on this, so all ideas are 
>>>> welcome.
>>>>
>>>> Thanks,
>>>> Mathijs Homminga
>>>>
>>>> -- 
>>>> Knowlogy
>>>> Helperpark 290 C
>>>> 9723 ZA Groningen
>>>> The Netherlands
>>>> +31 (0)50 2103567
>>>> http://www.knowlogy.nl
>>>>
>>>> mathijs.homminga@knowlogy.nl
>>>> +31 (0)6 15312977
>>>>
>>>>     
>>>>
>>>>         
>> -- 
>> Knowlogy
>> Helperpark 290 C
>> 9723 ZA Groningen
>> +31 (0)50 2103567
>> http://www.knowlogy.nl
>>
>> mathijs.homminga@knowlogy.nl
>> +31 (0)6 15312977
>>
>>     

-- 
Knowlogy
Helperpark 290 C
9723 ZA Groningen
+31 (0)50 2103567
http://www.knowlogy.nl

mathijs.homminga@knowlogy.nl
+31 (0)6 15312977


