nutch-dev mailing list archives

From ogjunk-nu...@yahoo.com
Subject Re: Internet crawl: CrawlDb getting big!
Date Wed, 07 May 2008 14:26:24 GMT
You don't have to update CrawlDb after every fetch cycle, so keeping the generated CrawlDatums
from one generate run might be useful.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: wuqi <chee.wu@gmail.com>
> To: nutch-dev@lucene.apache.org; mathijs.homminga@knowlogy.nl
> Sent: Wednesday, May 7, 2008 5:36:39 AM
> Subject: Re: Internet crawl: CrawlDb getting big!
> 
> 
> ----- Original Message ----- 
> From: "Mathijs Homminga" 
> To: 
> Sent: Wednesday, May 07, 2008 5:21 PM
> Subject: Re: Internet crawl: CrawlDb getting big!
> 
> 
> > wuqi wrote:
> >> I am also trying to improve the Generator's efficiency. In the current
> >> Generator, all the URLs in the CrawlDb are dumped out and ordered during
> >> the map phase, and the reduce phase then tries to find the top N pages,
> >> or up to the per-host maximum, for you. If the number of pages in the
> >> CrawlDb is much bigger than N, do all the pages need to be dumped out
> >> during the map phase? We may only need to emit (2~3)*N pages during the
> >> map phase and then have the reduce select N pages from those (2~3)*N
> >> pages. This might improve the Generator's efficiency.
> > Yes, the generate process will be faster. But of course less accurate. 
> > And if you're working with generate.max.per.host, then it is likely that 
> > your segment will be less than topN in size.
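
The trade-off discussed above can be illustrated with a small single-process sketch. It is not Nutch's Generator code: the function name `select_fetch_list` and the parameters `oversample` and `max_per_host` are hypothetical, and a real MapReduce job would have to approximate the candidate cut per mapper rather than globally. The sketch keeps only `oversample * top_n` candidates and then applies a per-host cap, which shows how the result can end up smaller than `top_n`.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def select_fetch_list(scored_urls, top_n, oversample=3, max_per_host=None):
    """Pick up to top_n URLs from an oversampled candidate pool.

    scored_urls is an iterable of (url, score) pairs. All names here are
    illustrative assumptions, not part of Nutch.
    """
    # "Map" side: keep only the oversample * top_n best-scoring candidates
    # instead of emitting every URL in the database.
    candidates = heapq.nlargest(
        oversample * top_n, scored_urls, key=lambda pair: pair[1]
    )

    # "Reduce" side: walk the candidates best-first and enforce the host cap.
    per_host = defaultdict(int)
    selected = []
    for url, _score in candidates:
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue  # this host already has its share
        per_host[host] += 1
        selected.append(url)
        if len(selected) == top_n:
            break
    return selected


# Ten URLs on one host plus a single lower-scoring URL on another host.
urls = [("http://a.example/%d" % i, 1.0 - i * 0.01) for i in range(10)]
urls.append(("http://b.example/0", 0.5))

# With a small candidate pool the capped host crowds out everything else,
# so fewer than top_n URLs are selected.
short_list = select_fetch_list(urls, top_n=3, oversample=2, max_per_host=2)

# A larger pool reaches the second host and fills the list.
full_list = select_fetch_list(urls, top_n=3, oversample=4, max_per_host=2)
```

Choosing the oversampling factor is the judgment call: a larger factor costs more sorting but makes a short segment less likely when a per-host limit is in force.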
> >> I think the CrawlDb could perhaps be stored in two layers: the first
> >> layer is the host, the second layer is the page URL. This can improve
> >> efficiency when using a maximum number of pages per host to generate
> >> the fetch list.
> >>  
> > My first thought is that such an approach makes it hard to select the 
> > best scoring urls.
> In my understanding, getting exactly the best-scoring URLs might not be so
> important. For example, if you want 10 URLs and I select 50 URLs from which
> you choose the top 10, that is enough for me.
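
A minimal in-memory sketch of the two-layer idea follows, assuming a plain dictionary keyed by host; the names `build_host_index` and `generate_from_host_index` are hypothetical and say nothing about how Nutch or HBase would store the data. It shows that the per-host limit can be applied within each host's entries alone, while picking the best-scoring URLs overall still needs a pass across hosts, which is the concern raised in the reply.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def build_host_index(scored_urls):
    """First layer: host. Second layer: that host's (score, url) entries."""
    index = defaultdict(list)
    for url, score in scored_urls:
        index[urlparse(url).netloc].append((score, url))
    return index


def generate_from_host_index(index, top_n, max_per_host):
    """Apply the host cap locally, then rank the survivors globally."""
    capped = []
    for entries in index.values():
        # Only this host's entries are inspected to enforce the cap.
        capped.extend(heapq.nlargest(max_per_host, entries))
    # Selecting the overall best still requires comparing across hosts.
    return [url for _score, url in heapq.nlargest(top_n, capped)]


index = build_host_index([
    ("http://a.example/1", 0.9),
    ("http://a.example/2", 0.8),
    ("http://a.example/3", 0.7),
    ("http://b.example/1", 0.6),
])
fetch_list = generate_from_host_index(index, top_n=3, max_per_host=2)
```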
> 
> > Perhaps we could design the process in such a way that some intermediate
> > results, like the part of the crawldb which is sorted during generation
> > (this contains all urls eligible for fetching), are saved and reused. Why
> > sort everything again each time when you know only a fraction of the
> > urls have been updated?
> The CrawlDb might change dramatically after you update it from a fetched
> segment, so a pre-sorted CrawlDb might not be useful for the next generate
> run.
> 
> > 
> > Mathijs
> > 
> >> HBase can greatly improve updatedb efficiency, because there is no need
> >> to dump all the URLs in the CrawlDb; it only needs to append a new column
> >> with DB_Fetched for each URL fetched. The other benefit of HBase is that
> >> we can easily change the schema of the CrawlDb, for example to add an IP
> >> address for each URL. I am not familiar with how HBase behaves underneath
> >> its interface, so selecting URLs out of it might be a problem...
> >>  
> >>
> >> ----- Original Message ----- 
> >> From: "Mathijs Homminga" 
> >> To: 
> >> Sent: Wednesday, May 07, 2008 6:28 AM
> >> Subject: Internet crawl: CrawlDb getting big!
> >>
> >>
> >>  
> >>> Hi all,
> >>>
> >>> The time needed to do a generate and an updatedb depends linearly on the
> >>> size of the CrawlDb.
> >>> Our CrawlDb currently contains about 1.5 billion urls (some fetched, but
> >>> most of them unfetched).
> >>> We are using Nutch 0.9 on a 15-node cluster. These are the times needed
> >>> for these jobs:
> >>>
> >>> generate:    8-10 hours
> >>> updatedb:   8-10 hours
> >>>
> >>> Our fetch job takes about 30 hours, in which we fetch and parse about 8
> >>> million docs (limited by our current bandwidth).
> >>> So, we spend about 40% of our time on CrawlDb administration.
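
The 40% figure can be checked with a few lines of arithmetic. The sketch below assumes the three jobs run strictly one after another within a cycle, which is how the message describes the situation before any overlap was introduced; the function name `admin_fraction` is only illustrative.

```python
def admin_fraction(generate_h, updatedb_h, fetch_h):
    """Share of one sequential cycle spent on generate + updatedb."""
    admin = generate_h + updatedb_h
    return admin / (admin + fetch_h)


# Lower and upper ends of the reported 8-10 hour ranges, with a 30 hour fetch.
low = admin_fraction(8, 8, 30)     # 16 / 46, roughly 0.35
high = admin_fraction(10, 10, 30)  # 20 / 50 = 0.40
```

So "about 40%" corresponds to the upper end of the reported job times, with the lower end nearer 35%.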
> >>>
> >>> The first problem for us was that we didn't make the best use of our 
> >>> bandwidth (40% of the time no fetching). We solved this by designing a 
> >>> system which looks a bit like the FetchCycleOverlap 
> >>> (http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by Otis.
> >>>
> >>> Another problem is that as the CrawlDb grows, the admin time increases.
> >>> One way to solve this is by increasing the topN each time so the ratio
> >>> between admin jobs and the fetch job remains constant. However, we will
> >>> end up with extremely long cycles and large segments. Some of this we
> >>> solved by generating multiple segments in one generate job and only
> >>> performing an updatedb when (almost) all of these segments are fetched.
> >>>
> >>> But still. The number of urls we select (generate), and the number of 
> >>> urls we update (updatedb) is very small compared to the size of the 
> >>> CrawlDb. We were wondering if there is a way such that we don't need to
> >>> read in the whole CrawlDb each time.
> >>> How about putting the CrawlDb in HBase? Sorting (generate) might become
> >>> a problem then...
> >>> Is this issue addressed in the Nutch2Architecture?
> >>>
> >>> I'm happily willing to spend some more time on this, so all ideas are 
> >>> welcome.
> >>>
> >>> Thanks,
> >>> Mathijs Homminga
> >>>
> >>> -- 
> >>> Knowlogy
> >>> Helperpark 290 C
> >>> 9723 ZA Groningen
> >>> The Netherlands
> >>> +31 (0)50 2103567
> >>> http://www.knowlogy.nl 
> >>>
> >>> mathijs.homminga@knowlogy.nl
> >>> +31 (0)6 15312977
> >>>
> >>>    
> >> >
> > 
> > -- 
> > Knowlogy
> > Helperpark 290 C
> > 9723 ZA Groningen
> > +31 (0)50 2103567
> > http://www.knowlogy.nl 
> > 
> > mathijs.homminga@knowlogy.nl
> > +31 (0)6 15312977
> > 
> >

