nutch-dev mailing list archives

From Winton Davies <wdav...@cs.stanford.edu>
Subject some technical advice
Date Fri, 27 Jun 2008 01:24:55 GMT
Hi,

For a project I had to build an index of a copy of the 3 million static
Wikipedia pages from 2007, and it indexed out of the box just by following
the tutorial, so kudos.

However, I'm trying to speed up query performance, and the easiest
solution I can think of is to mmap the index file, though I have no idea
how to do this. Does anyone have an idea? Is there some other parameter I
can tweak to load or cache the index, or some kind of index primer that
will pre-cache the indexes? Currently it's about 300 ms per query (on a
really high-performance Fedora box with 8 GB of RAM in the Amazon compute
cloud), and the index is less than 5 GB.
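
For what it's worth, the kind of thing I had in mind is sketched below.
I'm assuming the Lucene bundled with Nutch honours the
org.apache.lucene.FSDirectory.class system property and ships
org.apache.lucene.store.MMapDirectory -- I haven't verified either, and
the "crawl/index" path is just a placeholder -- so treat this as a guess
rather than a recipe:

    import java.io.File;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MmapProbe {
      public static void main(String[] args) throws Exception {
        // Ask FSDirectory.getDirectory() to hand back an MMapDirectory
        // instead of the default FSDirectory implementation.
        System.setProperty("org.apache.lucene.FSDirectory.class",
                           "org.apache.lucene.store.MMapDirectory");
        // Path to the local Nutch index -- adjust to wherever yours lives.
        Directory dir = FSDirectory.getDirectory(new File("crawl/index"));
        IndexSearcher searcher = new IndexSearcher(dir);
        System.out.println(searcher.maxDoc() + " docs via "
                           + dir.getClass().getName());
        searcher.close();
      }
    }

If that works, presumably the same
-Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.MMapDirectory
flag could be passed to whatever JVM runs the search webapp, but again
that's speculation on my part.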

The other question I have is regarding anchor text and link analysis.
The site is just a directory hierarchy, and I crawled it using 'file:///'
URLs. Do I need to do an http:// crawl to get anchor text to work, or can
I just run a partial rebuild on the segments? Does 0.9 have an
approximation of PageRank, and if so, does it work on file: URLs that
share the same host?
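
In case it clarifies the question, what I mean by a "partial rebuild" is
something along these lines, assuming the stock 0.9 tools and the
directory layout from the tutorial (the crawl/ paths are just placeholders
from my setup):

    # Re-run link inversion over the existing segments (no re-fetch),
    # then rebuild the index so anchor text from the linkdb is picked up.
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*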

Sorry to bug you guys, but I can't find anything on the wiki that's
really helpful, nor can anyone on the Nutch user list supply an answer to
these two questions.


Cheers,
  Winton
