nutch-dev mailing list archives

From Francesco Cipriani <>
Subject Nutch indexes
Date Wed, 15 Jun 2005 16:41:46 GMT
Hi all,
I'm trying to understand how Nutch stores its indexes by reading the
source code, but it's not easy, so I'm asking for your help.
I saw that each segment is composed of some data structures, such as the
fetchlist entries, the parse_data etc, and they are handled by the
ArrayFile class.
ArrayFile inherits from MapFile and uses simple integers as keys, so
the index we find in each segment subdir is composed of pairs
<integer, position in the data file>
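
To make the pairing concrete, here is a minimal sketch of how I picture it. This is a hypothetical toy model held in memory, not Nutch's actual ArrayFile or MapFile code; the class name IntKeyedStoreSketch, its methods, and the length-prefixed record layout are all assumptions made for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, NOT Nutch's ArrayFile: it only mimics the
// <integer key, position in the data file> pairing described above.
public class IntKeyedStoreSketch {
    // Stands in for the data file: records are appended one after another.
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    // Stands in for the index file: index.get(key) is the byte offset of record `key`.
    private final List<Long> index = new ArrayList<>();

    // Appends a record and returns its integer key (0, 1, 2, ...).
    public int append(String record) {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        index.add((long) data.size());                        // remember where the record starts
        data.write(ByteBuffer.allocate(4).putInt(bytes.length).array(), 0, 4); // assumed length prefix
        data.write(bytes, 0, bytes.length);
        return index.size() - 1;
    }

    // Returns the byte offset stored in the index for the given key.
    public long positionOf(int key) {
        return index.get(key);
    }

    // Looks up the offset in the index, then reads the record at that offset.
    public String get(int key) {
        ByteBuffer buf = ByteBuffer.wrap(data.toByteArray());
        buf.position((int) positionOf(key));
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        IntKeyedStoreSketch store = new IntKeyedStoreSketch();
        int first = store.append("fetchlist entry A");
        int second = store.append("fetchlist entry B");
        System.out.println(first + " -> offset " + store.positionOf(first));
        System.out.println(second + " -> offset " + store.positionOf(second));
        System.out.println(store.get(second));
    }
}
```

Note that nothing in this sketch maps a URL to a key, which is exactly the part I cannot find.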
But where is an index like <url -> <segment, position inside segment> > ?
I see that each segment has an index dir. Is that a Lucene index? And
how is it related to the index in the "index" dir at root level
(the same level as the segment dir)?
Where does Nutch look to retrieve the content of a page, given
its URL?

