nutch-dev mailing list archives

From ".: Abishek :." <>
Subject Decoupling crawling and indexing
Date Thu, 10 Feb 2011 07:23:23 GMT
Hi all,

 I am looking for a way to decouple crawling from indexing instead of
tying them together. I am crawling some huge sites and cannot afford to
wait until the whole crawl is over before I can search the results. I am
working on some proofs of concept, so I can't wait long, and the target
sites cannot be replicated or faked. I know it's tough to do because of
link inversion, deduplication and so on.

 Is there a way I can at least crawl for a day or two and then run the
rest of the process (link inversion, deduplication and indexing), then
come back and resume the crawl from where it left off? A kind of
incremental process?

 Any suggestions or pointers on this would be of great help.

