nutch-dev mailing list archives

From Gaurang Patel
Subject Content(source code) of web pages crawled by nutch
Date Tue, 12 May 2009 03:20:34 GMT
Hi all,

Can anyone help me with this problem?

Here is my problem:

I want to get the source code of the pages returned as hits from a Nutch
crawl. I am not sure whether Nutch stores the content of a web page (i.e., the
actual source code of the page) in the crawl results. I am afraid it does not.

If Nutch does store this content, do you have any idea how I can retrieve it
using the Nutch libraries? I have my eye on these classes: NutchBean, Hit,
HitDetails. Maybe one of them has a method that returns the contents of a
page. I am losing hope in these classes, though, as I have not found a method
that returns the content of a web page.

Any kind of help is appreciated.

