nutch-dev mailing list archives

From Fredrik Andersson <>
Subject Re: Website Visualization Questions
Date Mon, 11 Jul 2005 14:50:54 GMT

The crawler and the link-structure information come "free" with Nutch.
Once you have crawled a site, you can use the WebDBReader class to
extract the link information for further processing in a visualization
step. Simply put: iterate over the crawled pages with the SegmentReader
class (open the segment you just crawled), extract the URL from each page
(as an MD5Hash object), get the links to/from that URL with the
WebDBReader, and pass an appropriate structure to your visualization.

The structure that you suggested, with edges and nodes, would be very
easy to implement once you get the hang of the Reader classes for
accessing Nutch's guts.
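
As a rough sketch of that node/edge structure, something like the
following could collect the inner links of one host. Note that this is
not the Nutch API: LinkSource is a hypothetical stand-in for whatever
lookup you build on top of WebDBReader, and SiteGraph, build and isInner
are illustrative names only. URLs are plain strings here instead of
MD5Hash objects to keep the example self-contained.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a link lookup backed by the Nutch readers. */
interface LinkSource {
    /** Returns the URLs that the given page links to. */
    List<String> outlinks(String url);
}

/** Nodes and edges in the spirit of the Node{...}/Edge{...} format from the question. */
final class SiteGraph {
    /** URL -> node ID, in first-seen order. */
    final Map<String, Integer> nodeIds = new LinkedHashMap<>();
    /** Each edge is a pair {fromNodeId, toNodeId}. */
    final List<int[]> edges = new ArrayList<>();

    /** Returns the ID for a URL, assigning the next free ID on first sight. */
    int nodeId(String url) {
        Integer id = nodeIds.get(url);
        if (id == null) {
            id = nodeIds.size();
            nodeIds.put(url, id);
        }
        return id;
    }

    /** Builds the graph for one host, keeping only links that stay on that host. */
    static SiteGraph build(Iterable<String> crawledUrls, LinkSource links, String host) {
        SiteGraph graph = new SiteGraph();
        for (String url : crawledUrls) {
            if (!isInner(url, host)) {
                continue;
            }
            int from = graph.nodeId(url);
            for (String target : links.outlinks(url)) {
                if (isInner(target, host)) {
                    graph.edges.add(new int[] { from, graph.nodeId(target) });
                }
            }
        }
        return graph;
    }

    /** True if the URL parses and its host matches the given host. */
    static boolean isInner(String url, String host) {
        try {
            return host.equalsIgnoreCase(URI.create(url).getHost());
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: treat as not part of the site
        }
    }
}
```

Once the graph is built, writing out one Node{...} block per entry in
nodeIds and one Edge{...} block per entry in edges gives you a file close
to the format in your mail.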


On 7/11/05, Nils Hoeller <> wrote:
> Hi,
> I'm currently working on a "service" that lets
> you enter a URL and visualizes that domain
> (inner links only).
> There will then be some kind of adaptive behaviour,
> so that the graph adapts to your wishes
> (searches, ranks ...).
> I have a prototype that uses:
> 1. Arachnid as the crawler
> 2. Lucene as the indexer
> 3. TouchGraph for visualization.
> It works as a standalone client,
> though it seems to be slow when you
> enter a new URL for visualization
> (which is OK, because of the crawling and indexing).
> Now I'd like to change the application:
> Arachnid and Lucene should be replaced
> by Nutch.
> My wish is a service that:
> 1. Visualizes existing crawled and indexed sites
> 2. Lets you enter a new URL
> and works on it for you while you are online.
> So my questions:
> 1. Is it possible to do such things with Nutch?
> I mean: can I start a process that works through
> a list of URLs (crawling, indexing, and creating a file that
> represents the graph structure)
> while clients can enter URLs that are added to this to-do list?
> 2. I've read about the web database (including the full link graph).
> Where can I read more about it? Does it produce some kind of
> representation of the site for me automatically?
> I mean, I need (and built in the former application)
> something like:
> Node{
> ID=2144181430
> Title=Institute of Information Systems Universität zu Lübeck
> Schleswig-Holstein
> URL=
> Number of Request=0
> }
> Edge{
> Node1=2144181430
> Node2=-66623770
> }
> Edge{
> Node1=2144181430
> Node2=150343685
> } .....
> So I create a node for every site and an edge for every link.
> Is this done by the full link graph database?
> That's all for now.
> I'd be so glad if someone could help me.
> Thanks, Nils
