nutch-dev mailing list archives

From Nils Höller <nils_hoel...@web.de>
Subject Re: Website Visualization Questions
Date Mon, 11 Jul 2005 15:26:41 GMT

Hi Fredrik,

Thanks for that information.
That sounds really good to me.
It would be perfect to
handle just one product instead
of a different one for every single task.

Anyway, can you tell me whether it is possible for
clients to add their URL requests to a URL list,
from which Nutch takes the next URL each time
to crawl it, index it, and so on?

I mean, I have read about this in the FAQ:
1. add to the URL list
2. start Nutch for the list

But I'd like to know
whether it is possible to have Nutch
run as a permanent process that checks a specific
file whenever it is ready for a new job,
while clients insert URL requests
into that list.
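
To make the question concrete, here is a minimal sketch of the kind of
permanent process I have in mind. It is plain Java and does not use any
Nutch API; UrlListPoller, pollOnce and runForever are hypothetical names,
and the job callback only stands in for whatever would start the crawl
and indexing for one URL.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical poller; none of these names are part of Nutch. */
class UrlListPoller {
    private final Path urlList;                              // file that clients append URLs to
    private final Set<String> seen = new LinkedHashSet<>();  // URLs already handed out

    UrlListPoller(Path urlList) {
        this.urlList = urlList;
    }

    /** Returns the URLs in the file that have not been handed out before. */
    List<String> pollOnce() throws IOException {
        List<String> fresh = new ArrayList<>();
        if (!Files.exists(urlList)) {
            return fresh;
        }
        for (String line : Files.readAllLines(urlList)) {
            String url = line.trim();
            // Set.add returns false for URLs seen earlier, so they are skipped.
            if (!url.isEmpty() && seen.add(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    /** Polls forever, passing each new URL to the job (assumed: a crawl and index run). */
    void runForever(Consumer<String> job, long sleepMillis)
            throws IOException, InterruptedException {
        while (true) {
            for (String url : pollOnce()) {
                job.accept(url);
            }
            Thread.sleep(sleepMillis);
        }
    }
}
```

The file is re-read on every poll, which is simple but only reasonable
for a short list; a real setup would probably want a proper queue.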

Is Nutch smart enough to index only sites
that it has not indexed yet?
That is, if a URL has already been processed, it won't start
indexing again, and the user will
be shown the results right away.
(I am thinking of a method that shows
you the graph if the URL is indexed, or
a page saying "come back later when nutch is finished"
when it is still a job for Nutch, which means the URL
has been put into that URL list.)
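
Roughly, the behaviour I am thinking of looks like the sketch below.
Again this is plain Java with hypothetical names (GraphRequestHandler,
handle, nextJob, markFinished), not anything taken from Nutch, and the
graph is just a String placeholder for the real graph data.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical request handler; not a Nutch class. */
class GraphRequestHandler {
    static final String COME_BACK_LATER = "come back later when nutch is finished";

    private final Map<String, String> finishedGraphs = new HashMap<>(); // url -> graph data
    private final Queue<String> todo = new ArrayDeque<>();              // urls waiting for a crawl

    /** Returns the stored graph, or queues the URL once and asks the user to return. */
    synchronized String handle(String url) {
        String graph = finishedGraphs.get(url);
        if (graph != null) {
            return graph;
        }
        if (!todo.contains(url)) {
            todo.add(url); // avoid queueing the same URL twice
        }
        return COME_BACK_LATER;
    }

    /** Called by the crawl worker: the next waiting URL, or null if there is none. */
    synchronized String nextJob() {
        return todo.peek();
    }

    /** Called by the crawl worker once the graph for a URL is ready. */
    synchronized void markFinished(String url, String graph) {
        todo.remove(url);
        finishedGraphs.put(url, graph);
    }
}
```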


Thanks very much 
Nils

nutch-dev@lucene.apache.org wrote on 11.07.05 16:51:28:
> 
> Hi!
> 
> The crawler and link-structure information comes "free" with Nutch.
> Once you have crawled a site, you can use the WebDBReader class to
> extract the link information for further processing in a visualization
> step. Simply put: Iterate crawled pages with the SegmentReader class
> (open the segment you just crawled), extract the url from each page
> (as an MD5Hash object), get the links to/from that url with the
> WebDBReader and pass an appropriate structure to your visualization
> application.
> 
> The structure that you suggested, with edges and nodes, would be very
> easy to implement once you get the hang of the Reader-classes for
> accessing Nutch's guts.
> 
> Fredrik
> 
> On 7/11/05, Nils Hoeller <nils_hoeller@web.de> wrote:
> > Hi,
> > 
> > I'm currently working on a "service" that lets
> > you enter a URL and visualizes that domain
> > (internal links only).
> > Then there'll be some kind of adaptive behaviour,
> > so that the graph adapts to your wishes
> > (searches, ranks ...)
> > 
> > I have a prototype that uses:
> > 
> > 1. Arachnid as a crawler
> > 2. Lucene as the indexer
> > 3. Touchgraph for Visualization.
> > 
> > It works as a standalone client,
> > though it is slow when you
> > enter a new URL for visualization
> > (which is OK, because of the crawling and indexing ...)
> > 
> > Now I'd like to change the application:
> > Arachnid and Lucene should be replaced
> > by Nutch.
> > 
> > My wish is a service that:
> > 1. Visualizes sites that have already been crawled and indexed
> > 2. Lets you enter a new URL
> > and processes it for you while you are online.
> > 
> > So my questions:
> > 
> > 1. Is it possible to do such things with Nutch?
> > I mean: can I start a process that works through
> > a list of URLs (crawling, indexing, and creating a file that
> > represents the graph structure)
> > while clients enter URLs that are added to this to-do list?
> > 
> > 2. I've read about the web database (including the full link graph).
> > Where can I read more about it? Does it automatically build some kind of
> > representation of the site for me?
> > 
> > I mean I need (as in my former application)
> > something like:
> > 
> > Node{
> > ID=2144181430
> > Title=Institute of Information Systems Universität zu Lübeck
> > Schleswig-Holstein
> > URL=http://www.ifis.uni-luebeck.de/index.html
> > Number of Request=0
> > }
> > Edge{
> > Node1=2144181430
> > Node2=-66623770
> > }
> > Edge{
> > Node1=2144181430
> > Node2=150343685
> > } .....
> > 
> > So I create a node for every site and an edge for every link.
> > Is this what the full link graph database does?
> > 
> > That's all for now.
> > 
> > I'd be very glad if someone could help me.
> > 
> > Thanks, Nils
> >
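
For reference, the Node/Edge format quoted above could be written from a
plain link map roughly like this. It is only a sketch under assumptions:
GraphExporter is a hypothetical name, the two maps stand in for whatever
is read from the web database, and using String.hashCode() of the URL as
the node ID is an assumption made for this example.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical exporter; the maps stand in for data read from the web database. */
class GraphExporter {
    /** Node ID: assumed here to be the hash code of the URL string. */
    static int idOf(String url) {
        return url.hashCode();
    }

    /** Writes one Node block per entry in titles and one Edge block per link. */
    static String export(Map<String, String> titles, Map<String, List<String>> links) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> page : titles.entrySet()) {
            out.append("Node{\n")
               .append("ID=").append(idOf(page.getKey())).append('\n')
               .append("Title=").append(page.getValue()).append('\n')
               .append("URL=").append(page.getKey()).append('\n')
               .append("Number of Request=0\n")
               .append("}\n");
        }
        for (Map.Entry<String, List<String>> source : links.entrySet()) {
            for (String target : source.getValue()) {
                out.append("Edge{\n")
                   .append("Node1=").append(idOf(source.getKey())).append('\n')
                   .append("Node2=").append(idOf(target)).append('\n')
                   .append("}\n");
            }
        }
        return out.toString();
    }
}
```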

-- 
-------------------------------------------------------------------
Nils Höller

nils_hoeller@web.de
hoeller@informatik.uni-luebeck.de



