nutch-dev mailing list archives

From Fredrik Andersson <fidde.anders...@gmail.com>
Subject Re: Website Visualization Questions
Date Mon, 11 Jul 2005 20:33:16 GMT
Hi Nils!

If I am not totally off track, the 0.7 version (currently 0.7-dev, in
the CVS trunk) runs as a daemon process. That is, it will poll the file
with the URLs whenever it has nothing else to do, so that should solve
your problem.
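
Just to illustrate the polling pattern I mean (my own sketch, nothing
Nutch-specific; the file name and the interval are made up):

  import java.io.*;

  // Tail-style poller: wake up periodically and read any lines
  // appended to the URL file since the last pass.
  public class UrlFilePoller {
      public static void main(String[] args) throws Exception {
          File urlFile = new File("urls.txt");   // hypothetical location
          long offset = 0;
          while (true) {
              if (urlFile.length() > offset) {
                  RandomAccessFile in = new RandomAccessFile(urlFile, "r");
                  in.seek(offset);
                  String line;
                  while ((line = in.readLine()) != null) {
                      System.out.println("queueing " + line);  // hand to the crawler
                  }
                  offset = in.getFilePointer();
                  in.close();
              }
              Thread.sleep(10000);   // idle for ten seconds, then poll again
          }
      }
  }
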
Regarding the duplicate content: as you can see in the tutorial, there
is a very simple step for deleting duplicates once your crawl has
finished. Personally, I don't see why the Nutch crawler does not keep a
hashset or similar of visited pages. I often get loops where the same
site is crawled over and over again, so if you want to restrict that,
it is not a hard modification to make if you have ever written some
code (see the sketch below). I'm sure Doug has a perfectly good reason
why the crawler runs the way it does, I just haven't figured it out yet
(I'm also quite new to Nutch).
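
Something along these lines would do it (an untested sketch; nothing
here is Nutch-specific, you would have to hook it in wherever the
fetch list is built and persist the set between runs):

  import java.util.*;

  // Crude visited-set guard: pass through only URLs that have not
  // been handed to the fetcher before.
  public class VisitedFilter {
      private final Set visited = new HashSet();

      public List filterNew(List candidates) {
          List fresh = new ArrayList();
          for (Iterator it = candidates.iterator(); it.hasNext();) {
              String url = (String) it.next();
              if (visited.add(url)) {   // add() returns false on duplicates
                  fresh.add(url);
              }
          }
          return fresh;
      }
  }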

Hope it helps,
Fidde

On 7/11/05, Nils Höller <nils_hoeller@web.de> wrote:
> 
> Hi Fredrik,
> 
> thanks for that information.
> That sounds really good to me.
> I mean, it would be perfect to
> handle just one product instead
> of a different one for every single task.
> 
> Anyway, can you tell me whether it is possible for
> clients to insert their URL requests into a URL list,
> from which Nutch then takes the next URL each time to
> do the indexing and so on?
> 
> I mean, I have read about this in the FAQ:
> 1. add to the URL list
> 2. start Nutch for the list
> 
> I'd like to know, though,
> if it is possible to have Nutch
> run as a permanent process that looks into a specific
> file whenever it is ready to do a new job,
> and, besides that, to have clients insert their URL requests
> into that list.
> 
> Is Nutch smart enough to index only sites
> that it has not indexed yet?
> So that if a URL is already prepared, it won't start
> the indexing again, and in that case the user will
> be presented with the results.
> (I am thinking of a method that presents
> you the graph if the URL is indexed, or
> a page saying "come back later when Nutch is finished"
> when it is a new job for Nutch, which means the URL
> is put into that URL list.)
> 
> 
> Thanks very much
> Nils
> 
> nutch-dev@lucene.apache.org wrote on 11.07.05 16:51:28:
> >
> > Hi!
> >
> > The crawler and the link-structure information come "free" with Nutch.
> > Once you have crawled a site, you can use the WebDBReader class to
> > extract the link information for further processing in a visualization
> > step. Simply put: iterate over the crawled pages with the SegmentReader
> > class (open the segment you just crawled), extract the URL from each
> > page (or its MD5Hash), get the links to/from that URL with the
> > WebDBReader, and pass an appropriate structure to your visualization
> > application.
> >
> > The structure that you suggested, with edges and nodes, would be very
> > easy to implement once you get the hang of the Reader classes for
> > accessing Nutch's guts.
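> >
> > Roughly, something like this (an untested sketch from memory of the
> > 0.7 API; check IWebDBReader for the exact method names, and note that
> > "DumpGraph" and the output format are just made up for illustration):
> >
> >   import java.io.File;
> >   import java.util.Enumeration;
> >   import org.apache.nutch.db.*;
> >   import org.apache.nutch.fs.NutchFileSystem;
> >
> >   // Walk every page in the webdb and print node/edge pairs
> >   // that a visualizer could consume.
> >   public class DumpGraph {
> >       public static void main(String[] args) throws Exception {
> >           NutchFileSystem nfs = NutchFileSystem.get();
> >           WebDBReader db = new WebDBReader(nfs, new File(args[0]));
> >           for (Enumeration e = db.pages(); e.hasMoreElements();) {
> >               Page page = (Page) e.nextElement();
> >               System.out.println("node: " + page.getURL());
> >               // Links originating from this page, looked up by MD5.
> >               Link[] out = db.getLinks(page.getMD5());
> >               for (int i = 0; i < out.length; i++) {
> >                   System.out.println("edge: " + page.getURL()
> >                       + " -> " + out[i].getURL());
> >               }
> >           }
> >           db.close();
> >       }
> >   }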
> >
> > Fredrik
> >
> > On 7/11/05, Nils Hoeller <nils_hoeller@web.de> wrote:
> > > Hi,
> > >
> > >
> > > I'm actually working on a "service" that gives
> > > you the ability to enter a URL and visualizes that domain
> > > (only inner links).
> > > Then there will be some kind of adaptive behaviour,
> > > so that the graph will be adapted to your wishes
> > > (searches, ranks, ...).
> > >
> > > I have a prototype that uses:
> > >
> > > 1. Arachnid as a crawler
> > > 2. Lucene as the indexer
> > > 3. TouchGraph for visualization.
> > >
> > > It works as a standalone client,
> > > though it seems to be slow when you
> > > enter a new URL for visualization
> > > (which is OK, because of the crawling and indexing ...).
> > >
> > > Now I'd like to change the application:
> > > Arachnid and Lucene should be replaced
> > > by Nutch.
> > >
> > > My wish is a service that:
> > > 1. visualizes already crawled and indexed sites, and
> > > 2. gives you the option of entering a new URL
> > > and works for you while you are online.
> > >
> > > So my questions:
> > >
> > > 1. Is it possible to do such things with Nutch?
> > > I mean: can I start a process that works through
> > > a list of URLs (doing the crawling, the indexing, and the creation
> > > of a file that represents the graph structure),
> > > while clients can enter URLs that will be inserted into this to-do list?
> > >
> > > 2. I've read about the web database (including the full link graph).
> > > Where can I read more about it? Does it do this kind of representation
> > > of the site for me automatically?
> > >
> > > I mean, I need (and have already done in the former application)
> > > something like:
> > >
> > > Node{
> > > ID=2144181430
> > > Title=Institute of Information Systems Universität zu Lübeck
> > > Schleswig-Holstein
> > > URL=http://www.ifis.uni-luebeck.de/index.html
> > > Number of Request=0
> > > }
> > > Edge{
> > > Node1=2144181430
> > > Node2=-66623770
> > > }
> > > Edge{
> > > Node1=2144181430
> > > Node2=150343685
> > > } .....
> > >
> > > So I create a node for every site and an edge for every link.
> > > Is this what the full link graph database does for me?
> > >
> > > That's all for now.
> > >
> > > I'll be so glad if someone can help me.
> > >
> > > Thanks, Nils
> > >
> 
> --
> -------------------------------------------------------------------
> Nils Höller
> 
> nils_hoeller@web.de
> hoeller@informatik.uni-luebeck.de
> 
> 
