nutch-dev mailing list archives

From Nils Hoeller <nilshoel...@arcor.de>
Subject Creation of a Graph File with the DB Link Graph Database
Date Mon, 08 Aug 2005 10:35:43 GMT
Hi,

my searcher is currently running on my Nutch-built index.

Everything seems to work, so I am moving on to a main part of my app.

Before Nutch I used Arachnid as a crawler.

During crawling I used this method:
    /**
     * Each page considered for insertion into the sitemap graph is stored in the
     * created directories, and page information such as the URL is written to the
     * graph file.
     */
    protected void handleLink(PageInfo p) {
        task.setCurrent(task.getCurrent() + 1);
        task.setMessage("Tracking...");

        String link = p.getUrl().toString();
        String title = p.getTitle();
        System.out.println("Link :" + link);

        // Skip pages without a usable URL or title.
        if (link == null || title == null || link.length() == 0 || title.length() == 0) {
            return;
        }

        // The node ID is the hash of the URL path, the same scheme used for edge targets below.
        int id = p.getUrl().getPath().hashCode();
        int accessCount = (int) logs.getAccessCountByID(String.valueOf(id));

        try {
            storeFile(p.getUrl());
            out.write("Node{\r\n"
                    + "ID=" + id + "\r\n"
                    + "Title=" + title + "\r\n"
                    + "URL=" + link + "\r\n"
                    + "Number of Request=" + accessCount + "\r\n"
                    + "}\r\n");
            // Write one edge per internal link found on the page.
            for (int i = 0; i < p.getLinksIntern().size(); i++) {
                URL urllink = (URL) p.getLinksIntern().get(i);
                out.write("Edge{\r\n"
                        + "Node1=" + id + "\r\n"
                        + "Node2=" + urllink.getPath().hashCode() + "\r\n"
                        + "}\r\n");
            }
        } catch (IOException ie) {
            ie.printStackTrace();
        }
    }

which builds a graph file that looks like this:

Node{
ID=2144181430
Title=Institute of Information Systems Universität zu Lübeck Schleswig-Holstein
URL=http://www.ifis.uni-luebeck.de/index.html
Number of Request=0
}
Edge{
Node1=2144181430
Node2=-66623770
}
Edge{
Node1=2144181430
Node2=150343685
}
Edge{
Node1=2144181430
Node2=1049931629
}

and so on.
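
To make the file format independent of the crawler, here is a minimal, self-contained sketch of a writer for it. The class GraphFileWriter and its methods idFor, node, edge and writePage are hypothetical names made up for this illustration; they are not part of Nutch or Arachnid, and the outlink URL in main is only an example path. The IDs are derived from the URL path as in handleLink above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.net.URI;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for the Node{...}/Edge{...} format; not a Nutch or Arachnid class. */
class GraphFileWriter {
    private static final String EOL = "\r\n";

    /** Same ID scheme as handleLink: the hash code of the URL path. */
    static int idFor(String url) {
        return URI.create(url).getPath().hashCode();
    }

    /** Formats one Node block. */
    static String node(int id, String title, String url, int accessCount) {
        return "Node{" + EOL
             + "ID=" + id + EOL
             + "Title=" + title + EOL
             + "URL=" + url + EOL
             + "Number of Request=" + accessCount + EOL
             + "}" + EOL;
    }

    /** Formats one Edge block from a parent node ID to a target node ID. */
    static String edge(int fromId, int toId) {
        return "Edge{" + EOL
             + "Node1=" + fromId + EOL
             + "Node2=" + toId + EOL
             + "}" + EOL;
    }

    /** Writes a page's Node block followed by one Edge block per outgoing link. */
    static void writePage(Writer out, String url, String title,
                          int accessCount, List<String> outlinks) throws IOException {
        int id = idFor(url);
        out.write(node(id, title, url, accessCount));
        for (String target : outlinks) {
            out.write(edge(id, idFor(target)));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        // The outlink below is an example URL chosen for this sketch.
        writePage(out,
                  "http://www.ifis.uni-luebeck.de/index.html",
                  "Institute of Information Systems",
                  0,
                  Arrays.asList("http://www.ifis.uni-luebeck.de/example.html"));
        System.out.print(out);
    }
}
```

With this split, whatever supplies the pages and links only has to provide a URL, a title, an access count and a list of outgoing link URLs. Note that an ID taken from the path's hashCode() is not guaranteed to be unique, since two different paths can hash to the same value.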

Now my question:

How can this be done with the help of the Nutch WebDB?
Can I query it for all nodes (pages) with their ID, title, URL and number of requests,
and for all edges (links) with their parent and target node IDs?


Thank you for your help,

Nils

