nutch-dev mailing list archives

From "David Spencer (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-12) WebDBReader options to print incoming links
Date Fri, 18 Mar 2005 19:10:20 GMT
WebDBReader options to print incoming links
-------------------------------------------

         Key: NUTCH-12
         URL: http://issues.apache.org/jira/browse/NUTCH-12
     Project: Nutch
        Type: New Feature
 Environment: n/a
    Reporter: David Spencer


It seems that the options to WebDBReader.main() don't show incoming links, so I added code
to do this consistently.

I added 2 new options:

[1]    -linktourl URL

Prints out the links to one URL.

[2]    -dumplinksto

Like -dumplinks but instead prints out incoming links.
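
Both options do the same lookup: find the links that point at a URL, then resolve each link's source ID back to the page(s) stored under that ID. Below is a minimal, self-contained sketch of that lookup using in-memory maps. It does not use the Nutch classes; the class IncomingLinksSketch, its methods, and the example IDs and URLs are all hypothetical stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the incoming-link lookup; not the Nutch WebDB API.
public class IncomingLinksSketch {
    // Source-page ID -> URLs stored under that ID (an ID may map to several pages).
    private final Map<String, List<String>> urlsById = new HashMap<>();
    // Target URL -> IDs of the pages that link to it.
    private final Map<String, List<String>> fromIdsByTarget = new HashMap<>();

    void addPage(String id, String url) {
        urlsById.computeIfAbsent(id, k -> new ArrayList<>()).add(url);
    }

    void addLink(String fromId, String toUrl) {
        fromIdsByTarget.computeIfAbsent(toUrl, k -> new ArrayList<>()).add(fromId);
    }

    // Analogue of -linktourl: URLs of all pages that link to the given URL.
    List<String> incoming(String url) {
        List<String> sources = new ArrayList<>();
        List<String> fromIds =
            fromIdsByTarget.getOrDefault(url.trim(), Collections.emptyList());
        for (String fromId : fromIds) {
            // Resolve the source ID back to the page URL(s) stored under it.
            sources.addAll(urlsById.getOrDefault(fromId, Collections.emptyList()));
        }
        return sources;
    }

    public static void main(String[] args) {
        IncomingLinksSketch db = new IncomingLinksSketch();
        db.addPage("id-a", "http://a.example/");
        db.addPage("id-b", "http://b.example/");
        db.addLink("id-a", "http://c.example/");
        db.addLink("id-b", "http://c.example/");

        List<String> sources = db.incoming("http://c.example/");
        System.out.println("Found " + sources.size() + " incoming links.");
        for (String source : sources) {
            System.out.println(" from " + source);
        }
    }
}
```

The -dumplinksto analogue would simply call incoming() for every page in the database and print only the pages with at least one result.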

What follows is output from "svn diff -N" which I believe is how things are supposed to be
submitted.


dave@smo-eng-10 /cygdrive/f/proj/java/nutch/latest/nutch/trunk/src/java/org/apache/nutch/db
$ svn diff -N WebDBReader.java
Index: WebDBReader.java
===================================================================
--- WebDBReader.java    (revision 158116)
+++ WebDBReader.java    (working copy)
@@ -409,7 +409,7 @@
      */
     public static void main(String argv[]) throws FileNotFoundException, IOException {
         if (argv.length < 2) {
-            System.out.println("Usage: java org.apache.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] | [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] | [-stats]");
+            System.out.println("Usage: java org.apache.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] | [-toppages <k>] | [-linkurl url] | [-linktourl url] | [-linkmd5 md5] | [-dumplinks] | [-dumplinksto] | [-stats]");
             return;

         }
@@ -521,6 +521,35 @@
                     System.out.println();
                   }
                 }
+            } else if ("-linktourl".equals(cmd)) {
+                String url = argv[i++];
+                Link links[] = reader.getLinks( new UTF8( url.trim()));
+                System.out.println("Found " + links.length + " incoming links.");
+                for ( int j = 0; j < links.length; j++) {
+                    MD5Hash from = links[ j].getFromID();
+                    Page[] ps = reader.getPages( from);
+                    for( int k = 0; k < ps.length; k++) {
+                        System.out.println( " from " + ps[ k].getURL().toString());
+                    }
+                }
+            } else if ("-dumplinksto".equals(cmd)) {
+                System.out.println(reader);
+                System.out.println();
+                Enumeration e = reader.pagesByMD5();
+                while (e.hasMoreElements()) {
+                  Page page = (Page) e.nextElement();
+                  Link[] links = reader.getLinks( page.getURL());
+                  if ( links.length > 0) {
+                      System.out.println( "These pages link to " + page.getURL());
+                      for ( int j = 0; j < links.length; j++) {
+                          MD5Hash from = links[ j].getFromID();
+                          Page[] ps = reader.getPages( from);
+                          for( int k = 0; k < ps.length; k++) {
+                              System.out.println( " from " + ps[ k].getURL().toString());
+                          }
+                      }
+                  }
+                }
             } else if ("-stats".equals(cmd)) {
                 System.out.println("Stats for " + reader);
                 System.out.println("-------------------------------");



