nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-635) LinkAnalysis Tool for Nutch
Date Sun, 15 Jun 2008 16:10:45 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605142#action_12605142 ]

Dennis Kubes commented on NUTCH-635:
------------------------------------

Andrzej Bialecki wrote:

> One more question: you said the algorithm converges, but do you have a reference set
> of values from this dataset, calculated using some other pagerank impl? It would be worthwhile
> to make sure that the values are indeed the PageRank, as described, and not yet
> another subtle variation such as our OPIC

I was doing it low tech.  If you turn on debug logging (warning: the output is large) and
use grep, you can see the scores converge after a few iterations ;)
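
As a rough illustration of checking convergence numerically rather than by eye, the
sketch below compares score snapshots from successive iterations.  The function names,
the dict-of-scores layout, and the tolerance are assumptions made for this example;
they are not taken from the patch or from Nutch.

```python
def l1_delta(prev, curr):
    """Sum of absolute score changes between two iterations.

    prev and curr map URL -> score; a URL missing from one side counts as 0.0.
    """
    urls = set(prev) | set(curr)
    return sum(abs(curr.get(u, 0.0) - prev.get(u, 0.0)) for u in urls)


def has_converged(history, tol=1e-4):
    """True when the last two snapshots in history differ by less than tol.

    history is a list of URL -> score dicts, one per iteration (hypothetical layout).
    """
    if len(history) < 2:
        return False
    return l1_delta(history[-2], history[-1]) < tol
```

A shrinking l1_delta from one iteration to the next is the same signal as watching the
per-URL scores settle in the debug output.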

> There are a few Java packages for computing PageRank, we could adapt one of those to
> serve as a baseline:
> 
> http://law.dsi.unimi.it/
> http://webla.sourceforge.net/javadocs/pt/tumba/links/PageRank.html

I agree it would be a good comparison.  Strictly speaking, though, it is not just PageRank.
There are optimizations for multiple links from a given domain, penalties for very few inlinks,
and a minimum score value, all of which can be changed through the configuration.
Besides that, it does follow the original PageRank algorithm closely.
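
For readers unfamiliar with the baseline being discussed, here is a minimal sketch of
iterative PageRank with two of the adjustments mentioned above bolted on.  It is not the
code from the patch.  The function name, the parameters (damping, min_score,
one_vote_per_host), and the way each adjustment is applied are assumptions for
illustration: the same-domain handling is approximated by keeping only the largest
contribution per source host, and the inlink penalty is left out because its rule is
not described here.

```python
from urllib.parse import urlparse


def link_scores(outlinks, damping=0.85, iterations=20,
                min_score=0.0, one_vote_per_host=False):
    """Iterative PageRank over a dict mapping URL -> list of outlink URLs."""
    # Collect every URL that appears as a source or as a target.
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    scores = {url: 1.0 / n for url in nodes}

    for _ in range(iterations):
        # votes[dst] maps a source key (page URL or host) to its contribution.
        votes = {url: {} for url in nodes}
        dangling = 0.0
        for src in nodes:
            targets = set(outlinks.get(src, ()))
            if not targets:
                # Pages without outlinks spread their score over all pages.
                dangling += scores[src]
                continue
            share = scores[src] / len(targets)
            # Assumed rule: with one_vote_per_host, pages on the same host
            # share one key, so only their largest contribution is kept.
            key = urlparse(src).hostname if one_vote_per_host else src
            for dst in targets:
                bucket = votes[dst]
                bucket[key] = max(bucket.get(key, 0.0), share)
        scores = {
            url: max(min_score,
                     (1.0 - damping) / n
                     + damping * (sum(votes[url].values()) + dangling / n))
            for url in nodes
        }
    return scores
```

With both adjustments switched off this is the textbook iteration, which is what a
baseline implementation would be expected to reproduce.  With either one switched on,
the scores no longer sum to 1, so a comparison against a plain PageRank baseline would
be fairest with the adjustments disabled.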

> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, NUTCH-635-3-20080614.patch,
>                      NUTCH-635-4-20080615.patch
>
>
> This is a basic pagerank type link analysis tool for nutch which simulates a sparse matrix
> using inlinks and outlinks and converges after a given number of iterations.  This tool is
> meant to replace the current scoring system in nutch with a system that converges instead of
> exponentially increasing scores.  Also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

