nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2039) Relevance based scoring filter
Date Tue, 16 Jun 2015 17:45:01 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588443#comment-14588443 ]

Lewis John McGibbney commented on NUTCH-2039:
---------------------------------------------

Good work, I am +1 for this patch.
Some possible future improvements:
 * a wiki page explaining exactly what the cosine similarity measure entails; this could be
referenced from a simple README.md in the plugin directory.
 * abstracting the core similarity functionality into interfaces, as there are many different
similarity metrics which could be used. This would mean that others could contribute additional
similarity algorithms for pages.
Excellent work. I will commit EoB unless there are objections.
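
To illustrate the two suggestions above, here is a minimal sketch of cosine similarity behind a
pluggable interface. It is not taken from the patch: the names SimilaritySketch, SimilarityModel,
CosineSimilarity and termFrequencies are hypothetical, and the whitespace/punctuation tokenizer
is an assumption made only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the code from the NUTCH-2039 patch.
class SimilaritySketch {

  /** Assumed abstraction: any metric comparing two term-frequency vectors. */
  interface SimilarityModel {
    double similarity(Map<String, Integer> goldStandard, Map<String, Integer> page);
  }

  /** Cosine similarity: dot(a, b) / (|a| * |b|); returns 0.0 if either vector is empty. */
  static class CosineSimilarity implements SimilarityModel {
    @Override
    public double similarity(Map<String, Integer> a, Map<String, Integer> b) {
      double dot = 0.0;
      for (Map.Entry<String, Integer> e : a.entrySet()) {
        Integer other = b.get(e.getKey());
        if (other != null) {
          dot += e.getValue() * other.doubleValue();
        }
      }
      double normA = norm(a);
      double normB = norm(b);
      return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
      double sum = 0.0;
      for (int count : v.values()) {
        sum += (double) count * count;
      }
      return Math.sqrt(sum);
    }
  }

  /** Assumed tokenizer: lower-case the text and split on non-word characters. */
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) {
        tf.merge(token, 1, Integer::sum);
      }
    }
    return tf;
  }

  public static void main(String[] args) {
    SimilarityModel model = new CosineSimilarity();
    Map<String, Integer> gold = termFrequencies("focused web crawling with Apache Nutch");
    Map<String, Integer> page = termFrequencies("Apache Nutch is a web crawler");
    System.out.println("similarity = " + model.similarity(gold, page));
  }
}
```

With an interface of this kind, another metric could be added as a further SimilarityModel
implementation without touching the scoring filter itself.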

> Relevance based scoring filter
> ------------------------------
>
>                 Key: NUTCH-2039
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2039
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Sujen Shah
>            Assignee: Sujen Shah
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A ScoringFilter plugin that uses a similarity measure to calculate the similarity between
> a given page (the gold standard) and the currently parsed page. The score obtained from this
> similarity is then distributed to the parsed page's outlinks. The filter aims to focus the
> crawler on crawling/exploring relevant pages.
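
The distribution step described above could be sketched roughly as follows. The issue text does
not say how the score is divided, so this example assumes that every outlink simply inherits the
parent page's similarity score unchanged; OutlinkScoreSketch and distribute are hypothetical
names and are not part of the Nutch ScoringFilter API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real plugin may distribute scores differently.
class OutlinkScoreSketch {

  /**
   * Assumption: each outlink inherits the parent's similarity score as-is,
   * so links found on relevant pages are ranked ahead of links found on
   * irrelevant ones when the next fetch list is generated.
   */
  static Map<String, Float> distribute(float pageScore, List<String> outlinks) {
    Map<String, Float> scores = new LinkedHashMap<>();
    for (String url : outlinks) {
      scores.put(url, pageScore);
    }
    return scores;
  }

  public static void main(String[] args) {
    Map<String, Float> scores =
        distribute(0.8f, List.of("http://example.org/a", "http://example.org/b"));
    scores.forEach((url, score) -> System.out.println(url + " -> " + score));
  }
}
```

An alternative under the same assumptions would be to split the parent's score evenly across its
outlinks, which penalizes pages that link out very widely.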



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
