nutch-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Tue, 10 Jul 2007 07:51:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.0.patch

I have made major improvements to the code and configuration files. The contribution is no
longer just a plugin: it now consists of a package, one large XML file, and an indexing/scoring
plugin (which is disabled by default). The list of recognized suffixes is no longer limited to
top level domains; second or third level public domain names can also be recognized.
Accordingly, the patch changes the naming from "top level domains" to "domain suffixes".

This patch also introduces the URLUtil class, which includes methods for getting the domain
name or the public domain suffix of a URL. Finding the domain name of a URL is important for
several reasons. First, we can use this function as a replacement for URL.getHost() in LinkDB
when ignoring internal links, or in similar contexts. Second, we can perform statistical
analysis on domain names. Third, we can list the subdomains under a domain, and so on.
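
As a rough illustration of the domain-name lookup described above (this is not the code in the
patch), the Java sketch below strips leading labels from the host until the remainder matches
a known suffix, then keeps one more label in front of that suffix. The class name DomainSketch,
the method names, and the six-entry suffix set are assumptions made for this example only; the
real suffix list lives in domain-suffixes.xml.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DomainSketch {

  // Hypothetical, tiny stand-in for the entries of domain-suffixes.xml.
  private static final Set<String> SUFFIXES = new HashSet<String>(
      Arrays.asList("org", "com", "uk", "co.uk", "tr", "com.tr"));

  /** Returns the longest known suffix of the host, or null if none matches. */
  static String getDomainSuffix(String host) {
    String candidate = host.toLowerCase();
    while (true) {
      // Candidates are tried from longest to shortest, so "co.uk" wins over "uk".
      if (SUFFIXES.contains(candidate)) {
        return candidate;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null;
      }
      candidate = candidate.substring(dot + 1);
    }
  }

  /** Returns the suffix plus the one label before it; falls back to the host. */
  static String getDomainName(String url) throws MalformedURLException {
    String host = new URL(url).getHost().toLowerCase();
    String suffix = getDomainSuffix(host);
    if (suffix == null || host.equals(suffix)) {
      return host; // unknown suffix (e.g. an IP address) or a bare suffix
    }
    // Everything before ".suffix"; the last label of it belongs to the domain.
    String prefix = host.substring(0, host.length() - suffix.length() - 1);
    return host.substring(prefix.lastIndexOf('.') + 1);
  }

  /** Treats a link as internal when both URLs share a domain name. */
  static boolean isInternalLink(String fromUrl, String toUrl)
      throws MalformedURLException {
    return getDomainName(fromUrl).equals(getDomainName(toUrl));
  }

  public static void main(String[] args) throws MalformedURLException {
    // prints "apache.org"
    System.out.println(getDomainName("http://lucene.apache.org/nutch"));
    // prints "true": the hosts differ, but the domain is the same
    System.out.println(isInternalLink("http://lucene.apache.org/nutch",
                                      "http://hadoop.apache.org/"));
  }
}
```

Comparing by domain rather than by URL.getHost() means that links between lucene.apache.org and
hadoop.apache.org would count as internal, which is the behaviour the LinkDB use case above asks
for.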

I have changed build.encoding to UTF-8 so that non-ASCII characters are recognized.

Here is an excerpt from the domain-suffixes.xml file:
	This document contains top level domains
	as described by the Internet Assigned Numbers
	Authority (IANA), and second or third level domains that
	are known to be managed by domain registrars. The
	Mozilla community calls these public suffixes or effective
	TLDs. There is no algorithmic way of knowing whether a suffix
	is a public domain suffix or not, so this large file is used
	for this purpose. The entries in the file are used to find the
	domain of a URL, which may not be the same thing as the host of
	the URL. For example, for "http://lucene.apache.org/nutch" the
	hostname is lucene.apache.org, but the domain name for this
	URL would be apache.org. Domain names can be quite handy for
	statistical analysis and for fighting spam.

	The list of TLDs is constructed from IANA, and the
	list of "effective TLDs" is constructed from Wikipedia,
	http://wiki.mozilla.org/TLD_List, and http://publicsuffix.org/.
	The list may not include all the suffixes, but some
	effort has been spent to make it comprehensive. Please forward
	any improvements for this list to the nutch-dev mailing list or
	the Nutch JIRA.




> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS. TLDs
> are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three
> groups: infrastructure, generic (such as "com", "edu"), and country code TLDs (such as "de",
> "tr"). Indexing the top level domain, and optionally boosting on it, is needed to improve
> search results and enhance locality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

