nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Mon, 16 Jul 2007 12:28:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512930 ]

Doğacan Güney commented on NUTCH-439:
-------------------------------------

A big +1 from me. Though, it may be useful to break this patch into multiple pieces (fixes
to OPIC and the build system as a separate patch, core changes as a separate patch, and the
plugin as a separate patch).

IMHO, most usages of URL.getHost should be replaced with this patch's getDomainName. For example,
the "host" field in the index currently gets a big boost, but hosts are easy to spam: just
register the domain 'example.com', set up your own DNS, and add 'foo.example.com',
'bar.example.com', and 'baz.example.com'. I have actually seen a lot of spam sites that do this.
Using domain names in linkdb as well would reduce anchor spam (where 'foo.example.com' links to
'bar.example.com', and Nutch considers this an external link and stores the anchor).
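
To make the linkdb point concrete, here is a minimal sketch of a domain-based "internal link" check. It is not the patch's code: DomainSketch, domainOf, isInternalLink, and the tiny hard-coded suffix set are all hypothetical stand-ins for whatever getDomainName actually does, and a real implementation would need a complete TLD list.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DomainSketch {
    // Hypothetical, deliberately tiny suffix list; for illustration only.
    private static final Set<String> SUFFIXES =
        new HashSet<>(Arrays.asList("com", "org", "net", "co.uk", "com.tr"));

    // Returns the registered domain: the longest known suffix plus one label.
    static String domainOf(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        // i = 0 tries the whole host first, so the longest suffix wins.
        for (int i = 0; i < labels.length; i++) {
            String candidate =
                String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return i == 0
                    ? h
                    : String.join(".", Arrays.copyOfRange(labels, i - 1, labels.length));
            }
        }
        return h; // unknown suffix: fall back to the full host
    }

    // Treats two URLs as internal to each other when their domains match,
    // even if their hosts differ.
    static boolean isInternalLink(String fromUrl, String toUrl) {
        String from = URI.create(fromUrl).getHost();
        String to = URI.create(toUrl).getHost();
        return from != null && to != null && domainOf(from).equals(domainOf(to));
    }
}
```

With a check like this, a link from 'foo.example.com' to 'bar.example.com' would count as internal, so its anchor would not be stored as if it came from an independent site.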

Another example is the generator. Instead of partitioning on host or IP, we can partition URLs
by domain. This avoids the overhead of resolving IPs, and IP resolution has its own problems:
URLs under the same domain (sometimes even the same URL) may be served from different IPs
(think load balancers and the like). Partitioning by domain will also be much more polite and
more resistant to honey pots.
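
As a rough sketch of the generator idea, the snippet below assigns a URL to a partition using only its domain, with no DNS lookup. The names DomainPartitionSketch, naiveDomain, and partitionFor are hypothetical and are not Nutch's partitioner API; naiveDomain simply keeps the last two host labels, which is an assumption that breaks for suffixes such as 'co.uk'.

```java
import java.net.URI;

class DomainPartitionSketch {
    // Naive assumption: the domain is the last two labels of the host.
    static String naiveDomain(String host) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        int n = labels.length;
        return n <= 2 ? h : labels[n - 2] + "." + labels[n - 1];
    }

    // Picks a partition from the domain alone; no IP resolution is involved.
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        // Fall back to the raw URL when no host can be parsed.
        String key = host == null ? url : naiveDomain(host);
        // Mask the sign bit so the result is always in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because every host under one domain maps to the same key, all of its URLs land in the same partition, so per-partition politeness limits would apply to the whole domain rather than to each subdomain separately.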

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch, tld_plugin_v2.1.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS. TLDs
> are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three
> groups: infrastructure, generic (such as "com", "edu"), and country code TLDs (such as "de",
> "tr"). Indexing the top level domain, and optionally boosting it, is needed for improving
> search results and enhancing locality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

