nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Tue, 10 Jul 2007 09:24:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511362 ]

Andrzej Bialecki  commented on NUTCH-439:
-----------------------------------------

Very nice patch! A couple of comments:

* the fix to OPICScoringFilter - I will commit this separately (no need to create
a separate patch).

* IP_PATTERN - it could be tighter: instead of \\d+ it could use \\d{1,3}.

* the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The reason is that
it's a commonly requested enhancement, but specific requirements vary wildly. Some users prefer
to build a separate DB that holds statistical info and can be used in various steps of the
work cycle, others still prefer one-time tools such as this one.
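
To illustrate the IP_PATTERN point, here is a minimal sketch comparing the two quantifiers. The exact pattern in the patch is not quoted in this message, so the dotted-quad shape below, the class name IpPatternSketch and the helper looksLikeIp are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

public class IpPatternSketch {
    // Assumed loose form: any run of digits in each of the four groups.
    public static final Pattern LOOSE = Pattern.compile("(\\d+\\.){3}\\d+");

    // Tighter form along the lines suggested above: at most three digits per group.
    public static final Pattern TIGHT = Pattern.compile("(\\d{1,3}\\.){3}\\d{1,3}");

    // Hypothetical helper: true if the whole host string matches the pattern.
    public static boolean looksLikeIp(Pattern pattern, String host) {
        return pattern.matcher(host).matches();
    }

    public static void main(String[] args) {
        String[] hosts = {"192.168.0.1", "1234.5.6.7", "999.1.1.1", "example.com"};
        for (String host : hosts) {
            System.out.println(host
                + " loose=" + looksLikeIp(LOOSE, host)
                + " tight=" + looksLikeIp(TIGHT, host));
        }
    }
}
```

The tighter form rejects groups of four or more digits, but it still accepts out-of-range values such as 999, so it narrows the match without fully validating an IPv4 address.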

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS. TLDs
> are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three
> groups: infrastructure, generic (such as "com", "edu") and country code TLDs (such as "en",
> "de", "tr"). Indexing the top level domain, and optionally boosting on it, is needed to
> improve search results and enhance locality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

