nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Tue, 10 Jul 2007 09:24:04 GMT


Andrzej Bialecki  commented on NUTCH-439:

Very nice patch! A couple comments:

* the fix to OPICScoringFilter - I will commit this separately (no need to create
a separate patch).

* IP_PATTERN - it could be tighter: instead of \\d+ it could use \\d{1,3}

* the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The reason is that
it's a common request for enhancement, but specific requirements vary wildly. Some users prefer
to build a separate DB that holds statistical info and can be used in various steps of the
work cycle, while others prefer one-time tools such as this one.
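
To illustrate the IP_PATTERN remark above, here is a minimal sketch comparing the two quantifiers. The actual pattern in the patch is not quoted in this message, so the dotted-quad expressions, the class name IpPatternSketch and the helper methods below are assumptions made purely for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the real IP_PATTERN from the patch is not shown in this
// message, so both patterns below are assumed dotted-quad forms.
class IpPatternSketch {

    // Looser form: any run of digits is accepted for each octet.
    static final Pattern LOOSE = Pattern.compile("(\\d+\\.){3}\\d+");

    // Tighter form: each octet is limited to one to three digits.
    static final Pattern TIGHT = Pattern.compile("(\\d{1,3}\\.){3}\\d{1,3}");

    static boolean isIpLoose(String host) {
        return LOOSE.matcher(host).matches();
    }

    static boolean isIpTight(String host) {
        return TIGHT.matcher(host).matches();
    }

    public static void main(String[] args) {
        String[] hosts = {"192.168.0.1", "1234.5.6.7", "example.com"};
        for (String host : hosts) {
            System.out.println(host + " loose=" + isIpLoose(host)
                    + " tight=" + isIpTight(host));
        }
    }
}
```

Note that \\d{1,3} only bounds the number of digits per octet; a value such as 999.999.999.999 would still match, so a numeric range check would be needed for strict validation.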

> Top Level Domains Indexing / Scoring
> ------------------------------------
>                 Key: NUTCH-439
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch
> Top Level Domains (TLDs) are the last part(s) of the host name in a DNS system. TLDs
> are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three groups: infrastructure,
> generic (such as "com", "edu") and country code TLDs (such as "en", "de", "tr"). Indexing
> the top level domain, and optionally boosting on it, is needed for improving the search results and
> enhancing locality.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
