nutch-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Tue, 10 Jul 2007 14:57:06 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: domain.suffixes_v2.1.patch

> Very nice patch!
Thanks!
> IP_PATTERN - it could be tighter, instead of \\d+ it could use \\d{1,3}
It is now (\\d{1,3}\\.){3}(\\d{1,3}).
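
For illustration, here is a minimal sketch of how the tightened pattern behaves. The class and method names are hypothetical and not taken from the patch; only the regular expression itself comes from the comment above. Note that the pattern bounds the number of digits per group but does not check that each group is in the 0-255 range.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative and not from the patch.
class IpPatternSketch {
    // The tightened pattern quoted above: four groups of 1-3 digits separated by dots.
    static final Pattern IP_PATTERN = Pattern.compile("(\\d{1,3}\\.){3}(\\d{1,3})");

    // matches() requires the whole host string to match, not just a substring.
    static boolean looksLikeIp(String host) {
        return IP_PATTERN.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIp("192.168.0.1"));     // dotted quad
        System.out.println(looksLikeIp("1234.5.6.7"));      // first group has four digits
        System.out.println(looksLikeIp("999.999.999.999")); // digit count is fine, range is not checked
        System.out.println(looksLikeIp("example.com"));     // ordinary host name
    }
}
```

With the earlier \\d+ form, a host such as 1234.5.6.7 would also have been accepted; the {1,3} bound rejects it.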

> the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The reason is that
> it's a common request for enhancement, but specific requirements vary wildly. Some users prefer
> to build a separate DB that holds statistical info and can be used in various steps of the
> work cycle, others still prefer one-time tools such as this one.

DomainStatistics is really a quick hack I wrote to demonstrate the new patch. I've
taken it out of the latest patch. Once the user requirements are settled, we can move on from
there.

Also, you may not want to commit MozillaPublicSuffixListParser.java, but it is good to have
it somewhere public.


> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: domain.suffixes_v2.1.patch, tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch
>
>
> Top-level domains (TLDs) are the last part(s) of the host name in the DNS. TLDs are managed
> by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three groups:
> infrastructure, generic (such as "com", "edu"), and country-code TLDs (such as "uk", "de", "tr").
> Indexing the top-level domain, and optionally boosting on it, is needed to improve search
> results and enhance locality.
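
As a rough illustration of "the last part of the host name", the sketch below takes the final dot-separated label of a host. It is an assumption-laden simplification, not the plugin's implementation: the class and method names are hypothetical, and multi-label suffixes (the "part(s)" case, which needs a suffix list) are deliberately not handled.

```java
import java.util.Locale;

// Hypothetical sketch; not code from the NUTCH-439 patches.
class TldSketch {
    // Returns the last dot-separated label of the host, lower-cased.
    // A single trailing dot (fully qualified form) is ignored.
    // A host with no dot is returned as-is.
    static String lastLabel(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        int dot = h.lastIndexOf('.');
        return dot < 0 ? h : h.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastLabel("lucene.apache.org"));
        System.out.println(lastLabel("www.example.de"));
    }
}
```

A value like this could then be stored in an index field or used as input to a boost, which is the use the issue description has in mind.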

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

