nutch-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Fri, 27 Jul 2007 08:06:03 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.3.patch

bq. TLDScoringFilter contains a misspelled field, tldEnties; it should be renamed to tldEntries
Done!
bq. one of the use cases for the "tld" index field that you mention is that users may search
on it. But in the latest patch this field is added with Field.Index.NO, which makes searching
on it impossible. Also, in order to search on arbitrary Lucene fields Nutch needs a Query
filter, so we would need a TLDQueryFilter, which doesn't exist (yet?). 

Well, in fact NUTCH-445 covers searching on TLDs: we would be able to search site:lucene.apache.org,
site:apache.org, or even site:org. Therefore I think indexing tld fields and a TLDQueryFilter
are not needed. I will delve deeper into NUTCH-445 as soon as I find some time. We can move
the domain indexing functionality to index-basic so that it will be generic enough.
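
To illustrate the kind of matching described above, here is a minimal sketch of how a host name could be expanded into the suffixes that site:lucene.apache.org, site:apache.org and site:org would need to match. The class HostSuffixes and its method are hypothetical names made up for this example; this is not the NUTCH-445 implementation, only one plausible way to derive the terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of Nutch): lists the dot-separated suffixes of a host name. */
final class HostSuffixes {

    /** Returns the lower-cased host itself, followed by each shorter suffix down to the last label. */
    static List<String> suffixes(String host) {
        List<String> result = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        while (!current.isEmpty()) {
            result.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // last label reached, nothing left to strip
            }
            current = current.substring(dot + 1); // drop the leftmost label
        }
        return result;
    }
}
```

If each suffix were indexed as a term of the site field, a query on any of the three forms would match the same document, which is why a separate tld field might not be required for searching.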

bq. using domain names instead of host names - we need to discuss this further, let's create
a separate issue on this. 
We can open issues case by case, since the patches are expected to have major side effects.


> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, tld_plugin_v2.0.patch,
tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, tld_plugin_v2.3.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS. TLDs
are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three groups: infrastructure,
generic (such as "com", "edu") and country code TLDs (such as "uk", "de", "tr"). Indexing
the top level domain and optionally boosting it is needed for improving the search results and
enhancing locality.
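
As a rough illustration of the boosting idea in the description above, the following sketch takes the last label of a host name and multiplies a score by a per-TLD factor. The class TldBoost, its methods and the boost values are assumptions invented for this example, not the code in the attached patches. It also handles only single-label TLDs, so multi-part suffixes are out of scope here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch (not the tld plugin): per-TLD score boosting. */
final class TldBoost {

    // Boost factors keyed by lower-cased TLD; the values are chosen by the caller.
    private final Map<String, Float> boosts = new HashMap<>();

    void setBoost(String tld, float boost) {
        boosts.put(tld.toLowerCase(Locale.ROOT), boost);
    }

    /** Returns the last dot-separated label of the host, ignoring one trailing dot. */
    static String lastLabel(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        int dot = h.lastIndexOf('.');
        return dot < 0 ? h : h.substring(dot + 1);
    }

    /** Returns the configured boost for the host's TLD, or 1.0 when none is configured. */
    float boostFor(String host) {
        Float boost = boosts.get(lastLabel(host));
        return boost == null ? 1.0f : boost;
    }

    /** Scales a score by the boost of the host's TLD. */
    float apply(float score, String host) {
        return score * boostFor(host);
    }
}
```

For example, after setBoost("de", 1.5f), a page on www.example.de with score 2.0 would be scaled to 3.0, while hosts under unconfigured TLDs keep their score unchanged.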

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

