nutch-dev mailing list archives

From "Doug Cutting (JIRA)" <>
Subject [jira] Commented: (NUTCH-445) Domain Indexing / Query Filter
Date Tue, 27 Feb 2007 17:45:05 GMT


Doug Cutting commented on NUTCH-445:

Note that the "site" field is also used for search-time deduplication, and that assumes that
each document has only one value for the field (returned from a Lucene FieldCache with raw
hits, for performance).  So this feature should perhaps use a separate field.
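
To illustrate the single-value assumption, here is a minimal sketch in plain Java. It does not use Lucene or Nutch code: the siteByDoc array is a hypothetical stand-in for a per-document cache that holds exactly one site value per document id, and SiteDedupSketch, dedup, and maxPerSite are names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SiteDedupSketch {
    /**
     * Keeps at most maxPerSite hits per site value, preserving rank order.
     * siteByDoc[doc] is the single site value for that document (assumed model).
     */
    public static List<Integer> dedup(int[] rankedDocIds, String[] siteByDoc, int maxPerSite) {
        Map<String, Integer> seen = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int doc : rankedDocIds) {
            String site = siteByDoc[doc];                   // one lookup, one value
            int count = seen.merge(site, 1, Integer::sum);  // hits seen so far for this site
            if (count <= maxPerSite) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

With several values per document in the same field, a lookup of this shape has no single answer to return, which is the reason for suggesting a separate field.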

That said, I think this should replace the current site-search feature, as it is an improvement
and matches the industry-standard semantics.  So perhaps a "site:" query should search the "domain:" field.

> Domain Indexing / Query Filter
> ------------------------------
>                 Key: NUTCH-445
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch
> Hostnames contain information about the domain of the host, and all of the subdomains.
> Indexing and searching the domains is important for intuitive behavior.
> From DomainIndexingFilter javadoc : 
> Adds the domain(hostname) and all super domains to the index. 
>  * <br> For the 
>  * following will be added to the index : <br> 
>  * <ul>
>  * <li> </li>
>  * <li>apache</li>
>  * <li>org </li>
>  * </ul>
>  * All hostnames are domain names, but not all the domain names are 
>  * hostnames. In the above example hostname lucene is a 
>  * subdomain of, which is itself a subdomain of 
>  * org <br>
>  * 
> Currently the basic indexing filter indexes the hostname in the site field, and query-site
> allows searching the site field. However will not return
>  By indexing the domain, we are able to search domains. Unlike 
>  a search on the site field (indexed by BasicIndexingFilter), searching the 
>  domain field allows us to retrieve to the query 
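
The enumeration described in the quoted javadoc ("the domain(hostname) and all super domains") could be sketched roughly as follows. This is not the code from the attached patches: SuperDomains and its of method are hypothetical names, the hostname in main is only an illustrative input, and the sketch assumes that each super domain is indexed as a full suffix of the hostname, which may differ from the tokens the patch actually emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SuperDomains {
    /** Returns the hostname followed by each of its super domains, most specific first. */
    public static List<String> of(String hostname) {
        List<String> result = new ArrayList<>();
        // Locale.ROOT avoids locale-specific case mapping (e.g. the Turkish dotless i).
        String current = hostname.toLowerCase(Locale.ROOT);
        while (!current.isEmpty()) {
            result.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break;                                // reached the top-level label
            }
            current = current.substring(dot + 1);     // strip the leftmost label
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input only.
        System.out.println(of("lucene.apache.org"));
    }
}
```

If every one of these values is stored in a multi-valued domain field, a query for a super domain would match documents from all hosts beneath it, whereas a field that stores only the exact hostname matches that hostname alone.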

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
