nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Updated] (NUTCH-541) Index url field untokenized
Date Fri, 01 Apr 2011 14:35:06 GMT


Markus Jelsma updated NUTCH-541:

Bulk close of legacy issues:

> Index url field untokenized
> ---------------------------
>                 Key: NUTCH-541
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 1.0.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
> The url field is indexed as Store.YES, Index.TOKENIZED. We also need an untokenized version of the url field in some contexts:
> 1. For deleting duplicates by url (at search time). See NUTCH-455.
> 2. For restricting the search to a certain url (may be used in the case of RSS search, where each entry in the RSS feed is added as a distinct document with possibly the same url).
>    query-url extends FieldQueryFilter so: 
>     Query: url:
>     Parsed: url:"http http-www http-www-apache www www-apache apache org"
>     Translated: +url:"http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org"
> 3. For accessing document(s) in the search servers (using query
> I suggest we add url as in index-basic and implement a query-url-untoken plugin. 
> doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));
> doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
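
To illustrate why the issue asks for both fields, here is a minimal, self-contained sketch in plain Java. It does not use Lucene or Nutch. The class UrlFieldSketch, its methods, and the simple split-on-punctuation tokenizer are all hypothetical stand-ins and do not reproduce the analyzer Nutch actually applies to the url field. The sketch only shows that a tokenized field matches on individual terms, while an untokenized field matches the whole URL exactly, which is what duplicate deletion and per-url restriction need.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene or Nutch code.
class UrlFieldSketch {

    // Assumed stand-in for a tokenizing analyzer:
    // lower-case the URL and split on any non-alphanumeric run.
    static List<String> tokenize(String url) {
        List<String> terms = new ArrayList<>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Plays the role of the tokenized "url" field: term -> document ids.
    private final Map<String, Set<Integer>> tokenized = new HashMap<>();
    // Plays the role of the "url_untoken" field: whole URL -> document ids.
    private final Map<String, Set<Integer>> untokenized = new HashMap<>();

    void add(int docId, String url) {
        for (String term : tokenize(url)) {
            tokenized.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        untokenized.computeIfAbsent(url, k -> new TreeSet<>()).add(docId);
    }

    // Term lookup: matches every document whose URL contains the term.
    Set<Integer> searchToken(String term) {
        return tokenized.getOrDefault(term, Collections.emptySet());
    }

    // Exact lookup: matches only documents with this exact URL.
    Set<Integer> searchExact(String url) {
        return untokenized.getOrDefault(url, Collections.emptySet());
    }

    public static void main(String[] args) {
        UrlFieldSketch index = new UrlFieldSketch();
        index.add(1, "http://www.apache.org/");
        index.add(2, "http://www.apache.org/foundation/");
        index.add(3, "http://www.apache.org/"); // e.g. a second RSS entry with the same url

        System.out.println(index.searchToken("apache"));
        System.out.println(index.searchExact("http://www.apache.org/"));
    }
}
```

In this sketch the term lookup for "apache" returns all three documents, while the exact lookup returns only the two documents that share the same URL, which is the set a duplicate-deletion step would need to inspect.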

This message is automatically generated by JIRA.
For more information on JIRA, see:
