nutch-dev mailing list archives

From "Enis Soztutar (JIRA)"
Subject [jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
Date Tue, 07 Nov 2006 13:16:51 GMT

Enis Soztutar updated NUTCH-389:

    Attachment: urlTokenizer-improved.diff

This is an improvement and a minor bug fix over the previous url tokenizer. This version first
decodes the characters that are percent-encoded (written in hexadecimal) in the url.

For example, the url "file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html" will first be converted
to "file:///tmp/foo baz bar/foo/baz~bar/index.html" by replacing each %20 escape with a space.

A NullPointerException is fixed for the case where the input reader returns null for the url.
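
The decoding step and the null check described above can be sketched roughly as follows. This is only an illustration and not the code in urlTokenizer-improved.diff: the class name UrlUnescapeSketch and the method unescape are hypothetical, and returning an empty string for a null url is an assumption about how the null case might be handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the attached patch.
final class UrlUnescapeSketch {

    /**
     * Replaces each %XX escape with the byte it encodes and decodes the
     * result as UTF-8. Malformed escapes are copied through unchanged.
     * A null url yields an empty string (an assumption of this sketch).
     */
    public static String unescape(String url) {
        if (url == null) {
            return "";
        }
        byte[] in = url.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) in[i + 1], 16);
                int lo = Character.digit((char) in[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // decoded byte
                    i += 2;                    // skip the two hex digits
                    continue;
                }
            }
            out.write(in[i]); // ordinary byte or malformed escape
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(
            unescape("file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html"));
    }
}
```

The escapes are decoded by hand here rather than with java.net.URLDecoder, because URLDecoder also turns '+' into a space, which is form decoding and may not be wanted for url paths.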

Further improvements on the url tokenization can be discussed here. 

> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>                 Key: NUTCH-389
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer-improved.diff, urlTokenizer.diff
> NutchAnalysis.jj tokenizes the input without treating & and _ as token separators,
which is not appropriate in the case of urls. So I have written a url tokenizer which returns
the tokens that match the regular expression [a-zA-Z0-9]. As stated in the specification
which describes the grammar for URIs, URLs can be tokenized with the above expression.
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the "url", "site"
and "host" fields.
> see :
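
For readers without the patch at hand, the tokenization rule in the description can be sketched as below. This is an illustration only: the class UrlTokenizeSketch and the method tokenize are hypothetical names, and the sketch uses java.util.regex rather than a Lucene Tokenizer. It also assumes the character class is applied with a + quantifier, so that each run of letters and digits becomes one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the UrlTokenizer from the patch.
final class UrlTokenizeSketch {

    // Assumption: runs of [a-zA-Z0-9] form tokens; everything else separates.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    /** Returns the alphanumeric runs of the url in order; empty list for null. */
    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        if (url == null) {
            return tokens;
        }
        Matcher m = TOKEN.matcher(url);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both & and _ act as separators here.
        System.out.println(tokenize("http://example.org/foo_bar.html?a=1&b=2"));
    }
}
```

With this rule, "foo_bar" yields the two tokens "foo" and "bar", and the query string "a=1&b=2" yields "a", "1", "b" and "2".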

This message is automatically generated by JIRA.

