nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Bug in existing version of NutchDocumentAnalyzer (Re: [Nutch-dev] Adding title and site to scoring)
Date Wed, 23 Mar 2005 21:42:19 GMT
Piotr Kosiorowski wrote:
> Hello,
> 
> I am attaching the patch in "svn diff" format. I hope it is ok - I do 
[...]

> Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
> ===================================================================
> --- src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java	(revision 158818)
> +++ src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java	(working copy)
> @@ -77,8 +77,9 @@
>    /** Returns a new token stream for text from the named field. */
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>      Analyzer analyzer;
> -    if ("url".equals(fieldName) || ("anchor".equals(fieldName)))
> -      analyzer = ANCHOR_ANALYZER;
> +    if ("url".equals(fieldName) || ("anchor".equals(fieldName))
> +                || ("host".equals(fieldName)) || ("title".equals(fieldName)))
> +            analyzer = ANCHOR_ANALYZER;
>      else
>        analyzer = CONTENT_ANALYZER;

Could somebody confirm/deny my analysis in the previous post, that the 
use of ANCHOR_ANALYZER for "url" is wrong, and the CONTENT_ANALYZER 
should be used instead?
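
To make the question concrete, here is a minimal, self-contained sketch of the field-to-analyzer dispatch. It is not Nutch code: the class FieldAnalyzerChoice, the choose() helper and the two field sets are hypothetical names made up for illustration, and the analyzers are represented only by their names as strings. The first set mirrors the patched condition quoted above; the second shows what the dispatch would look like if "url" were moved over to CONTENT_ANALYZER, which is the change being asked about and not something that has been confirmed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative stand-in for the dispatch in tokenStream(); not actual Nutch code. */
class FieldAnalyzerChoice {

  // Fields that the quoted patch routes to ANCHOR_ANALYZER.
  static final Set<String> PATCHED_ANCHOR_FIELDS =
      new HashSet<>(Arrays.asList("url", "anchor", "host", "title"));

  // Hypothetical alternative: the same set without "url", so that
  // "url" falls through to CONTENT_ANALYZER.
  static final Set<String> ALTERNATIVE_ANCHOR_FIELDS =
      new HashSet<>(Arrays.asList("anchor", "host", "title"));

  /** Returns the name of the analyzer that would be selected for the field. */
  static String choose(Set<String> anchorFields, String fieldName) {
    return anchorFields.contains(fieldName) ? "ANCHOR_ANALYZER" : "CONTENT_ANALYZER";
  }

  public static void main(String[] args) {
    // Print both choices side by side for a few field names.
    for (String field : new String[] {"url", "anchor", "host", "title", "content"}) {
      System.out.println(field
          + ": patched=" + choose(PATCHED_ANCHOR_FIELDS, field)
          + ", alternative=" + choose(ALTERNATIVE_ANCHOR_FIELDS, field));
    }
  }
}
```

Only the "url" row differs between the two columns, so if the analysis holds, the fix would amount to dropping "url" from the condition.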

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

