nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: index segmentation
Date Wed, 08 Jun 2005 03:59:18 GMT
Hi Doug

Thank you for your suggestion.
I have modified my indexing filter. If a page's URL contains the
text-only or graphic flag, the page's content is additionally indexed
under the "textonly" or "graphic" field, respectively.

Here is the code.
public class MyIndexingFilter implements IndexingFilter {

  public Document filter(Document doc, Parse parse, FetcherOutput fo)
    throws IndexingException {
    ... ...

    // If the URL carries the text-only tag, index the text under "textonly".
    if (TEXTONLY_WEBSITE_TAG != null) {
      pattern = compiler.compile(TEXTONLY_WEBSITE_TAG);
      if (matcher.contains(url, pattern)) {
        doc.add(Field.UnStored("textonly", parse.getText()));
      }
    }

    // If the URL carries the graphic tag, index the text under "graphic".
    if (GRAPHIC_WEBSITE_TAG != null) {
      pattern = compiler.compile(GRAPHIC_WEBSITE_TAG);
      if (matcher.contains(url, pattern)) {
        doc.add(Field.UnStored("graphic", parse.getText()));
      }
    }

    // content is indexed, so that it's searchable, but not stored in index
    doc.add(Field.UnStored("content", parse.getText()));

    ... ...
  }

}
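
To make the tagging rule easier to check outside Nutch, here is a minimal standalone sketch of the same URL test. It is only an illustration: it uses java.util.regex in place of the ORO compiler/matcher in the filter, the class UrlTagger and method extraFields are hypothetical names, and the tag patterns "/textonly/" and "/graphic/" are assumed values, since the real TEXTONLY_WEBSITE_TAG and GRAPHIC_WEBSITE_TAG settings are not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mirrors the URL test in the filter.
final class UrlTagger {

    /**
     * Returns the extra field names ("textonly", "graphic") under which a page
     * at this URL would be indexed. A null tag disables the corresponding check.
     */
    static List<String> extraFields(String url, String textOnlyTag, String graphicTag) {
        List<String> fields = new ArrayList<>();
        // find() succeeds if the pattern occurs anywhere in the URL,
        // like the contains() call in the filter.
        if (textOnlyTag != null && Pattern.compile(textOnlyTag).matcher(url).find()) {
            fields.add("textonly");
        }
        if (graphicTag != null && Pattern.compile(graphicTag).matcher(url).find()) {
            fields.add("graphic");
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed tag patterns and example URLs, for illustration only.
        String textOnlyTag = "/textonly/";
        String graphicTag = "/graphic/";
        System.out.println(extraFields("http://example.com/textonly/about.html", textOnlyTag, graphicTag));
        System.out.println(extraFields("http://example.com/graphic/about.html", textOnlyTag, graphicTag));
        System.out.println(extraFields("http://example.com/index.html", textOnlyTag, graphicTag));
    }
}
```

Running it against a handful of crawled URLs shows which pages should end up with a "textonly" or "graphic" field, which can then be compared with what Luke reports.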

And the query filter is simple.
public class TextOnlyVersionQueryFilter extends FieldQueryFilter {

  public TextOnlyVersionQueryFilter() {
    super("textonly");
  }
}

After Nutch crawled the whole website, I tested the index with Luke
(query string "textonly:nutch") and everything looked OK. However,
when I feed the same query string to NutchBean, the results are quite
different and obviously wrong: NutchBean seems to show only the
pages whose depth is 1.

BTW, the "plugin.includes" property is:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|meta)|query-(basic|site|url)|myplugin</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>


Any suggestions?

Regards
/Jack

On 6/8/05, Doug Cutting <cutting@nutch.org> wrote:
> Jack Tang wrote:
> > The problem is when I try to search "scope:textonly" (I expect it will
> > list all page in textonly part ), the result is blank. So, what should
> > I do in order to get the right result?
> 
> If you use RawFieldQueryFilter, then these are non-scoring, filtering
> query clauses (triggered by boost=0).  They do not affect ranking.  They
> must be accompanied by a scoring clause (boost != 0).  This is akin to
> the following at Google:
> 
> http://www.google.com/search?q=filetype%3Apdf
> 
> They could be made into scoring clauses, but that would make searches
> slower.  We could automatically turn one into a scoring clause when
> there are no scoring clauses in a query, if needed.  Is it important
> that you be able to, e.g., find all of the scope:textonly documents,
> with no other qualifications?
> 
> Doug
>
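
The distinction Doug describes between filtering and scoring clauses can be sketched with a toy model. This is not Nutch or Lucene code: ClauseModel, Clause, search and sampleDocs are hypothetical names, every clause is treated as required, and the rule that a query without a scoring clause returns nothing is an assumption drawn from the explanation above and from the blank "scope:textonly" result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: documents are field->text maps, clauses are field:term pairs.
final class ClauseModel {

    /** One query clause. A boost of 0 marks a non-scoring, filter-only clause. */
    static final class Clause {
        final String field;
        final String term;
        final float boost;

        Clause(String field, String term, float boost) {
            this.field = field;
            this.term = term;
            this.boost = boost;
        }

        boolean isScoring() {
            return boost != 0f;
        }

        boolean matches(Map<String, String> doc) {
            String value = doc.get(field);
            return value != null && value.contains(term);
        }
    }

    /** Returns the documents matching all clauses, or nothing if no clause scores. */
    static List<Map<String, String>> search(List<Map<String, String>> docs, List<Clause> query) {
        List<Map<String, String>> hits = new ArrayList<>();
        boolean hasScoringClause = false;
        for (Clause clause : query) {
            if (clause.isScoring()) {
                hasScoringClause = true;
            }
        }
        if (!hasScoringClause) {
            return hits; // assumed behaviour: filter clauses alone select nothing
        }
        for (Map<String, String> doc : docs) {
            boolean matchesAll = true;
            for (Clause clause : query) {
                if (!clause.matches(doc)) {
                    matchesAll = false;
                    break;
                }
            }
            if (matchesAll) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Two made-up documents, one per scope. */
    static List<Map<String, String>> sampleDocs() {
        return List.of(
            Map.of("scope", "textonly", "content", "nutch search"),
            Map.of("scope", "graphic", "content", "nutch crawler"));
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = sampleDocs();
        // Filter clause alone, like "scope:textonly": prints 0
        System.out.println(search(docs, List.of(new Clause("scope", "textonly", 0f))).size());
        // Scoring clause plus filter, like "nutch scope:textonly": prints 1
        System.out.println(search(docs, List.of(
            new Clause("content", "nutch", 1f),
            new Clause("scope", "textonly", 0f))).size());
    }
}
```

In this model the filter clause only narrows a result set that some scoring clause has already selected, which is why adding an ordinary search term to the query changes the outcome.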
