nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: index segmentation
Date Wed, 08 Jun 2005 09:28:53 GMT
Hi Doug

I am not familiar with query "field"s in Nutch. Are they the same as fields
in Lucene? I suppose they are.
My question came up while debugging. In the method
filter(Query input, BooleanQuery output) of the FieldQueryFilter class,
one statement looks like this:

      // skip non-matching clauses (line 54 here!!)
      if (!c.getField().equals(field))
        continue;
        
I enabled my plugin and fed it the query string "textonly:nutch". "field" is
"textonly", which is correct. However, why is c.getField() always
"DEFAULT"? Is there something I need to take care of in the plugin config file?

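To make the symptom concrete, here is a minimal sketch of the field check quoted above. The classes FieldClauseSketch and Clause and the method select are hypothetical stand-ins written for illustration, not the Nutch classes; the sketch only assumes that each parsed clause carries a field name that is compared against the filter's field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, NOT the Nutch classes: they only model the
// field comparison from the quoted statement.
class FieldClauseSketch {

  static final String DEFAULT_FIELD = "DEFAULT";

  // A parsed query clause: a field name plus a term.
  static final class Clause {
    final String field;
    final String term;

    Clause(String field, String term) {
      this.field = field;
      this.term = term;
    }

    String getField() {
      return field;
    }
  }

  // Keeps only the clauses whose field equals the filter's field.
  static List<Clause> select(List<Clause> clauses, String field) {
    List<Clause> out = new ArrayList<Clause>();
    for (Clause c : clauses) {
      // skip non-matching clauses
      if (!c.getField().equals(field))
        continue;
      out.add(c);
    }
    return out;
  }

  public static void main(String[] args) {
    // Expected case: the clause is attached to the "textonly" field.
    List<Clause> expected = Arrays.asList(new Clause("textonly", "nutch"));
    // Reported case: the clause arrives with the field "DEFAULT".
    List<Clause> reported = Arrays.asList(new Clause(DEFAULT_FIELD, "nutch"));

    System.out.println(select(expected, "textonly").size()); // prints 1
    System.out.println(select(reported, "textonly").size()); // prints 0
  }
}
```

In the reported case every clause is skipped, so the filter contributes nothing to the output query.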
My plugin.xml is:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   ... ...

   <extension id="com.ccs.nutch.searcher.TextOnlyVersionQueryFilter"
              name="TextOnlyVersion Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="TextOnlyVersionQueryFilter"
                      class="com.ccs.nutch.searcher.TextOnlyVersionQueryFilter"
                      fields="textonly"/>
   </extension>

   <extension id="com.ccs.nutch.searcher.GraphicVersionQueryFilter"
              name="GraphicVersion Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="GraphicVersionQueryFilter"
                      class="com.ccs.nutch.searcher.GraphicVersionQueryFilter"
                      fields="graphic"/>
   </extension>
</plugin>


Regards
/Jack

On 6/8/05, Jack Tang <himars@gmail.com> wrote:
> Hi Doug
> 
> Thank you for your suggestion.
> I modified my indexing filter: if the URL of a page contains the
> text-only or graphic flag, the page content is indexed under the
> corresponding "textonly" or "graphic" field.
> 
> Here is the code.
> public class MyIndexingFilter {
> 
>  public Document filter(Document doc, Parse parse, FetcherOutput fo)
>    throws IndexingException {
>        ... ...
> 
>    if (TEXTONLY_WEBSITE_TAG != null) {
>      pattern = compiler.compile(TEXTONLY_WEBSITE_TAG);
>      if (matcher.contains(url, pattern))
>        doc.add(Field.UnStored("textonly", parse.getText()));
>    }
> 
>    if (GRAPHIC_WEBSITE_TAG != null) {
>      pattern = compiler.compile(GRAPHIC_WEBSITE_TAG);
>      if (matcher.contains(url, pattern))
>        doc.add(Field.UnStored("graphic", parse.getText()));
>    }
> 
>    // content is indexed, so that it's searchable, but not stored in index
>    doc.add(Field.UnStored("content", parse.getText()));
> 
>        ... ...
>    }
> 
> }
> 
> And the query filter is simple.
> public class TextOnlyVersionQueryFilter extends FieldQueryFilter {
> 
>   public TextOnlyVersionQueryFilter() {
>     super("textonly");
>   }
> }
> 
> After Nutch crawled the whole website, I tested the index with Luke
> (query string "textonly:nutch") and everything was OK. However,
> when I feed the same query string into NutchBean, the result is quite
> different and obviously wrong. It seems NutchBean only shows
> the pages whose depth is 1.
> 
> BTW: the "plugin.includes" property is
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|meta)|query-(basic|site|url)|myplugin</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.  By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins.
>  </description>
> </property>
> 
> 
> Any suggestion?
> 
> Regards
> /Jack
> 
> On 6/8/05, Doug Cutting <cutting@nutch.org> wrote:
> > Jack Tang wrote:
> > > The problem is when I try to search "scope:textonly" (I expect it will
> > > list all page in textonly part ), the result is blank. So, what should
> > > I do in order to get the right result?
> >
> > If you use RawFieldQueryFilter, then these are non-scoring, filtering
> > query clauses (triggered by boost=0).  They do not affect ranking.  They
> > must be accompanied by a scoring clause (boost != 0).  This is akin to
> > the following at Google:
> >
> > http://www.google.com/search?q=filetype%3Apdf
> >
> > They could be made into scoring clauses, but that would make searches
> > slower.  We could automatically turn one into a scoring clause when
> > there are no scoring clauses in a query, if needed.  Is it important
> > that you be able to, e.g., find all of the scope:textonly documents,
> > with no other qualifications?
> >
> > Doug
> >
>
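
A minimal sketch of the rule Doug describes in the quoted reply, assuming only what he states: a clause with boost 0 filters but does not score, and a query needs at least one scoring clause. FilterClauseSketch, Clause and hasScoringClause are hypothetical names for illustration, not Nutch or Lucene API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, NOT Nutch or Lucene code.
class FilterClauseSketch {

  static final class Clause {
    final String field;
    final String term;
    final float boost;

    Clause(String field, String term, float boost) {
      this.field = field;
      this.term = term;
      this.boost = boost;
    }

    // boost == 0 marks a non-scoring, filtering clause.
    boolean isScoring() {
      return boost != 0.0f;
    }
  }

  // True if at least one clause can contribute to ranking.
  static boolean hasScoringClause(List<Clause> clauses) {
    for (Clause c : clauses) {
      if (c.isScoring())
        return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // "scope:textonly" alone: only a filtering clause.
    List<Clause> filterOnly = Arrays.asList(new Clause("scope", "textonly", 0.0f));
    // "nutch scope:textonly": a scoring clause plus a filtering clause.
    List<Clause> mixed = Arrays.asList(
        new Clause("DEFAULT", "nutch", 1.0f),
        new Clause("scope", "textonly", 0.0f));

    System.out.println(hasScoringClause(filterOnly)); // prints false
    System.out.println(hasScoringClause(mixed));      // prints true
  }
}
```

Under this model a query made only of filtering clauses has nothing to rank, which matches the blank result reported for "scope:textonly".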
