nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: index segmentation
Date Thu, 09 Jun 2005 04:09:18 GMT
Hi

Could someone help me with this thread?
I notice that the translated query string for "textonly:nutch" is the
same as for "textonly nutch". So, any hints on writing a FieldQueryFilter?
Many thanks

/Jack



On 6/8/05, Jack Tang <himars@gmail.com> wrote:
> Hi guys.
> 
> I used Nutch's Query class to translate my query string, and here is the result:
> ----------------------------------------------------------------------------------------------------------------------
> Query: textonly:nutch
> Parsed: nutctextonly nutch
> Translated: +(url:nutctextonly^4.0 anchor:nutctextonly^2.0
> content:nutctextonly) +(url:nutch^4.0 anchor:nutch^2.0 content:nutch)
> url:"nutctextonly nutch"~2147483647^4.0 anchor:"nutctextonly
> nutch"~4^2.0 content:"nutctextonly nutch"~2147483647
> ---------------------------------------------------------------------------------------------------------------------
> 
> It seems the "textonly" field is not searched at all, right?
> 
> Regards
> /Jack
> 
> 
> On 6/8/05, Jack Tang <himars@gmail.com> wrote:
> > Hi Doug
> >
> > I don't know how query "field"s work in Nutch. Are they the same as in
> > Lucene? I suppose so.
> > This question came up during some deep debugging. In the method
> > filter(Query input, BooleanQuery output) in the FieldQueryFilter class,
> > one statement looks like this:
> >
> >      // skip non-matching clauses (line 54 here!!)
> >      if (!c.getField().equals(field))
> >        continue;
> >
> > I enabled my plugin and fed it the "textonly:nutch" query string. "field" is
> > "textonly", which is right. However, why is c.getField() always
> > "DEFAULT"? Is there something I should take care of in the plugin config file?
> >
> > my plugin.xml is:
> > <?xml version="1.0" encoding="UTF-8"?>
> > <plugin
> >   ... ...
> >
> >   <extension id="com.ccs.nutch.searcher.TextOnlyVersionQueryFilter"
> >              name="TextOnlyVersion Query Filter"
> >              point="org.apache.nutch.searcher.QueryFilter">
> >      <implementation id="TextOnlyVersionQueryFilter"
> >                      class="com.ccs.nutch.searcher.TextOnlyVersionQueryFilter"
> >                      fields="textonly"/>
> >   </extension>
> >
> >   <extension id="com.ccs.nutch.searcher.GraphicVersionQueryFilter"
> >              name="GraphicVersion Query Filter"
> >              point="org.apache.nutch.searcher.QueryFilter">
> >      <implementation id="GraphicVersionQueryFilter"
> >                      class="com.ccs.nutch.searcher.GraphicVersionQueryFilter"
> >                      fields="graphic"/>
> >   </extension>
> > </plugin>
> >
> >
> > Regards
> > /Jack
> >
> > On 6/8/05, Jack Tang <himars@gmail.com> wrote:
> > > Hi Doug
> > >
> > > Thank you for your suggestion.
> > > I have modified my indexing filter. If the URL of a page contains the
> > > text-only or graphic flag, the corresponding content is indexed under
> > > the "textonly" or "graphic" field.
> > >
> > > Here is the code.
> > > public class MyIndexingFilter {
> > >
> > >   public Document filter(Document doc, Parse parse, FetcherOutput fo)
> > >     throws IndexingException {
> > >     ... ...
> > >
> > >     if (TEXTONLY_WEBSITE_TAG != null) {
> > >       pattern = compiler.compile(TEXTONLY_WEBSITE_TAG);
> > >       if (matcher.contains(url, pattern))
> > >         doc.add(Field.UnStored("textonly", parse.getText()));
> > >     }
> > >
> > >     if (GRAPHIC_WEBSITE_TAG != null) {
> > >       pattern = compiler.compile(GRAPHIC_WEBSITE_TAG);
> > >       if (matcher.contains(url, pattern))
> > >         doc.add(Field.UnStored("graphic", parse.getText()));
> > >     }
> > >
> > >     // content is indexed, so that it's searchable, but not stored in index
> > >     doc.add(Field.UnStored("content", parse.getText()));
> > >
> > >     ... ...
> > >   }
> > >
> > > }
> > >
> > > And the query filter is simple.
> > > public class TextOnlyVersionQueryFilter extends FieldQueryFilter {
> > >
> > >   public TextOnlyVersionQueryFilter() {
> > >     super("textonly");
> > >   }
> > > }
> > >
> > > After Nutch crawled the whole website, I tested the index with Luke
> > > (query string "textonly:nutch") and everything was OK. However,
> > > when I feed the same query string into NutchBean, the result is quite
> > > different and obviously wrong. It seems NutchBean only shows
> > > the pages whose depth is 1.
> > >
> > > BTW: the "plugin.includes" property is
> > > <property>
> > >  <name>plugin.includes</name>
> > >  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|meta)|query-(basic|site|url)|myplugin</value>
> > >  <description>Regular expression naming plugin directory names to
> > >  include.  Any plugin not matching this expression is excluded.  By
> > >  default Nutch includes crawling just HTML and plain text via HTTP,
> > >  and basic indexing and search plugins.
> > >  </description>
> > > </property>
> > >
> > >
> > > Any suggestions?
> > >
> > > Regards
> > > /Jack
> > >
> > > On 6/8/05, Doug Cutting <cutting@nutch.org> wrote:
> > > > Jack Tang wrote:
> > > > > > The problem is when I try to search "scope:textonly" (I expect it will
> > > > > > list all pages in the textonly part), the result is blank. So, what should
> > > > > > I do in order to get the right result?
> > > >
> > > > If you use RawFieldQueryFilter, then these are non-scoring, filtering
> > > > query clauses (triggered by boost=0).  They do not affect ranking.  They
> > > > must be accompanied by a scoring clause (boost != 0).  This is akin to
> > > > the following at Google:
> > > >
> > > > http://www.google.com/search?q=filetype%3Apdf
> > > >
> > > > They could be made into scoring clauses, but that would make searches
> > > > slower.  We could automatically turn one into a scoring clause when
> > > > there are no scoring clauses in a query, if needed.  Is it important
> > > > that you be able to, e.g., find all of the scope:textonly documents,
> > > > with no other qualifications?
> > > >
> > > > Doug
> > > >
> > >
> >
>
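
For readers following the thread, here is a minimal sketch of the clause check quoted above from FieldQueryFilter. It is not Nutch code: the Clause class, the matching() helper, and both example parse results are hypothetical, written only to show what the quoted `equals(field)` test does if the parser hands over clauses whose field is "DEFAULT". Whether the parser really produces those clauses for "textonly:nutch", and why, is not established in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldClauseSketch {

    /** Hypothetical stand-in for a parsed query clause: a field name plus a term. */
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    /** Mirrors the quoted check: keep only clauses whose field equals the filter's field. */
    static List<Clause> matching(List<Clause> input, String field) {
        List<Clause> output = new ArrayList<>();
        for (Clause c : input) {
            // skip non-matching clauses
            if (!c.field.equals(field))
                continue;
            output.add(c);
        }
        return output;
    }

    public static void main(String[] args) {
        // Assumption: "textonly" was not recognized as a field, so both words
        // arrive as plain terms in the default field.
        List<Clause> unrecognized = Arrays.asList(
            new Clause("DEFAULT", "textonly"),
            new Clause("DEFAULT", "nutch"));

        // Assumption: "textonly" was recognized as a field.
        List<Clause> recognized = Arrays.asList(
            new Clause("textonly", "nutch"));

        System.out.println(matching(unrecognized, "textonly").size()); // prints 0
        System.out.println(matching(recognized, "textonly").size());   // prints 1
    }
}
```

Under these assumptions the filter for "textonly" never sees a clause it accepts in the first case, which would be consistent with the "textonly" field not appearing in the translated query.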

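Doug's point about non-scoring filter clauses can be pictured with a toy model. This is only a sketch of the behavior he describes, not how Nutch or Lucene implement it: the Doc class, the search() method, and the sample documents are all hypothetical. The assumption built into the model is that a scoring term selects the candidate hits and a filter value can only narrow them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterClauseSketch {

    /** Hypothetical document: an id, its body text, and a scope tag. */
    static final class Doc {
        final String id;
        final String content;
        final String scope;

        Doc(String id, String content, String scope) {
            this.id = id;
            this.content = content;
            this.scope = scope;
        }
    }

    /**
     * Toy model: scoringTerm (may be null) selects candidate hits;
     * filterScope (may be null) only narrows that candidate set.
     */
    static List<String> search(List<Doc> docs, String scoringTerm, String filterScope) {
        List<String> hits = new ArrayList<>();
        if (scoringTerm == null)
            return hits; // no scoring clause: there are no candidates to filter
        for (Doc d : docs) {
            if (!d.content.contains(scoringTerm))
                continue; // not matched by the scoring clause
            if (filterScope != null && !filterScope.equals(d.scope))
                continue; // removed by the filter clause
            hits.add(d.id);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("a", "nutch crawler", "textonly"),
            new Doc("b", "nutch search", "graphic"),
            new Doc("c", "lucene index", "textonly"));

        // Filter clause alone, as in "scope:textonly": blank result in this model.
        System.out.println(search(docs, null, "textonly"));    // prints []

        // Filter clause accompanied by a scoring term.
        System.out.println(search(docs, "nutch", "textonly")); // prints [a]
    }
}
```

In this model, turning the filter into a scoring clause would mean letting it select candidates on its own, which is the trade-off Doug mentions.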