nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Understanding mapping of field characteristics to index structure
Date Mon, 06 Aug 2012 22:12:07 GMT
Hi,

Tokenization depends on whether an analyzer is used for the field (non-primitive types), and
how the value is tokenized depends on which tokenizer is defined. Tokenizing a hostname doesn't
really make sense with the default available tokenizers, but you can use a KeywordTokenizer with
a WordDelimiterFilter to split it into domains (TLD, SLD, etc.). But having a TLD in the same
field isn't very useful for boosting and query-time analysis of search words: people usually
don't search for a TLD, and if they do it should be boosted separately.
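
As a rough sketch of the KeywordTokenizer plus WordDelimiterFilter idea, a field type along these lines could go in schema.xml. The names "host_parts" and "host_tokenized" are hypothetical and not part of the Nutch schema, and the filter options shown are just one plausible choice:

```xml
<!-- Hypothetical field type; the name "host_parts" is an assumption, not from the Nutch schema. -->
<fieldType name="host_parts" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emit the whole hostname as one token, e.g. "www.example.com" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Split that token on non-alphanumeric delimiters such as "." and keep the original too -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above -->
<field name="host_tokenized" type="host_parts" indexed="true" stored="false"/>
```

With this setup the TLD still ends up in the same field as the other parts, so boosting it separately would need its own field.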

About the Solr4 schema, it wasn't introduced as a Solr4 compatible version of the default
schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4.
The only change I can think of is adding the version field that is mandatory for SolrCloud. The
schema version is 1.5, which the default schema already has.
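
For illustration, the version field addition might look like the fragment below. This is a sketch: it assumes the field is named `_version_`, as in the Solr 4 example schema, and that a field type called "long" is already defined in schema.xml:

```xml
<!-- Assumes a "long" field type (e.g. solr.TrieLongField) is already defined in the schema. -->
<field name="_version_" type="long" indexed="true" stored="true"/>
```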

Cheers


 
 
-----Original message-----
> From:Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
> Sent: Tue 07-Aug-2012 00:03
> To: dev@nutch.apache.org
> Subject: Re: Understanding mapping of field characteristics to index structure
> 
> Mmmm...
> 
> I think I opened a small can of worms here regarding consistency
> between schema.xml and schema-solr4.xml.
> 
> There are discrepancies between some fields as to their structural
> characteristics. This is something which I think we should make
> consistent between schemas... no?
> 
> An example would be the content field (used in index-basic) which
> appears as stored and indexed in schema-solr4.xml but not stored in
> schema.xml
> 
> Lewis
> 
> On Mon, Aug 6, 2012 at 10:50 PM, Lewis John Mcgibbney
> <lewis.mcgibbney@gmail.com> wrote:
> > Hi,
> >
> > Simple question but currently unclear to me...
> >
> > I know if a field e.g. 'host' is going to be stored and/or indexed as
> > all I need to do is look this up or define it within my schema,
> > however what about tokenised? This seems (to me anyway) to be shrouded
> > in mystery :0|
> >
> > Any thoughts? Thank you
> >
> > Best
> > Lewis
> >
> > --
> > Lewis
> 
> 
> 
> -- 
> Lewis
> 
