lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Multiple Words in String
Date Mon, 04 Apr 2011 01:54:47 GMT
Short form:
I think you're going down a rabbit-hole and should just
use synonyms and forget about it.

I'm particularly thinking that a general-purpose solution
that somehow breaks up or combines adjacent tokens
will have consequences that pop out other places that
you don't want and you'll have to fix *that*. I can't think
of a way to do this that wouldn't run that danger.

Long form, think of it as a sermon, it's Sunday after all.

This is the point, in my experience, where you have to ask your
business people "what's it worth to you"? You can handle
any case that comes up similar to the examples you've shown
by adding it to your synonyms file - compressing any pair
into its joined form (as a synonym) - and be done with it. This is
a very straightforward approach that has predictable consequences.

Or you can mess around, possibly for quite some time, trying
to find a general-purpose solution that will almost inevitably
lead to unanticipated behavior that you'll then spend lots of time
trying to chase down, time you could have spent putting in
features that your users will actually notice.

Here's a test. Ask your business people to create a list of all the
pairs they want to see treated like this. If their response is any
variant of "we don't have time to do that" then even *they* must
not think it's very important <nasty grin>. And if they do, put
it in your synonyms file and be a hero....
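
As a minimal sketch of that last step: assuming the business folks hand
over their list as (first word, second word) pairs, a few lines of Python
could turn it into comma-separated equivalence lines, which is one of the
forms a Solr synonyms file accepts. The function name synonym_lines and
the sample pairs are hypothetical, for illustration only.

```python
# Hypothetical helper: turn a business-supplied list of word pairs into
# lines for a Solr synonyms file (comma-separated equivalent terms).

def synonym_lines(pairs):
    """Return one 'split form, joined form' line per (first, second) pair."""
    lines = []
    seen = set()
    for first, second in pairs:
        first, second = first.strip().lower(), second.strip().lower()
        if not first or not second:
            continue  # skip incomplete entries
        line = f"{first} {second}, {first}{second}"
        if line not in seen:  # drop duplicates, keep the original order
            seen.add(line)
            lines.append(line)
    return lines


if __name__ == "__main__":
    # Sample pairs taken from the examples in this thread.
    pairs = [("micro", "soft"), ("car", "wash"), ("book", "keeper")]
    print("\n".join(synonym_lines(pairs)))
    # prints:
    # micro soft, microsoft
    # car wash, carwash
    # book keeper, bookkeeper
```

The output would be appended to whatever file the SynonymFilterFactory
points at (syn/index_synonyms.txt in the schema quoted below). Since that
filter sits in the index-time analyzer, existing documents would need to
be reindexed before the new entries take effect.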

Evil thoughts aside, I'm dead serious. This is the kind of rabbit-hole
that development efforts go down and that, in all probability, adds almost
zero *value* to the product. There's a way to handle 95% of the cases
that's very easy to implement. It's already there in Solr.

Historically, we in the programming field have done a very poor job
of making it clear to the business folks that every such request has
not only an implementation cost (and we all too often don't include
debugging/maintenance in that cost) but an opportunity cost. We owe it
to the business folks *and ourselves* to clearly explain to them the
cost and let them make the decision whether it's worth it. A decision
based on information. And understand that I'm not knocking the
business folks here. We haven't given them the consequences to weigh,
so how can we fault their decisions?

OK, sermon over <G>. I've just too often said "yes, we can do that"
without thinking to add "and it'll cost 3 weeks of development effort".
Eventually I figured out that adding the estimate and letting the business
folks know what I wouldn't be able to get to because of that time
spent led to "Oh, never mind".....

Best
Erick

P.S. OK, it's late Sunday night and I feel like writing long, involved
responses that aren't entirely on-topic....

On Sun, Apr 3, 2011 at 9:04 PM, Chris Fauerbach <chrisfauerbach@gmail.com> wrote:

> It's not a specific case only (e.g. microsoft.com), but it's really a
> multi-word issue.
>
> carwash, bookkeeper etc...
>
> I'm ultimately looking for a schema for search and retrieve that's heavily
> focused on 'names'. These are people's names, business names etc., not
> content like large text fields, web sites or anything like that, but
> business data that I'm very successfully receiving using dataimport
> handlers. It's these special cases that are really tripping me up. My
> business folks keep coming up with them!
>
>
> Chris Fauerbach
> chrisfauerbach@gmail.com
>
>
> On Sun, Apr 3, 2011 at 6:51 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > Is this a general question or specific? You can handle specific ones by
> > using synonyms.
> >
> > But the general case, that is, treating any pair of adjacent tokens as
> > a single token, seems fraught with unintended consequences, but
> > you know your problem space better than I do.
> >
> > Best
> > Erick
> >
> > On Sat, Apr 2, 2011 at 2:21 PM, Chris Fauerbach <chrisfauerbach@gmail.com> wrote:
> >
> > > Good afternoon everyone!
> > > I am stumped, and I would love some help. I'm new to Solr/Lucene,
> > > but I have thrown myself into it, so I think I have a solid
> > > understanding. Using the analysis tool in the admin interface, I see
> > > these words stemmed and processed as I assume they would be, so I'm
> > > stuck.
> > >
> > > In my index, I have two documents, each with a text field, and here
> > > are example values
> > >
> > > 1) microsoft.com
> > > 2) micro soft
> > >
> > > I want to do a search using microsoft or "micro soft" and find both.
> > > I'm using the dismax interface, the fields are properly listed in the
> > > config, and I can find both records, but never at the same time.
> > > Here's my schema.xml for my text field, any thoughts on what I can do
> > > to find these together?
> > >
> > >
> > >    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >      <analyzer type="index">
> > >        <tokenizer class="solr.StandardTokenizerFactory"/>
> > >        <filter class="solr.LowerCaseFilterFactory"/>
> > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > >        <filter class="solr.WordDelimiterFilterFactory"
> > >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> > >                preserveOriginal="1"/>
> > >        <filter class="solr.SynonymFilterFactory"
> > >                synonyms="syn/index_synonyms.txt" ignoreCase="true" expand="true"/>
> > >        <filter class="solr.EdgeNGramFilterFactory"
> > >                minGramSize="2" maxGramSize="15" side="front"/>
> > >        <filter class="solr.EdgeNGramFilterFactory"
> > >                minGramSize="2" maxGramSize="15" side="back"/>
> > >        <filter class="solr.SnowballPorterFilterFactory"
> > >                language="English" protected="protwords.txt"/>
> > >      </analyzer>
> > >      <analyzer type="query">
> > >        <tokenizer class="solr.StandardTokenizerFactory"/>
> > >        <filter class="solr.LowerCaseFilterFactory"/>
> > >        <filter class="solr.EdgeNGramFilterFactory"
> > >                minGramSize="2" maxGramSize="15" side="front"/>
> > >        <filter class="solr.EdgeNGramFilterFactory"
> > >                minGramSize="2" maxGramSize="15" side="back"/>
> > >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                words="stopwords.txt" enablePositionIncrements="true"/>
> > >        <filter class="solr.WordDelimiterFilterFactory"
> > >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> > >                preserveOriginal="1"/>
> > >        <filter class="solr.SnowballPorterFilterFactory"
> > >                language="English" protected="protwords.txt"/>
> > >      </analyzer>
> > >    </fieldType>
> > >
> >
>
