lucene-solr-user mailing list archives

From hermida <leandro.herm...@gmail.com>
Subject how to do auto-suggest case-insensitive match and return original case field values
Date Fri, 04 Dec 2009 18:22:25 GMT

Hi everyone,

New to forum and to Solr, doing my first major project with it and enjoying
it so far, great software.

In my web application I want to set up auto-suggest-as-you-type
functionality that searches case-insensitively yet returns the terms in
their original case.  TermsComponent doesn't seem able to do this, as it can
only return the lowercased indexed terms you are searching against, not the
original-case terms.

There was one post on this forum
http://old.nabble.com/Auto-suggest..-how-to-do-mixed-case-td24106666.html#a24143981
where someone asked the same question, and the reply was:

There is no way to do this right now using TermsComponent. You can index
lower case terms and store the mixed case terms. Then you can use a prefix
query which will return documents (and hence stored field values).

That got me started: I set out to use a regular Solr query instead of
TermsComponent to try to do this.  I did the following, as suggested:

<fieldType name="test" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="test_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="test" type="test" indexed="false" stored="true"
multiValued="true" />
<field name="test_lc" type="test_lc" indexed="true"  stored="false"
multiValued="true" />

And used copyField to populate the test_lc field:

<copyField source="test" dest="test_lc"/>

This is the easy part (the forum user didn't explain the hard part!).  It is
very hard to get the same information that TermsComponent returns using the
regular Solr query functionality.  For example:

http://localhost:8983/solr/terms?terms.fl=test_lc&terms.prefix=a&terms.sort=count&terms.limit=5&omitHeader=true

<int name="a-kinase anchor protein 13">15</int>
<int name="accn5">6</int>
<int name="actin-binding">3</int>
<int name="activator">1</int>
<int name="agie-bp1">1</int>

This response is sorted by, and includes, the term frequency counts in the
index.  How does one get the same information with a regular Solr query?
I set up the following prefix query, searching on the indexed lowercased
field and returning the stored original-case field:

http://localhost:8983/solr/select?fl=test&q=test_lc%3Aa*&sort=score+desc&rows=5&omitHeader=true

<doc>
  <arr name="test">
    <str>3D-structure</str>
    <str>acetylation</str>
    <str>alternative promoter usage</str>
    <str>HLC-7</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>alternative splicing</str>
    <str>complete proteome</str>
    <str>DNA-binding</str>
    <str>RACK1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>acetylation</str>
    <str>AIG21</str>
    <str>WD repeat</str>
    <str>GNB2L1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>3D-structure</str>
    <str>apoptosis</str>
    <str>cathepsin G-like 1</str>
    <str>ATSGL1</str>
    <str>CTLA-1</str>
  </arr>
</doc>
<doc>
  <arr name="test">
    <str>autoantigen Ge-1</str>
    <str>autoantigen RCD-8</str>
    <str>HERV-H LTR-associating protein 3</str>
    <str>HHLA3</str>
  </arr>
</doc>

I can see how to process this in my front-end app to extract the original
terms starting with the prefix letter(s) used in the query, but there are
still some major problems when compared to TermsComponent:

- How do I make sure my auto-suggest list is at least a certain number of
terms long?  Using rows of course doesn't work like terms.limit, because
rows limits documents rather than terms: the same term can appear in several
returned docs, and these duplicates collapse into one suggestion.
- How do I get term frequency counts like TermsComponent does?  I looked at
faceting but I don't understand how to get the TermsComponent behavior with
it.
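
For illustration, the front-end processing described above could look
roughly like the following Python sketch.  It assumes the stored "test"
values of each returned document have already been parsed into lists of
strings; the function name suggest and its parameters are hypothetical, not
part of Solr.  The counts cover only the documents actually fetched, so
they approximate per-document frequency within the result page rather than
the index-wide counts TermsComponent returns.

```python
from collections import Counter


def suggest(docs, prefix, limit=5):
    """Return up to `limit` (term, count) pairs in original case.

    `docs` is a list of lists: the stored values of the `test` field for
    each document returned by the prefix query (assumed already parsed).
    """
    needle = prefix.lower()
    counts = Counter()
    for values in docs:
        # Count each matching term once per document, so the count
        # approximates document frequency within the fetched docs only.
        matching = {v for v in values if v.lower().startswith(needle)}
        counts.update(matching)
    # Highest count first, then case-insensitive alphabetical order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))
    return ranked[:limit]


# Field values copied from the five documents in the response above.
docs = [
    ["3D-structure", "acetylation", "alternative promoter usage", "HLC-7"],
    ["alternative splicing", "complete proteome", "DNA-binding", "RACK1"],
    ["acetylation", "AIG21", "WD repeat", "GNB2L1"],
    ["3D-structure", "apoptosis", "cathepsin G-like 1", "ATSGL1", "CTLA-1"],
    ["autoantigen Ge-1", "autoantigen RCD-8",
     "HERV-H LTR-associating protein 3", "HHLA3"],
]

for term, count in suggest(docs, "a"):
    print(term, count)
```

With this approach the number of suggestions cannot be controlled through
rows alone: if fewer than limit distinct terms come back, the client would
have to request more rows and repeat the extraction.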

Sorry for the long message, just wanted to fully explain, thanks for any
help!

leandro

-- 
View this message in context: http://old.nabble.com/how-to-do-auto-suggest-case-insensitive-match-and-return-original-case-field-values-tp26636365p26636365.html
Sent from the Solr - User mailing list archive at Nabble.com.

