lucene-solr-user mailing list archives

From Joel Nylund <jnyl...@yahoo.com>
Subject Re: # in query
Date Tue, 08 Dec 2009 19:25:08 GMT
ok, I just realized I was using the Luke handler; I didn't know there was
a fat client, and I assume that's what you are talking about.

I downloaded lukeall.jar, ran it, pointed it at my index, and found the
document in question. I didn't see how it was tokenized, but I clicked
the "reconstruct & edit" button.

This gives me a tab that shows the tokenized content per field; for this
field it shows:


"s|s, ecapsym|myspace, golb|blog"

title is: "#######'s myspace blog"
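The reversed|original pairs above are what ReversedWildcardFilterFactory produces when withOriginal="true": each token is emitted twice, once reversed and internally tagged with a U+0001 marker character (which the Luke handler later renders as "#1;"), and once as-is. A rough sketch of that behavior (a simplification for illustration, not Lucene's actual code):

```python
# Marker character that ReversedWildcardFilter prepends to reversed tokens;
# Luke's handler output displays it as "#1;".
MARKER = "\u0001"

def reverse_with_original(tokens):
    """Emit a marker-prefixed reversed copy of each token plus the original."""
    out = []
    for tok in tokens:
        out.append(MARKER + tok[::-1])  # reversed copy, marker-prefixed
        out.append(tok)                 # original kept (withOriginal="true")
    return out

print(reverse_with_original(["s", "myspace", "blog"]))
# ['\x01s', 's', '\x01ecapsym', 'myspace', '\x01golb', 'blog']
```

This matches the term counts in the Luke output further down, where "#1;golb" and "blog" have identical frequencies.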

schema is:

  <!-- A general unstemmed text field that indexes tokens normally and  
also
          reversed (via ReversedWildcardFilterFactory), to enable more  
efficient
	 leading wildcard queries. -->
     <fieldType name="text_rev" class="solr.TextField"  
positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true"  
words="stopwords.txt" enablePositionIncrements="true" />
         <filter class="solr.WordDelimiterFilterFactory"  
generateWordParts="1" generateNumberParts="1" catenateWords="1"  
catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.ReversedWildcardFilterFactory"  
withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2"  
maxFractionAsterisk="0.33"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory"  
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true"
                 words="stopwords.txt"
                 enablePositionIncrements="true"
                 />
         <filter class="solr.WordDelimiterFilterFactory"  
generateWordParts="1" generateNumberParts="1" catenateWords="0"  
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
         <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
     </fieldType>


	<field name="textTitle" type="text_rev" indexed="true" stored="true"  
required="false" multiValued="false"/>
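That schema also explains where the # characters go: WordDelimiterFilterFactory splits each whitespace token into alphanumeric subwords and discards punctuation such as # and the apostrophe entirely, so "#######'s" indexes as just "s". A simplified simulation of the index-time chain (an approximation, not Solr's actual implementation; the reversal step is omitted):

```python
import re

def analyze(text):
    """Approximate the text_rev index-time analyzer for illustration."""
    tokens = text.split()  # WhitespaceTokenizerFactory: split on whitespace
    out = []
    for tok in tokens:
        # WordDelimiterFilterFactory: alphanumeric runs become subwords;
        # punctuation like '#' and the apostrophe is dropped entirely.
        out.extend(re.findall(r"[A-Za-z0-9]+", tok))
    return [t.lower() for t in out]  # LowerCaseFilterFactory

print(analyze("#######'s myspace blog"))  # ['s', 'myspace', 'blog']
```

Since no token containing # survives analysis, no query for # can ever match this field.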



thanks
Joel




On Dec 8, 2009, at 11:14 AM, Erick Erickson wrote:

> In Luke, there's a tab that will let you go to a document ID. From  
> there
> you can see all the fields in a particular document, and examine what
> the actual tokens stored are. Until and unless you know what tokens
> are being indexed, you simply can't know what your queries should look
> like.......
>
> *Assuming* that the ### are getting indexed, and *assuming* your
> tokenizer tokenizes on whitespace, and *assuming* that by text_rev you
> are talking about ReversedWildcardFilterFactory, I
> wouldn't expect a search to match unless it was exactly:
> s'#######. But as you see, there's a long chain of assumptions there,
> any one of which may be violated by your schema. So please post the
> relevant portions of your schema to make it easier to help.
>
> Best
> Erick
>
>
> On Tue, Dec 8, 2009 at 9:54 AM, Joel Nylund <jnylund@yahoo.com> wrote:
>
>> Thanks Eric,
>>
>> I looked more into this, but still stuck:
>>
>> I have this field indexed using text_rev
>>
>> I looked at the luke analysis for this field, but im unsure how to  
>> read it.
>>
>> When I query the field by the id I get:
>>
>> <result name="response" numFound="1" start="0">
>> -
>> <doc>
>> <str name="id">5405255</str>
>> <str name="textTitle">#######'s test blog</str>
>> </doc>
>> </result>
>>
>> If I try to query for even multiple # characters, I get nothing.
>>
>> Here is what the Luke handler says (btw, when I used id instead of
>> docid with Luke I got a NullPointerException: /admin/luke?docid=5405255
>> vs /admin/luke?id=5405255):
>>
>> <lst name="textTitle">
>> <str name="type">text_rev</str>
>> <str name="schema">ITS-----------</str>
>> <str name="index">ITS----------</str>
>> <int name="docs">290329</int>
>> <int name="distinct">401016</int>
>> -
>> <lst name="topTerms">
>> <int name="#1;golb">49362</int>
>> <int name="blog">49362</int>
>> <int name="#1;ecapsym">29426</int>
>> <int name="myspace">29426</int>
>> <int name="#1;s">8773</int>
>> <int name="s">8773</int>
>> <int name="#1;ed">8033</int>
>> <int name="de">8033</int>
>> <int name="com">6884</int>
>> <int name="#1;moc">6884</int>
>> </lst>
>> -
>> <lst name="histogram">
>> <int name="1">308908</int>
>> <int name="2">34340</int>
>> <int name="4">21916</int>
>> <int name="8">14474</int>
>> <int name="16">9122</int>
>> <int name="32">5578</int>
>> <int name="64">3162</int>
>> <int name="128">1844</int>
>> <int name="256">910</int>
>> <int name="512">464</int>
>> <int name="1024">182</int>
>> <int name="2048">72</int>
>> <int name="4096">26</int>
>> <int name="8192">12</int>
>> <int name="16384">2</int>
>> <int name="32768">2</int>
>> <int name="65536">2</int>
>> </lst>
>> </lst>
>>
>>
>> solr/select?q=textTitle:%23%23%23  - gets no results.
>>
>> I have the same field indexed as an alphaOnlySort, and it gives me
>> lots of results, but not the ones I want.
>>
>> Any other ideas?
>>
>> thanks
>> Joel
>>
>>
>>
>> On Dec 7, 2009, at 3:42 PM, Erick Erickson wrote:
>>
>>> Well, the very first thing I would do is examine the field definition in
>>> your schema file. I suspect that the tokenizers and/or
>>> filters you're using for indexing and/or querying are doing something
>>> to the # symbol, most likely stripping it. If you're just searching
>>> for the single-letter term "#", I *think* the query parser silently just
>>> drops that part of the clause out, but check on that.....
>>>
>>> The second thing would be to get a copy of Luke and examine your
>>> index to see if what you *think* is in your index actually is there.
>>>
>>> HTH
>>> Erick
>>>
>>> On Mon, Dec 7, 2009 at 3:28 PM, Joel Nylund <jnylund@yahoo.com>  
>>> wrote:
>>>
>>>> ok thanks, sorry my brain wasn't working, but even when I URL-encode
>>>> it, I don't get any results. Is there something special I have to do
>>>> for Solr?
>>>>
>>>> thanks
>>>> Joel
>>>>
>>>>
>>>> On Dec 7, 2009, at 3:20 PM, Paul Libbrecht wrote:
>>>>
>>>>> Sure, you have to escape it! %23
>>>>>
>>>>> Otherwise the browser considers it a separator between the URL for
>>>>> the server (on the left) and the fragment identifier (on the right),
>>>>> which is not sent to the server.
>>>>>
>>>>> You might want to read about "URL-encoding"; escaping with a
>>>>> backslash is a shell thing, not a thing for URLs!
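The point above in code: the # must be percent-encoded before the URL is sent, which the standard library can demonstrate (the query string is the one from the earlier messages in this thread):

```python
from urllib.parse import quote

# '#' starts the fragment identifier in a URL, so it (and the other
# reserved characters here) must be percent-encoded to reach the server.
q = 'textTitle:"#"'
print("http://localhost:8983/solr/select?q=" + quote(q))
# http://localhost:8983/solr/select?q=textTitle%3A%22%23%22
```

Escaping alone only gets the character to the server, of course; whether it matches anything still depends on how the field was analyzed.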
>>>>>
>>>>> paul
>>>>>
>>>>>
>>>>> On Dec 7, 2009, at 9:16 PM, Joel Nylund wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How can I put a # sign in a query? Do I need to escape it?
>>>>>>
>>>>>> For example I want to query books with title that contain #
>>>>>>
>>>>>> None of these have worked so far:
>>>>>> http://localhost:8983/solr/select?q=textTitle:"#"
>>>>>> http://localhost:8983/solr/select?q=textTitle:#
>>>>>> http://localhost:8983/solr/select?q=textTitle:"\#"
>>>>>>
>>>>>> Getting:
>>>>>>
>>>>>> org.apache.lucene.queryParser.ParseException: Cannot parse
>>>>>> 'textTitle:\': Lexical error at line 1, column 12.  Encountered:
>>>>>> <EOF> after : ""
>>>>>>
>>>>>> and sometimes just no response.
>>>>>>
>>>>>>
>>>>>> thanks
>>>>>> Joel
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>

