lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Bizarre TFV output
Date Mon, 21 Jun 2010 19:06:58 GMT
: 
: <field name="text_t" type="text" indexed="true" stored="true"
: multiValued="true" termVectors="true" termPositions="true"
: termOffsets="true"/> 
: 
: It uses the text field type as its defined in Solr schema. I didn't
: change it.

which version of Solr? (the schema is just an example, and the field types 
in the example schema change between versions as new analysis 
components are added and best practices are re-evaluated)

: The input text is a 6 page UTF-8 text document, the relevant line the
: term seems to be related to. Just a sentence with no specific
: boundaries.

Did you try pasting that text into the analysis page to see exactly what 
your "text_t" field does with it at analysis time, like I suggested?

My best hunch is that the "spaces" are not your typical basic "space" 
character (hex 20) and maybe the tokenizer you are using doesn't tokenize 
on them, but then perhaps something like word delimiter treats them as 
non-word characters and chews them up.

but that's just a guess ... w/o knowing the exact fieldtype analyzer and 
the specific Unicode characters used in the text, I can't say for sure.
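
One way to test that hunch before touching the schema is to scan the raw 
input for space-like characters that are not the plain hex 20 space. The 
following is only a rough sketch: the function name find_unusual_spaces is 
made up for illustration, and checking the Unicode categories Zs (space 
separators) and Cf (format characters such as zero-width spaces) is an 
assumption about which characters are worth flagging, not a statement about 
what any particular Solr tokenizer splits on.

```python
import unicodedata


def find_unusual_spaces(text):
    """Return (index, code point, name) for space-like chars other than U+0020.

    Hypothetical helper: the choice of categories below is an assumption.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue  # the ordinary space (hex 20) is not suspicious
        # Zs = space separators (e.g. NO-BREAK SPACE),
        # Cf = format characters (e.g. ZERO WIDTH SPACE)
        if unicodedata.category(ch) in ("Zs", "Cf"):
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    sample = "foo\u00a0bar baz"  # a no-break space hides between foo and bar
    for index, codepoint, name in find_unusual_spaces(sample):
        print(index, codepoint, name)
```

If this reports anything for the sentence in question, those are the 
characters to try in the analysis page.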

(Tip: if you use the JSON response writer (wt=json) when looking at the 
stored field value, it will help you see exactly what characters were in 
the original values by showing you the unicode escapes)
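
To illustrate why escapes help, here is a small sketch using Python's own 
json module rather than Solr itself. It shows how a string that looks like 
it contains an ordinary space reveals a different character once non-ASCII 
characters are written as \uXXXX escapes. The helper name show_escapes is 
hypothetical, and the exact escaping in a real Solr JSON response may 
differ from what Python produces.

```python
import json


def show_escapes(value):
    """Serialize a string as JSON with non-ASCII characters escaped.

    Hypothetical helper for illustration; ensure_ascii=True is Python's
    default and forces \\uXXXX escapes for anything outside ASCII.
    """
    return json.dumps(value, ensure_ascii=True)


if __name__ == "__main__":
    looks_normal = "foo\u00a0bar"  # renders like "foo bar" in most viewers
    print(show_escapes(looks_normal))
```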


-Hoss

