lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <>
Subject Re: Missing tokens
Date Wed, 18 Aug 2010 10:55:08 GMT

Can you share with us how your schema looks for this field? What FieldType, tokenizer,
and analyzer?
How do you parse the PDF document? Is that done before submitting to Solr, and with what tool?
How do you run the query? Do you get the same results when querying from a browser
instead of through SolrJ?
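For reference, an analyzed text field definition typically looks roughly like the sketch below. This is illustrative only; the field and type names, and the exact filter chain, are assumptions, not Paul's actual schema. Each filter in the chain (lower-casing, stop-word removal, etc.) can change how many terms end up in the index compared to the raw text:

```xml
<!-- schema.xml sketch: illustrative names, not the poster's actual config -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits the raw text into tokens on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- normalizes case so "Foo" and "foo" become one term -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drops common words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<field name="contents" type="text" indexed="true" stored="true"/>
```

Pasting your actual `<fieldType>` and `<field>` entries for `contents` would make it much easier to see where tokens are being dropped or merged.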

Jan Høydahl, search solution architect
Cominvent AS -
Training in Europe -

On 18. aug. 2010, at 11.34, wrote:

> Hi, I'm having a problem with certain search terms not being found when I
> do a query. I'm using Solrj to index a pdf document, and add the contents
> to the 'contents' field. If I query the 'contents' field on the
> SolrInputDocument doc object as below, I get 50k tokens.
> StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
> System.out.println("Tokens: " + to.countTokens());
> However, once the doc is indexed and I use Luke to analyse the index, it
> has only 3300 tokens in that field. Where did the other 47k go?
> I read some other threads suggesting to increase maxFieldLength in
> solrconfig.xml; my setting is below.
>  <maxFieldLength>2147483647</maxFieldLength>
> Any advice is appreciated,
> Paul
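For comparison, here is a minimal, self-contained sketch (plain JDK, no Solr dependencies) of the counting approach Paul describes. Note that `StringTokenizer` counts every whitespace-separated occurrence in the raw text, whereas an index view such as Luke's may report something different after the field's analysis chain has run, so the two numbers are not directly comparable:

```java
import java.util.StringTokenizer;

// Sketch of the whitespace-based count from the quoted message.
// StringTokenizer with no delimiter argument splits on whitespace.
public class TokenCount {

    // Returns the number of whitespace-separated chunks in the text.
    static int countTokens(String text) {
        return new StringTokenizer(text).countTokens();
    }

    public static void main(String[] args) {
        // Counts occurrences, not distinct terms: "the" is counted twice here.
        System.out.println(countTokens("the cat sat on the mat")); // 6
    }
}
```

Running the same sample text through Solr's analysis page (`analysis.jsp` in the admin UI) shows what the analyzer actually emits for the field, which is a more direct way to compare the two counts.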
