lucene-solr-user mailing list archives

From Peter Spam <ps...@mac.com>
Subject Re: Solr searching performance issues, using large documents
Date Wed, 21 Jul 2010 21:41:38 GMT
From the mailing list archive, Koji wrote:

> 1. Provide another field for highlighting and use copyField to copy plainText to the highlighting field.

and Lance wrote: http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

> If you want to highlight field X, doing the termOffsets/termPositions/termVectors will make highlighting that field faster. You should make a separate field and apply these options to that field.
> 
> Now: doing a copyField adds a "value" to a multiValued field. For a text field, you get a multi-valued text field. You should only copy one value to the highlighted field, so just copyField the document to your special field. To enforce this, I would add multiValued="false" to that field, just to avoid mistakes.
> 
> So, all_text should be indexed without the term* attributes, and should not be stored. Your document is then stored in a separate field that you use for highlighting and that has the term* attributes.

I've been experimenting with this, and here's what I've tried:

   <field name="body" type="text_pl" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
   <field name="body_all" type="text_pl" indexed="false" stored="true" multiValued="true" />
   <copyField source="body" dest="body_all"/>

... but it's still very slow (10+ seconds).  Why is it better to have two fields (one indexed
but not stored, and the other not indexed but stored) rather than just one field that's both
indexed and stored?
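Re-reading Lance's note, maybe I have the attributes backwards from what I tried above: the stored field used for highlighting should be the single-valued one carrying the term* options, and the searched field should stay plain and unstored. Roughly (body_hl is just a placeholder name):

```xml
   <!-- searched field: indexed, no term* options, not stored -->
   <field name="body" type="text_pl" indexed="true" stored="false" multiValued="false" />
   <!-- highlighting field: stored, single-valued, with the term vector options -->
   <field name="body_hl" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
   <copyField source="body" dest="body_hl"/>
```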


From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors

> If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used.

What does this mean?  How do you load a field lazily?
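From what I can tell, it's a single flag in solrconfig.xml rather than something you set per request - with it enabled, only the stored fields actually requested via &fl= are loaded from disk eagerly, and other stored fields are fetched lazily if they're touched. Something like:

```xml
<!-- solrconfig.xml, inside the <query> section -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```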

Thanks for your time, guys - this is starting to get frustrating, since everything works so well, except that it's very slow!


-Pete

On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:

> Data set: About 4,000 log files (will eventually grow to millions).  Average log file is 850k.  Largest log file (so far) is about 70MB.
> 
> Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds).  Thanks in advance for any ideas!
> 
> 
> -Peter
> 
> 
> -------------------------------------------------------------------------------------------------------------------------------------
> 
> 4GB RAM server
> % java -Xms2048M -Xmx3072M -jar start.jar
> 
> -------------------------------------------------------------------------------------------------------------------------------------
> 
> schema.xml changes:
> 
>    <fieldType name="text_pl" class="solr.TextField">
>      <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> 	<filter class="solr.LowerCaseFilterFactory"/> 
> 	<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>      </analyzer>
>    </fieldType>
> 
> ...
> 
>   <field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
>   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
>   <field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
>   <field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
>   <field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
>   <field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
>   <field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
>   <field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
>   <field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>
> 
> ...
> 
> <dynamicField name="*" type="ignored" multiValued="true" />
> <defaultSearchField>body</defaultSearchField>
> <solrQueryParser defaultOperator="AND"/>
> 
> -------------------------------------------------------------------------------------------------------------------------------------
> 
> solrconfig.xml changes:
> 
>    <maxFieldLength>2147483647</maxFieldLength>
>    <ramBufferSizeMB>128</ramBufferSizeMB>
> 
> -------------------------------------------------------------------------------------------------------------------------------------
> 
> The query:
> 
> rowStr = "&rows=10"
> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
> termvectors = "&tv=true&qt=tvrh&tv.all=true"
> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
> regexv = "(?m)^.*\n.*\n.*$"
> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
> 
> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s)) + justq + rowStr + facet + fields + termvectors + hl + hl_regex
> 
> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
> 
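The q-parameter escaping in the snippet above can be pulled out into a tiny stand-alone sketch (the method name solr_q is made up, and the fuzzy and minLogSizeStr pieces are omitted):

```ruby
require 'cgi'

# Sketch of the escaping step only: strip raw backslashes, backslash-escape
# characters special to the Solr query parser, then URL-encode the whole
# q parameter.
def solr_q(raw)
  escaped = raw.to_s.gsub(/\\/, '').gsub(/([:~!<>="])/, '\\\\\1')
  '&q=' + CGI.escape('body:' + escaped)
end

puts solr_q('error: disk "full"')
```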

