lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Date Mon, 01 Dec 2008 02:38:46 GMT
Hi Fergie,

Haven't forgotten about you, but I've been traveling and then caught up
in some US holidays here.

To confirm I am understanding: you are seeing a slowdown between a
1.3-dev from April and one from September, right?

Can you produce an MD5 hash of the WAR file or something, so that I can
know I have the exact bits? Better yet, perhaps you can put those files
up somewhere where they can be downloaded.
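(For anyone following along, here is a minimal sketch of the kind of checksum I mean. It assumes GNU coreutils' md5sum; on macOS you'd use `md5` or `openssl md5` instead, and the sample file below is just a stand-in for the real solr.war.)

```shell
# Create a stand-in file for illustration -- point this at the actual
# deployed WAR, e.g. /usr/local/tomcat/webapps/solr.war.
printf 'hello\n' > /tmp/sample.war

# Print a 32-hex-digit MD5 digest followed by the path, so both sides
# can confirm they are testing byte-identical bits.
md5sum /tmp/sample.war
```

Comparing the digest on both ends rules out any doubt about which build is being benchmarked.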

Thanks,
Grant

On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote:

> Hello Grant,
>
> Not much good with Java profilers (yet!) so I thought I
> would send a script!
>
> Details... details! Having decided to produce a script to
> replicate the 1.2 vs 1.3 speed problem, the required rigor
> revealed a lot more.
>
> 1) The faster version I have previously referred to as 1.2,
>   was actually a "1.3-dev" I had downloaded as part of the
>   solr bootcamp class at ApacheCon Europe 2008. The ID
>   string in the CHANGES.txt document is:-
>   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>
> 2) I did actually download and speed test a version of 1.2
>   from the internet. Its CHANGES.txt id is:-
>   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
>   Speed-wise it was about the same as 1.3 at 64min. It also
>   had lots of charset issues and is ignored from now on.
>
> 3) The version I was planning to use, till I found this
>   speed issue, was the "latest" official version:-
>   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
>   I also verified the behavior with a nightly build:-
>   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>
> Anyway, the following script indexes the content in 22min
> with the 1.3-dev version and takes 68min with the newer
> releases of 1.3. I took the conf directory from the 1.3-dev
> (bootcamp) release and used it to replace the conf directory
> from the official 1.3 release. The 3x slowdown was still
> there; it is not a configuration issue!
> =================================
>
>
>
>
>
>
> #! /bin/bash
>
> # This script assumes a /usr/local/tomcat link to whatever version
> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also
> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
> # All the following was done as root.
>
>
> # I have a directory /usr/local/ts which contains four versions of solr:
> # the "official" 1.2, two 1.3 releases, and a version of 1.2 or a 1.3 beta
> # I got while attending a solr bootcamp. I indexed the same content using
> # the different versions of solr as follows:
> cd /usr/local/ts
> if [ "" ]  # always false; change "" to a non-empty string to rerun the one-off setup
> then
>   echo "Starting from a-fresh"
>   sleep 5 # allow time for me to interrupt!
>   cp -Rp apache-solr-bc/example/solr      ./solrbc  #bc = bootcamp
>   cp -Rp apache-solr-nightly/example/solr ./solrnightly
>   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
>
>   # the gaz is regularly updated and its name keeps changing :-) The page
>   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the
>   # latest version.
>   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip" > geonames.zip
>   unzip -q geonames.zip
>   # delete corrupt blips!
>   perl -i -n -e 'print unless
>       ($. > 2128495 and $. < 2128505) or
>       ($. > 5944254 and $. < 5944260)
>       ;' geonames_dd_dms_date_20081118.txt
>   # following was used to detect bad short records:
>   # perl -a -F\\t -n -e 'print "line $. is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt
>
>   # my set of fields and copyfields for the schema.xml
>   fields='
>   <fields>
>      <field name="UNI"           type="string" indexed="true"  stored="true" required="true" />
>      <field name="CCODE"         type="string" indexed="true"  stored="true"/>
>      <field name="DSG"           type="string" indexed="true"  stored="true"/>
>      <field name="CC1"           type="string" indexed="true"  stored="true"/>
>      <field name="LAT"           type="sfloat" indexed="true"  stored="true"/>
>      <field name="LONG"          type="sfloat" indexed="true"  stored="true"/>
>      <field name="MGRS"          type="string" indexed="false" stored="true"/>
>      <field name="JOG"           type="string" indexed="false" stored="true"/>
>      <field name="FULL_NAME"     type="string" indexed="true"  stored="true"/>
>      <field name="FULL_NAME_ND"  type="string" indexed="true"  stored="true"/>
>      <!--field name="text"       type="text"   indexed="true"  stored="false" multiValued="true"/-->
>      <!--field name="timestamp"  type="date"   indexed="true"  stored="true"  default="NOW" multiValued="false"/-->
>   '
>   copyfields='
>      </fields>
>      <copyField source="FULL_NAME" dest="text"/>
>      <copyField source="FULL_NAME_ND" dest="text"/>
>   '
>
>   # add in my fields and copyfields
>   perl -i -p -e "print qq($fields) if s/<fields>//;"           solr*/conf/schema.xml
>   perl -i -p -e "print qq($copyfields) if s[</fields>][];"     solr*/conf/schema.xml
>   # change the unique key and mark the "id" field as not required
>   perl -i -p -e "s/<uniqueKey>id/<uniqueKey>UNI/i;"            solr*/conf/schema.xml
>   perl -i -p -e 's/required="true"//i if m/<field name="id"/;' solr*/conf/schema.xml
>   # enable remote streaming in solrconfig file
>   perl -i -p -e 's/enableRemoteStreaming="false"/enableRemoteStreaming="true"/;' solr*/conf/solrconfig.xml
>   fi
>
> # some constants to keep the curl command shorter
> skip="MODIFY_DATE,RC,UFI,DMS_LAT,DMS_LONG,FC,PC,ADM1,ADM2,POP,ELEV,CC2,NT,LC,SHORT_FORM,GENERIC,SORT_NAME"
> file=`pwd`"/geonames_dd_dms_date_20081118.txt"  # the file unzipped above
>
> export JAVA_OPTS=" -Xmx512M -Xms512M -Dsolr.home=`pwd`/solr -Dsolr.solr.home=`pwd`/solr"
>
> echo 'Getting ready to index the data set using solrbc (bc = bootcamp)'
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
>   then
>   echo "Tomcat would not shutdown"
>   exit
>   fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-bc/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrbc solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrbc (bc = bootcamp)"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
> echo "Getting ready to index the data set using solrnightly"
> /usr/local/tomcat/bin/shutdown.sh
> sleep 15
> if [ -n "`ps awxww | grep tomcat | grep -v grep`" ]
>   then
>   echo "Tomcat would not shutdown"
>   exit
>   fi
> rm -r /usr/local/tomcat/webapps/solr*
> rm -r /usr/local/tomcat/logs/*.out
> rm -r /usr/local/tomcat/work/Catalina/localhost/solr
> cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
> rm solr # rm the symbolic link
> ln -s solrnightly solr
> rm -r solr/data
> /usr/local/tomcat/bin/startup.sh
> sleep 10 # give solr time to launch and setup
> echo "Starting indexing at " `date` " with solrnightly"
> time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"
>
>
>
>
>> On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>>
>>> Hello Grant,
>>>
>>>> Were you overwriting the existing index or did you also clean out  
>>>> the
>>>> Solr data directory, too?  In other words, was it a fresh index, or
>>>> an
>>>> existing one?  And was that also the case for the 22 minute time?
>>>
>>> No, in each case it was a new index. I store the indexes (the "data"
>>> dir) outside the solr home directory. For the moment I rm -rf the
>>> index dir after each edit to the solrconfig.xml or schema.xml file
>>> and reindex from scratch. The relaunch of tomcat recreates the index
>>> dir.
>>>
>>>> Would it be possible to profile the two instance and see if you
>>>> notice
>>>> anything different?
>>> I don't understand this. Do you mean run a profiler against the
>>> tomcat image as indexing takes place, or somehow compare the indexes?
>>
>> Something like JProfiler or any other Java profiler.
>>
>>>
>>>
>>> I was thinking of making a short script that replicates the results,
>>> and posting it here; would that help?
>>
>>
>> Very much so.
>>
>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>> On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a CSV file with 6M records which took 22min to index with
>>>>> solr 1.2. I then stopped tomcat replaced the solr stuff inside
>>>>> webapps with version 1.3, wiped my index and restarted tomcat.
>>>>>
>>>>> Indexing the exact same content now takes 69min. My machine has
>>>>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M - 
>>>>> Xms512M.
>>>>>
>>>>> Are there any tweaks I can use to get the original index time
>>>>> back? I read through the release notes and was expecting a
>>>>> speed-up. I saw the bit about increasing ramBufferSizeMB and set
>>>>> it to 64MB; it had no effect.
>>>>> -- 
>
> -- 
>
> ===============================================================
> Fergus McMenemie               Email:fergus@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










