lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Improving indexing performance
Date Wed, 09 Oct 2013 00:22:10 GMT
The queue size shouldn't really be very large; the whole point of
the concurrency is to avoid waiting around for the
communication with the server in a single thread. So having
a bunch of documents backed up in the queue isn't buying you anything...
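
If it helps, this is the kind of sizing I mean (rough, untested sketch;
the URL and the numbers are only illustrative):

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

  public class IndexerSketch {
      public static void main(String[] args) throws Exception {
          // A modest queue is enough to keep the sender threads busy; a very
          // large queue mostly just buffers update requests in the client heap.
          ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                  "http://solrhost:8983/solr/collection1", // illustrative URL
                  100, // queueSize: pending update requests buffered client-side
                  4);  // threads draining the queue concurrently

          // add your documents in batches here, e.g. server.add(batch);

          server.blockUntilFinished(); // drain the queue before the final commit
          server.commit();
          server.shutdown();
      }
  }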

And you can always increase the memory allocated to the JVM
running SolrJ...

Erick

On Tue, Oct 8, 2013 at 5:29 AM, Matteo Grolla <matteo.grolla@gmail.com> wrote:
> Thanks Erick,
>         I think I have been able to exhaust a resource:
>         if I split the data in 2 and upload it with 2 clients like benchmark 1.1, it takes 120s; here the bottleneck is my LAN.
>         If I use a setting like benchmark 1, the bottleneck is probably the ramBuffer.
>
>         I'm going to buy a Gigabit ethernet cable so I can make a better test.
>
>         OutOfMemory error: it's the SolrJ client that crashes.
>                 I'm using Solr 4.2.1 and the corresponding SolrJ client.
>                 HttpSolrServer works fine;
>                 ConcurrentUpdateSolrServer gives me problems, and I didn't understand how to size the queueSize parameter optimally.
>
>
> On 07 Oct 2013, at 14:03, Erick Erickson wrote:
>
>> Just skimmed, but the usual reason you can't max out the server
>> is that the client can't go fast enough. Very quick experiment:
>> comment out the server.add line in your client and run it again;
>> does that speed up the client substantially? If not, then the time
>> is being spent on the client side.
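>>
>> Something like this (rough sketch; readDocsFromCsv() is just a placeholder
>> for however you build your documents today):
>>
>>   // needs org.apache.solr.common.SolrInputDocument on the classpath
>>   long start = System.currentTimeMillis();
>>   for (SolrInputDocument doc : readDocsFromCsv()) { // hypothetical helper
>>       // server.add(doc);   // <-- comment the add out for the test run
>>   }
>>   System.out.println("client-only time: "
>>       + (System.currentTimeMillis() - start) + " ms");
>>
>> If that run is nearly as slow as the real one, the client is the bottleneck.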
>>
>> Or split your csv file into, say, 5 parts and run it from 5 different
>> PCs in parallel.
>>
>> bq: I can't rely on auto commit, otherwise I get an OutOfMemory error
>> This shouldn't be happening; I'd get to the bottom of it. It may be as
>> simple as allocating more memory to the JVM running Solr.
>>
>> bq: committing every 100k docs gives worse performance
>> BTW, it's best to specify openSearcher=false for max indexing throughput.
>> You should be able to commit quite frequently; 15 seconds seems
>> quite reasonable.
>>
>> Best,
>> Erick
>>
>> On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla <matteo.grolla@gmail.com> wrote:
>>> I'd like some suggestions on how to improve the indexing performance in the following scenario:
>>> I'm uploading 1M docs to Solr;
>>>
>>> every doc has:
>>>        id: sequential number
>>>        title:  small string
>>>        date: date
>>>        body: 1kb of text
>>>
>>> Here are my benchmarks (they are all single executions, not averages from multiple executions):
>>>
>>> 1)      using the updaterequesthandler
>>>        and streaming docs from a csv file on the same disk of solr
>>>        auto commit every 15s with openSearcher=false and commit after last document
>>>
>>>        total time: 143035ms
>>>
>>> 1.1)    using the updaterequesthandler
>>>        and streaming docs from a csv file on the same disk of solr
>>>        auto commit every 15s with openSearcher=false and commit after last document
>>>        <ramBufferSizeMB>500</ramBufferSizeMB>
>>>        <maxBufferedDocs>100000</maxBufferedDocs>
>>>
>>>        total time: 134493ms
>>>
>>> 1.2)    using the updaterequesthandler
>>>        and streaming docs from a csv file on the same disk of solr
>>>        auto commit every 15s with openSearcher=false and commit after last document
>>>        <mergeFactor>30</mergeFactor>
>>>
>>>        total time: 143134ms
>>>
>>> 2)      using a solrj client from another pc in the lan (100Mbps)
>>>        with HttpSolrServer
>>>        with javabin format
>>>        add documents to the server in batches of 1k docs       ( server.add( <collection> ) ); see the minimal sketch after benchmark 3
>>>        auto commit every 15s with openSearcher=false and commit after last document
>>>
>>>        total time: 139022ms
>>>
>>> 3)      using a solrj client from another pc in the lan (100Mbps)
>>>        with ConcurrentUpdateSolrServer
>>>        with javabin format
>>>        add documents to the server in batches of 1k docs       ( server.add( <collection> ) )
>>>        server queue size=20k
>>>        server threads=4
>>>        no auto-commit and commit every 100k docs
>>>
>>>        total time: 167301ms
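>>>
>>> (For clarity, a minimal sketch of the batched add loop behind benchmarks
>>> 2 and 3; this is not the exact code, the field values and makeBody() are
>>> placeholders, and for benchmark 3 the HttpSolrServer is swapped for a
>>> ConcurrentUpdateSolrServer:)
>>>
>>>   HttpSolrServer server =
>>>       new HttpSolrServer("http://solrhost:8983/solr/collection1");
>>>   List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>>>   for (int i = 0; i < 1000000; i++) {
>>>       SolrInputDocument doc = new SolrInputDocument();
>>>       doc.addField("id", i);               // sequential number
>>>       doc.addField("title", "title " + i); // small string
>>>       doc.addField("date", new Date());    // date
>>>       doc.addField("body", makeBody(i));   // hypothetical helper: ~1kb of text
>>>       batch.add(doc);
>>>       if (batch.size() == 1000) {          // batches of 1k docs
>>>           server.add(batch);               // javabin over HTTP
>>>           batch.clear();
>>>       }
>>>   }
>>>   if (!batch.isEmpty()) server.add(batch);
>>>   server.commit();                         // commit after the last document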
>>>
>>>
>>> --On the solr server--
>>> cpu averages    25%
>>>        at best 100% for 1 core
>>> IO      is still far from being saturated
>>>        iostat gives a pattern like this (every 5 s)
>>>
>>>        time(s)         %util
>>>        100                     45,20
>>>        105                     1,68
>>>        110                     17,44
>>>        115                     76,32
>>>        120                     2,64
>>>        125                     68
>>>        130                     1,28
>>>
>>> I thought that by using ConcurrentUpdateSolrServer I would be able to max out CPU or IO, but I wasn't.
>>> With ConcurrentUpdateSolrServer I can't rely on auto commit, otherwise I get an OutOfMemory error,
>>> and I found that committing every 100k docs gives worse performance than auto commit every 15s (benchmark 3 with HttpSolrServer took 193515ms).
>>>
>>> I'd really like to understand why I can't max out the resources on the server hosting Solr (the disk above all).
>>> And I'd really like to understand what I'm doing wrong with ConcurrentUpdateSolrServer.
>>>
>>> thanks
>>>
>
