lucene-solr-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject Re: SolrCloud MatchAllDocsQuery returning different number of docs each request
Date Thu, 02 Aug 2012 20:42:12 GMT
Yes, I can but won't get to it today unfortunately. I had my eval
environment running on some very expensive EC2 instances and shut it
down for the time being until I can focus on it again. Will try to get
back to this either tomorrow or over the weekend. Sorry for the delay.

Tim
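
For reference, the change Mark asks for in the quoted message below (one add call per document instead of a single bulk add) could be sketched roughly as follows. This is only an illustration: `OneAtATime`, `DocAdder`, and `addEach` are hypothetical names, and `DocAdder` is a stand-in for the client's single-document add so that the sketch compiles without SolrJ on the classpath.

```java
import java.util.Collection;

final class OneAtATime {

    /** Hypothetical stand-in for a client's single-document add call. */
    interface DocAdder<D> {
        void add(D doc) throws Exception;
    }

    /**
     * Sends every document in the batch through its own add call,
     * rather than handing the whole collection to one bulk add.
     * Returns the number of documents sent.
     */
    static <D> int addEach(Collection<D> batch, DocAdder<D> adder) throws Exception {
        int sent = 0;
        for (D doc : batch) {
            adder.add(doc); // one request per document
            sent++;
        }
        return sent;
    }
}
```

With the SolrJ client from the thread, this would presumably be called as `OneAtATime.addEach(batch, doc -> solrServer.add(doc))` and followed by a commit. It trades throughput for simplicity, so it is meant as a diagnostic run rather than a permanent setup.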

On Thu, Aug 2, 2012 at 1:35 PM, Mark Miller <markrmiller@gmail.com> wrote:
> Can you do me a favor and try not using the batch add for a run?
>
> Just do the add one doc at a time. (solrServer.add(doc) rather than solrServer.add(collection))
>
> I just fixed one issue with it this morning on trunk - it may be the cause of this oddity.
>
> I'm also working on some performance issues around that method too (good performance
> without starting thousands of threads).
>
> Until I get all that straightened out (hopefully very soon), I think you will have better
> luck not using the bulk, collection add method.
>
> On Aug 2, 2012, at 2:16 PM, Timothy Potter <thelabdude@gmail.com> wrote:
>
>> Thanks Mark.
>>
>> I'm actually using SolrJ 3.4.0, so using CommonsHttpSolrServer:
>>
>> Collection<SolrInputDocument> batch = ...
>> // ... build up batch ...
>> solrServer.add(batch);
>>
>> Basically, I have a custom Pig StoreFunc that sends docs to Solr from
>> our Hadoop analytics nodes. The reason I'm not using SolrJ 4.0.0-ALPHA
>> is that I couldn't get it to run in my Hadoop environment. There's
>> some classpath conflict with the Apache HttpClient. SolrJ 4 depends on
>> HttpClient 4.1.3, but when I run it in my env, I get the following:
>>
>> Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager: method <init>()V not found
>>       at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:94)
>>       at org.apache.solr.client.solrj.impl.CloudSolrServer.<init>(CloudSolrServer.java:70)
>>       ... 16 more
>>
>> I spent hours trying to resolve the classpath issue and finally had to
>> bail and just used the 3.4 SolrJ client as I'm just at the evaluation
>> stage at this point. So it sounds like this could be the cause of my
>> problems.
>>
>> One other thing ... I do have the _version_ field defined in my
>> schema.xml but am not setting it on the client side when indexing.
>> Should I be doing that?
>>
>> Cheers,
>> Tim
>>
>>
>> On Thu, Aug 2, 2012 at 11:27 AM, Mark Miller <markrmiller@gmail.com> wrote:
>>>
>>> On Aug 2, 2012, at 11:08 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>>>
>>>> Just starting to get into SolrCloud using 4.0.0-ALPHA and am very
>>>> impressed so far ...
>>>>
>>>> I have a 12-shard index with ~104M docs, with each shard having
>>>> 1 replica (so 24 Solr servers running).
>>>>
>>>> Using the Query form on the Admin panel, I issue the MatchAllDocsQuery
>>>> (*:*) and each time I send the request the value for numFound in the
>>>> result is different. It's always close, but not exactly the same, as I
>>>> would expect. Can anyone shed some light on this issue? I also tried a
>>>> real query, such as "#olympics lochte" and same thing - different
>>>> numFound each time. The first page of actual docs returned is the same
>>>> so maybe I should just ignore the numFound issue?
>>>>
>>>> Note that while experiencing this behavior, I am not adding any docs
>>>> to the index and all docs have been committed with waitFlush=true and
>>>> waitSearcher=true on the commit. Also, not doing soft commits at this
>>>> point. In addition, after having committed all 104M docs, I hit the
>>>> optimize button on the panel so I have only 1 segment. In other words,
>>>> the index is not being updated and has been optimized at this point.
>>>
>>>
>>> How are you adding docs? E.g., what client and what method in particular (what is
>>> your line of code that actually adds the doc)?
>>>
>>> You can find the numFound result for each node by passing the param distrib=false.
>>> What does this tell you? Are your replicas in sync with the leader? What does the count for
>>> each shard add up to?
>>>
>>> I would not ignore the issue - something must be off. It may somehow be user
>>> error, it may be a bug that has been fixed since the alpha, or it may be something new.
>>>
>>> Are you sure every shard you are issuing the query *from* is active and live
>>> according to ZooKeeper? E.g., when you look at the cloud admin view and look at the cluster visualization,
>>> are all the nodes green?
>>>
>>> - Mark Miller
>>> lucidimagination.com
>
> - Mark Miller
> lucidimagination.com
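
The per-node check suggested in the quoted text above (query each core with distrib=false, compare leader and replica counts, then add up one count per shard) can be sketched as below. This is a rough sketch under stated assumptions: `ShardCountCheck` and its method names are hypothetical, the core base URLs are placeholders, and fetching each URL and reading numFound from the response is left to the caller. Only distrib=false comes from the thread; q=*:*, rows=0 and wt=json are ordinary Solr request parameters added here for convenience.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ShardCountCheck {

    /** Builds a match-all query URL that is answered by this core only. */
    static String localCountUrl(String coreBaseUrl) {
        String base = coreBaseUrl.endsWith("/")
                ? coreBaseUrl.substring(0, coreBaseUrl.length() - 1)
                : coreBaseUrl;
        // rows=0: only numFound is needed; distrib=false: do not fan out to other shards
        return base + "/select?q=*:*&rows=0&distrib=false&wt=json";
    }

    /**
     * Takes the numFound values collected per shard (one entry per core,
     * leader first by convention here) and returns the shards whose cores disagree.
     */
    static List<String> outOfSyncShards(Map<String, long[]> countsByShard) {
        List<String> outOfSync = new ArrayList<>();
        for (Map.Entry<String, long[]> entry : countsByShard.entrySet()) {
            long[] counts = entry.getValue();
            for (long count : counts) {
                if (count != counts[0]) {
                    outOfSync.add(entry.getKey());
                    break;
                }
            }
        }
        return outOfSync;
    }

    /** Adds up the first (leader) count of every shard. */
    static long totalFromFirstCore(Map<String, long[]> countsByShard) {
        long total = 0;
        for (long[] counts : countsByShard.values()) {
            total += counts[0];
        }
        return total;
    }
}
```

If every shard's cores agree, the sum of the per-shard counts should equal the numFound of a normal distributed query. A shard that shows up in `outOfSyncShards` would be one plausible explanation for a total that changes between requests, depending on which core answers.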
