lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Solr node is out of sync (looks Healthy)
Date Tue, 13 Feb 2018 10:07:30 GMT
Hi Daniel,
Back to your original question: what is the difference in document counts between the
replicas - a few docs or a large number? My assumption is that you don't have autocommit
enabled, that you commit explicitly when indexing is done, and that somehow on some
replica(s) the commit is processed before all docs are indexed.
Some inline comments.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 13 Feb 2018, at 10:22, Daniel Carrasco <d.carrasco@i2tic.com> wrote:
> 
> Hello,
> 
> I answer inline ;)
> 
> 2018-02-12 23:56 GMT+01:00 Emir Arnautović <emir.arnautovic@sematext.com>:
> 
>> Hi Daniel,
>> Please see inline comments.
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 12 Feb 2018, at 13:13, Daniel Carrasco <d.carrasco@i2tic.com> wrote:
>>> 
>>> Hello,
>>> 
>>> 2018-02-12 12:32 GMT+01:00 Emir Arnautović <emir.arnautovic@sematext.com>:
>>> 
>>>> Hi Daniel,
>>>> Maybe it is Monday and I am still not warmed up, but your details seem
>>>> a bit imprecise to me. Maybe not directly related to your problem, but
>>>> just to exclude that you have some strange Solr setup, here is my
>>>> understanding: you are running a single SolrCloud cluster with 8 nodes.
>>>> It has a single collection with X shards and Y replicas. You use DIH to
>>>> index data, and you use curl to interact with Solr and start the DIH
>>>> process. You see some replicas of some shards having less data, and
>>>> after a node restart they end up being OK.
>>> 
>>> 
>>>> Is this right? If it is, what is X and Y?
>>> 
>>> 
>>> Near to reality:
>>> 
>>>  - I have a SolrCloud cluster with 8 nodes, but it has multiple collections.
>>>  - Every collection has only one shard, for performance reasons (I did
>>>  some tests splitting shards, and queries were slower).
>> A distributed request comes with an overhead, and if the collection is small,
>> that overhead can be larger than the benefit of parallelising the search.
>> 
>>>  - Every collection has 8 replicas (one per node).
>> I would compare all shards on all nodes (64 Solr cores) vs. having just
>> one replica (16 Solr cores).
>> 
> 
> We have all shards on all nodes because we want to avoid the overhead of
> sending data between nodes (latency, network traffic). The page gets a lot of
> requests per second and we want the fastest response possible, with HA if some
> nodes fail.
If you are using SolrJ in your middle layer, you can initialize it with ZooKeeper and it
will know where the collections live and send requests directly to the nodes that host
their shards.
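If it helps, a minimal SolrJ sketch of such a ZooKeeper-aware client (the ZK hosts and collection name are placeholders; this assumes SolrJ 7.x, where the Builder takes the ZK host list, and it needs a live cluster to actually run):

```java
import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ZkAwareQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble - replace with your own hosts.
        CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                Optional.empty())          // no ZK chroot
            .build();
        client.setDefaultCollection("descriptions");

        // The client reads cluster state from ZooKeeper, so each request
        // goes straight to a node hosting a shard of the target collection.
        QueryResponse rsp = client.query(new SolrQuery("*:*"));
        System.out.println("numFound: " + rsp.getResults().getNumFound());

        client.close();
    }
}
```

Because the client watches the cluster state in ZooKeeper, routing stays correct when replicas move or nodes fail, without a load balancer in between.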

> 
> Also, we have 8 collections: five are small (less than 15 MB), and three of
> them are a few GB (the biggest is about 10 GB).
> This is the first SolrCloud cluster I've created, and I decided on this
> architecture to avoid what I said above, the overhead of sending data between
> nodes when the client asks another node for data. Maybe it would be better to
> have a replication factor of 3-4, for example, and create shards for the big
> collections?
Only testing can tell whether splitting the large collections will bring benefits. If you
are happy with your query latency, then you don't have to split.

> 
> 
>> 
>>>  - After restarting a node, it starts to recover the collections. I don't
>>>  know whether Solr serves data directly in that state or gets the data from
>>>  other nodes before serving it, but even while it is recovering, the data
>>>  looks OK.
>> Recovery can be from transaction logs (the logs can tell), and that would
>> mean there was no hard commit after some updates.
>> 
>>> 
>>> 
>>> 
>>>> Do you have autocommit set up or you commit explicitly?
>>> 
>>> 
>>> I'm not sure about that. How can I check it?
>> It is part of solr.xml
>> 
> 
> I've checked the file and it looks like there's no configuration for
> autocommit. I'll read a bit about how it works to see if it can help.
> 
I might have pointed you to the wrong file - it is solrconfig.xml, not solr.xml. Under
updateHandler you should find autoCommit and autoSoftCommit.
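To give an idea, that section typically looks roughly like this (the times are example values only, not a recommendation for your setup - tune them to your indexing rate):

```xml
<!-- solrconfig.xml fragment - example values only -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes to disk and truncates the transaction log. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- at most 60 s after an update -->
    <openSearcher>false</openSearcher>  <!-- durability only, no visibility -->
  </autoCommit>
  <!-- Soft commit: reopens the searcher so changes become visible. -->
  <autoSoftCommit>
    <maxTime>15000</maxTime>            <!-- new docs searchable within 15 s -->
  </autoSoftCommit>
</updateHandler>
```

With autoSoftCommit enabled, each replica reopens its searcher on a schedule, which also bounds how long a replica can keep serving stale data after indexing.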

> 
>> 
>>> 
>>> In the curl command it is not specified, but it will be true by default, right?
>> I think it is for full import.
>> 
>>> 
>>> 
>>> 
>>>> Did you check logs on node with less data and did you see any
>>>> errors/warnings?
>>> 
>>> 
>>> I'm not sure when it failed, and the cluster logs a lot of warnings and
>>> errors all the time (maybe related to queries from the shop), so it is hard
>>> to determine whether there was an import error and which errors are related
>>> to the import. It is like searching for a needle in a haystack.
>> Not with some logging solution - one such is Sematext's Logsene:
>> https://sematext.com/logsene/
>> 
>>> 
>>> 
>>>> Do you do full imports or incremental imports?
>>>> 
>>> 
>>> I've checked the curl command, and it looks like it is doing full imports
>>> without cleaning the data:
>>> http://' . $solr_ip . ':8983/solr/descriptions/dataimport?command=full-import&clean=false&entity=description_'.$idm[$j].'_lastupdate
>> This is not a good strategy, since Solr does not have real updates - an
>> update is a delete + insert, and deleted documents are only purged on segment
>> merges. Also, this will not eliminate documents that were deleted in the
>> primary storage. It is much better to index into a new collection and use an
>> alias to point to the collection in use. That way you can even roll back if
>> you are not happy with the new index.
>> 
>> 
> But if you update three products, for example, and you create a new
> collection with those updates, how do you point those three products back to
> the original collection? Or do you have to reindex the whole collection into
> a new collection and then create an alias?
You create the alias first, pointing to some existing collection, and reconfigure your app
to use the alias. Then, when doing a full import, you create a new collection, do the
import, verify the results and point the alias to the new collection. You can keep the old
collection so you can switch back to it, or delete it. Note that after the initial
reconfiguration of your app, the switching is transparent to the app.
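The flow above, sketched with the Collections API (the collection and config names are hypothetical - substitute your own; this obviously needs a running cluster):

```shell
SOLR=http://localhost:8983/solr   # adjust host/port

# One-time: create the alias your app will query.
curl "$SOLR/admin/collections?action=CREATEALIAS&name=descriptions&collections=descriptions_v1"

# Each full reindex: build a fresh collection, import into it, verify, switch.
curl "$SOLR/admin/collections?action=CREATE&name=descriptions_v2&numShards=1&replicationFactor=8&collection.configName=descriptions_conf"
curl "$SOLR/descriptions_v2/dataimport?command=full-import"
# ...verify doc counts and spot-check queries against descriptions_v2...
curl "$SOLR/admin/collections?action=CREATEALIAS&name=descriptions&collections=descriptions_v2"

# Roll back by re-pointing the alias, or delete the old collection when done.
curl "$SOLR/admin/collections?action=DELETE&name=descriptions_v1"
```

CREATEALIAS re-points the alias atomically, so queries through the alias never see a half-built index.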

> Also, I don't know whether it is a good idea to reindex a whole 10 GB
> collection every 5 minutes to create another collection with the updates.
I got the impression that this is the only way you do updates. Do you do direct updates as
well?

> Also, you have to manage deleting the old collections, to avoid filling up
> the disk.
You can delete the entire old collection at any moment, once the alias is pointing to the
new one.

> 
> 
> 
>>> 
>>> 
>>>> 
>>>> Not related to the issue, but just a note that Solr does not guarantee
>>>> consistency at any point in time - it has something called "eventual
>>>> consistency": once updates stop, all replicas will (should) end up in the
>>>> same state. Having said that, using Solr results directly in your UI would
>>>> require you to either cache used documents in the UI/middle layer,
>>>> implement some sort of stickiness, or retrieve only the ID from Solr and
>>>> load the data from primary storage. If you have static data and you update
>>>> the index once a day, you can use aliases and switch between the new and
>>>> old index, and you will suffer from this issue only at the time when doing
>>>> switches.
>>>> 
>>> 
>>> But is it normal for the data to be inconsistent for a very long time?
>>> It looks like the data has been inconsistent for about a week…
>> It will become consistent once all changes are committed and the searchers
>> are reopened.
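One way to check where each replica stands (host and collection names are placeholders; distrib=false makes a node answer from its own index instead of routing the query):

```shell
SOLR_NODE=http://node1:8983/solr   # repeat for each of the 8 nodes

# Explicit hard commit: flushes pending updates and reopens searchers.
curl "$SOLR_NODE/descriptions/update?commit=true"

# Per-replica doc count, bypassing distributed routing.
curl "$SOLR_NODE/descriptions/select?q=*:*&rows=0&distrib=false"
```

If the numFound values still differ across nodes after a commit, the replica really is out of sync, rather than just serving a stale searcher.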
>> 
>>> 
>>> Another question: with HDFS, will the data be consistent? With HDFS the
>>> data will be shared between nodes, so updates will be available on all
>>> nodes at the same time, right?
>> I am not too familiar with running Solr on HDFS, but I doubt that it works
>> the way you expect it to. You might be able to have multiple Solr instances
>> (not part of the same cluster - they can be standalone Solr) reading from
>> the same HDFS directory and one updating it, but you would probably have to
>> reload the core on the read instances to make them aware of changes. Not
>> sure you would get much out of it - it is just a different replication
>> mechanism. But it is late here and I have never used Solr on HDFS, so take
>> this with a grain of salt.
>> 
> 
> Today I read a comment saying that it is like a standalone server and that it
> creates a copy of the data for every node that has a replica. Maybe that is
> wrong, but it is not what I want.
> I want to have a SolrCloud cluster like now, but sharing the data through
> HDFS and with the ability to autoscale if there is too much load (AWS and GCP
> autoscaling groups) - but maybe that is like hunting unicorns…
It is probably best to start a new thread for the HDFS questions.

> 
> 
>>> 
>>> Thanks!!
>>> 
>>> 
>>>> 
>>>> Regards,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 12 Feb 2018, at 12:00, Daniel Carrasco <d.carrasco@i2tic.com> wrote:
>>>>> 
>>>>> Hello, thanks for your help.
>>>>> 
>>>>> I answer below.
>>>>> 
>>>>> Greetings!!
>>>>> 
>>>>> 2018-02-12 11:31 GMT+01:00 Emir Arnautović <emir.arnautovic@sematext.com>:
>>>>> 
>>>>>> Hi Daniel,
>>>>>> Can you tell us more about your document update process? How do you
>>>>>> commit changes? Since it got fixed after a restart, it seems to me that
>>>>>> on that one node the index searcher was not reopened after updates. Do
>>>>>> you see any errors/warnings on that node?
>>>>>> 
>>>>> 
>>>>> I've asked the programmers, and it looks like they trigger the
>>>>> collections' dataimport using curl. I think the data is imported from a
>>>>> Microsoft SQL Server using a Solr plugin.
>>>>> 
>>>>> 
>>>>>> Also, what do you mean by “All nodes are standalone”?
>>>>>> 
>>>>> 
>>>>> I mean that the nodes don't share a filesystem (I'm planning to use
>>>>> Hadoop, but I have to learn to create and maintain the cluster first).
>>>>> Every node has its own data drive, and they are connected to the cluster
>>>>> using ZooKeeper.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Emir
>>>>>> --
>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 12 Feb 2018, at 11:16, Daniel Carrasco <d.carrasco@i2tic.com> wrote:
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> We're using Solr to manage product data on our shop, and last week some
>>>>>>> customers called us saying that the price in the shop and in the
>>>>>>> shopping basket differ. After researching a bit, I noticed that it
>>>>>>> sometimes happens on page refresh.
>>>>>>> After disabling all caches, I queried all Solr instances to see whether
>>>>>>> the data is correct, and I saw that one of them gives a different price
>>>>>>> for the product, so it looks like that instance does not have the
>>>>>>> updated data.
>>>>>>> 
>>>>>>> - How is it possible that a node in a cluster has different data?
>>>>>>> - How can I check whether the data is in sync? The cluster looks all
>>>>>>> healthy in the admin UI, and the node is active and OK.
>>>>>>> - Is there any way to detect this error? And how can I force a resync?
>>>>>>> 
>>>>>>> After restarting the node it got synced, so the data is OK now, but I
>>>>>>> can't restart the nodes every time to check that the data is right (it
>>>>>>> takes a lot of time to sync again).
>>>>>>> 
>>>>>>> My configuration is: 8 Solr nodes using v7.1.0 and ZooKeeper v3.4.11.
>>>>>>> All nodes are standalone (I'm not using Hadoop).
>>>>>>> 
>>>>>>> Thanks and greetings!
>>>>>>> --
>>>>>>> _________________________________________
>>>>>>> 
>>>>>>>   Daniel Carrasco Marín
>>>>>>>   Ingeniería para la Innovación i2TIC, S.L.
>>>>>>>   Tlf:  +34 911 12 32 84 Ext: 223
>>>>>>>   www.i2tic.com
>>>>>>> _________________________________________
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> _________________________________________
>>>>> 
>>>>>    Daniel Carrasco Marín
>>>>>    Ingeniería para la Innovación i2TIC, S.L.
>>>>>    Tlf:  +34 911 12 32 84 Ext: 223
>>>>>    www.i2tic.com
>>>>> _________________________________________
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> _________________________________________
>>> 
>>>     Daniel Carrasco Marín
>>>     Ingeniería para la Innovación i2TIC, S.L.
>>>     Tlf:  +34 911 12 32 84 Ext: 223
>>>     www.i2tic.com
>>> _________________________________________
>> 
>> 
> Thanks!!
> 
> -- 
> _________________________________________
> 
>      Daniel Carrasco Marín
>      Ingeniería para la Innovación i2TIC, S.L.
>      Tlf:  +34 911 12 32 84 Ext: 223
>      www.i2tic.com
> _________________________________________

