lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: SolrCloud is sick.
Date Sun, 03 Nov 2019 13:58:59 GMT
From a credentials standpoint:

Yonik and I built 90% of it originally and then I spent years on it with
few other devs or users.

Pretty sure I'm the only one that has ever had 95%+ of the Solr test suites
work in under 10-15 seconds consistently - 4000 tests across like 1000
suites. Got them all to run in parallel in under 5 minutes vs the 20-45 it
takes on a good day after tons of other speed ups and fixes I've already
done.

I doubt there are many insane enough that have deep dived and pushed around
the entire code base for 2-3 weeks multiple times, 16-20 hours a day. That
has spent the last decade beyond that stupid time almost exclusively on
this system. Designing it with Yonik, building it, fixing it, helping
people with it, monitoring it, responding to pages and escalations for it.

That has spent half a year replacing the entire decade old build for Lucene
and Solr and all its various nooks and crannies.

If anyone has spent more time on this system or pushing it around on a
large scale, or has seen it in 100x the shape it is now more than once,
please speak up, you are in charge, I follow you. I'm not that bright, if
you have done the ground work, default to you.

Otherwise, I don't even have much confidence anyone else even knows this
system remotely well. All that time and effort and the most I know of it is
what awful, awful shape it's in and the bad trend direction.


- Mark

On Sun, Nov 3, 2019 at 7:35 AM Mark Miller <markrmiller@gmail.com> wrote:

> Personally, I believe the latter so strongly, if I can’t convince the
> others in the raft with me, I’m jumping in and swimming to another raft
> after my entire adult life here.
>
> Mark
>
> On Sun, Nov 3, 2019 at 7:30 AM Mark Miller <markrmiller@gmail.com> wrote:
>
>> In fact this will be a fundamental difference some of us are about to
>> split between.
>>
>> Those that think they can ever fix the tests or the system or the 1000s
>> of bugs we have and keep adding due to our current world view of
>> making tests fit the system not the system fit the tests and the fact that
>> everything is so slow and retry and workaround that stupid shit works all
>> over. It's all deep. It's ingrained. It's grown over for a decade. It's a
>> project of 60 modules.
>>
>> Soon we will split between those that think they are making progress
>> across the ocean and those that think we are sitting in shark infested
>> waters waiting to die actually, starting to float backwards sometimes now.
>>
>> - Mark
>>
>> On Sun, Nov 3, 2019 at 7:23 AM Mark Miller <markrmiller@gmail.com> wrote:
>>
>>> bq.  They also would allow it to do it in an iterative manner without
>>> changing everything at once.
>>>
>>> Sadly, you can't fix this piece by piece :) I dare anyone to try. I
>>> encourage, I applaud the effort.
>>>
>>> The world is your oyster from a good spot - take your pick of how to do
>>> things.
>>>
>>> But from this spot, if anyone thinks we are getting out design change by
>>> design change, JIRA by JIRA, I'm so sorry. Let's commiserate in a couple
>>> years over a beer when you give up on that.
>>>
>>> - Mark
>>>
>>> On Sun, Nov 3, 2019 at 4:01 AM Jörn Franke <jornfranke@gmail.com> wrote:
>>>
>>>> I cannot say anything about the statements, but maybe it could help to
>>>> introduce Solr Improvement Proposals (SIP) similar to Kafka Improvement
>>>> Proposals (KIP) or Flink Improvement Proposals (FLIP).
>>>>
>>>>   I think they are helpful to facilitate design decisions and
>>>> refactoring / redesign decisions. They also would allow it to do it in an
>>>> iterative manner without changing everything at once.
>>>> The final version could be put in the Solr Git repository in Markdown,
>>>> including figures presenting parts of the design.
>>>>
>>>> However for developing them I propose a more inclusive approach where
>>>> many people (not only core developers) can easily comment and support, eg
>>>> Google docs or similar.
>>>>
>>>> > On Nov 3, 2019, at 06:39, Noble Paul <noble.paul@gmail.com> wrote:
>>>> >
>>>> > Solr has to do more than Lucene. A Lucene user is mostly a developer
>>>> > who reads javadocs. A Solr user's touch points are
>>>> >
>>>> > * Public API
>>>> > * Ref guide
>>>> > * publicly visible files (in ZK as well as file system)
>>>> > * What to see/look for in the log files to debug issues
>>>> >
>>>> > Then we have more nuanced touch points such as the knowledge base of
>>>> > what happens internally in the system when 'X' API is invoked or when
>>>> > 'Y' behavior is observed in ZK data.
>>>> >
>>>> > The problem with delaying the review process till code completion is
>>>> > that any changes based on review comments will require a massive amount
>>>> > of work.
>>>> >
>>>> > I don't have an answer to how we achieve it. But, I clearly see this
>>>> > as a major gap in our development process today.
>>>> >
>>>> > This discussion may not be relevant in this thread, maybe because no
>>>> > behavior is changed at all. We don't know yet.
>>>> >
>>>> > What I want to believe is Mark is doing the right thing & it's gonna
>>>> > help us all in dealing with our operational issues. I don't want to
>>>> > interrupt his work with more discussions.
>>>> >
>>>> > Thank you
>>>> >
>>>> >
>>>> >> On Sun, Nov 3, 2019 at 3:32 PM David Smiley <david.w.smiley@gmail.com> wrote:
>>>> >>
>>>> >> Yeah we do a bad job of the things you listed Noble.  :-(   My
>>>> >> colleagues want pointers to internal docs but the sad reality is there
>>>> >> aren't any.  You may notice I'm a stickler in my code reviews for requiring
>>>> >> javadocs on all top level classes.  I think more javadocs and code comments
>>>> >> would be very helpful -- especially for the major classes.  This might help
>>>> >> us all and others a lot more.  For example I think Lucene does a rather
>>>> >> fine job of this for its major classes -- IndexWriter being a good example.
>>>> >>
>>>> >> ~ David Smiley
>>>> >> Apache Lucene/Solr Search Developer
>>>> >> http://www.linkedin.com/in/davidwsmiley
>>>> >>
>>>> >>
>>>> >>> On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <noble.paul@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I believe there is a consensus on what is wrong with the way we
>>>> >>> have built the cluster state and overseer. We need to focus a bit more on
>>>> >>> the design aspect. Design, according to me, has the following elements:
>>>> >>>
>>>> >>> * How does it work?
>>>> >>>
>>>> >>> * What are the performance characteristics? Can it be done more
>>>> >>> efficiently?
>>>> >>>
>>>> >>> * What are the public touch points?
>>>> >>>
>>>> >>> ** Which are the files we store in ZK? Are they expected to be
>>>> >>> watched always?
>>>> >>>
>>>> >>> ** Or are they read on demand?
>>>> >>>
>>>> >>> ** The public APIs. Does it make sense to the user? Can it be
>>>> >>> further simplified? How does it compare to the other APIs in the system?
>>>> >>>
>>>> >>>
>>>> >>> We, as a community, do a bad job in dealing with these. While we
>>>> >>> focus on internal things, these are not discussed until it is too late. We
>>>> >>> usually do coding, tests, code review (sometimes) and commit. This leads to
>>>> >>> huge technical debt.
>>>> >>>
>>>> >>>
>>>> >>> This is not to put blame on one person or a group of people. (I
>>>> >>> occasionally see people discussing design issues upfront, I just hope that
>>>> >>> is the norm.)
>>>> >>>
>>>> >>>
>>>> >>> Now, why am I discussing this in this thread?
>>>> >>>
>>>> >>>
>>>> >>> While we agree there are problems, we are trying to solve the
>>>> >>> problem using the same process we used to create these problems. Again, I'm
>>>> >>> not questioning the intent or competence of anyone. Unless we set the
>>>> >>> process right, we are doomed to make the same mistakes again.
>>>> >>>
>>>> >>>
>>>> >>> I wholeheartedly endorse any effort to improve SolrCloud/overseer.
>>>> >>> At the same time I fail to see us leveraging the collective experience of
>>>> >>> our community through meaningful discussion.
>>>> >>>
>>>> >>>
>>>> >>> I hope we don't resort to personal attacks and use this as an
>>>> >>> opportunity to improve our processes.
>>>> >>> Thanks
>>>> >>>
>>>> >>> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dragonsinth@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Very much agreed.  I've been trying to figure out for a long time
>>>> >>>> what is the point in having a replica DOWN state that has to be toggled
>>>> >>>> (DOWN and then UP!) every time a node restarts.  Considering that we could
>>>> >>>> just combine ACTIVE and `live_nodes` to understand whether a replica is
>>>> >>>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>>>> >>>> all the replicas DOWN-- that doesn't happen until the node comes back up
>>>> >>>> (perversely).
>>>> >>>>
>>>> >>>> What would it take to get to a state where restarting a node would
>>>> >>>> require a minimal amount of ZK work in most cases?
>>>> >>>>
>>>> >>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> Give me a short bit to follow up and I will lay out my case and
>>>> >>>>> proposal.
>>>> >>>>>
>>>> >>>>> Everyone is then free to decide that we need to do something
>>>> >>>>> drastic or that I'm wrong and we should just continue down the same road.
>>>> >>>>> If that's the case, a lot of your work will get a lot easier and less
>>>> >>>>> impeded by me and we will still all be happier. Win win.
>>>> >>>>>
>>>> >>>>> If we can just not make drastic changes for just a brief week
>>>> >>>>> or so window, I'll say what I have to say, you guys can judge and do
>>>> >>>>> whatever you please.
>>>> >>>>>
>>>> >>>>> - mark
>>>> >>>>>
>>>> >>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> Hey all Solr devs,
>>>> >>>>>>
>>>> >>>>>> SolrCloud is sick right now. The way low level ZooKeeper is
>>>> >>>>>> handled, the Overseer, is a mix and mess of proper exception handling and
>>>> >>>>>> super slow startup and shutdown, adding new things all the time with no
>>>> >>>>>> concern for performance or proper ordering (which is harder to tell than
>>>> >>>>>> you think).
>>>> >>>>>>
>>>> >>>>>> Our class dependency graph doesn't even work - we just force it.
>>>> >>>>>> Sort of. If the whole system doesn't block and choke its way to a start
>>>> >>>>>> slow enough, lots of things fail.
>>>> >>>>>>
>>>> >>>>>> This thing coughs up, you toss stuff into the storm, a good
>>>> >>>>>> chunk of time, what you want eventually comes back without causing too much
>>>> >>>>>> damage.
>>>> >>>>>>
>>>> >>>>>> There are so many things that are off or just plain wrong and the
>>>> >>>>>> list is growing and growing. No one is following this or if you are, please
>>>> >>>>>> back me up. This thing will collapse under its own weight.
>>>> >>>>>>
>>>> >>>>>> So if you want to add yet another state format cluster state or
>>>> >>>>>> some other optimization on this junk heap, you can expect me to push back.
>>>> >>>>>>
>>>> >>>>>> We should all be embarrassed by the state of things.
>>>> >>>>>>
>>>> >>>>>> I've got some ideas for addressing them that I'll share soon,
>>>> >>>>>> but god, don't keep optimizing a turd in non backcompat Overseer loving
>>>> >>>>>> ways. That Overseer is an atrocity.
>>>> >>>>>>
>>>> >>>>>> --
>>>> >>>>>> - Mark
>>>> >>>>>>
>>>> >>>>>> http://about.me/markrmiller
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > -----------------------------------------------------
>>>> > Noble Paul
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >


-- 
- Mark

http://about.me/markrmiller
