ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vyacheslav Daradur <daradu...@gmail.com>
Subject Re: Nodes which started in separate JVM couldn't stop properly (in tests)
Date Thu, 15 Mar 2018 15:48:43 GMT
In brief:
Nodes in *separate* JVMs are shutting down by the computing task
*StopGridTask* which has sent from *local* JVM *synchronously* that
means *local* node must wait for task's finish.

At the same time when a node in *separate* JVM executes the received
*StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
FALSE)* which is waiting for all computing task's finish, including
*StopGridTask* which has invoked it.

We have some kind of deadlock:
*Local* node is waiting for the computing task's finish which is
waiting for finish of execution *G.stop* which is waiting for all
computing tasks finish including *StopGridTask*.

We have not noticed that before because we use only stopAllGrids() in
out tests which stop local JVM without waiting for nodes in other
JVMs.



On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov.spb@gmail.com> wrote:
> Please address comments in PR.
>
> I did not fully understood why sync GridStopMessage message was lost, but
> async will be successfull. Probably we need discuss it briefly.
>
> чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <daradurvs@gmail.com>:
>>
>> Thank you, Dmitry!
>>
>> I'll join this review soon.
>>
>> On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dpavlov.spb@gmail.com>
>> wrote:
>> > Hi Vyacheslav,
>> >
>> > I will take a look, but first of all I am going to review
>> > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502  - it is
>> > impact
>> > change in testing framework. Hope you also will join to this review .
>> >
>> > Sincerely,
>> > Dmitiry Pavlov
>> >
>> >
>> > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <daradurvs@gmail.com>:
>> >>
>> >> Hi, Dmitry, could you please review it, because you are one of the
>> >> most experienced people in the testing framework.
>> >>
>> >> Please see comment in Jira, because it is in pretty-format there.
>> >>
>> >> On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
>> >> <daradurvs@gmail.com> wrote:
>> >> > Hi Igniters!
>> >> >
>> >> > I have investigated the issue [1] and found that stopping node in
>> >> > separate JVM may stuck thread or leave system process alive after
>> >> > test
>> >> > finished.
>> >> > The main reason is *StopGridTask* that we send from node in local JVM
>> >> > to node in separate JVM via remote computing.
>> >> > We send job synchronously to be sure that node will be stopped, but
>> >> > job calls synchronously *G.stop(igniteInstanceName, cancel))* with
>> >> > *cancel = false*, that means node must wait to compute jobs before
it
>> >> > goes down what leads to some kind of deadlock. Using of *cancel =
>> >> > true* would solve the issue but may break some tests’ logic, for
this
>> >> > reason, I've reworked the method’s synchronization logic [2].
>> >> >
>> >> > We have not noticed that before because we use only *stopAllGrids()*
>> >> > in out tests which stop local JVM without waiting for nodes in other
>> >> > JVMs.
>> >> > I believe this fix should reduce the number of flaky tests on
>> >> > TeamCity, especially which fails because of a cluster from the
>> >> > previous test has not been stopped properly.
>> >> >
>> >> > Ci.tests [3] look a bit better than in master.
>> >> > Please review prepared PR [2] and share your thoughts.
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> >> > [2] https://github.com/apache/ignite/pull/2382
>> >> > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939
>> >> >
>> >> >
>> >> > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
>> >> > <daradurvs@gmail.com> wrote:
>> >> >> Hi Igniters,
>> >> >>
>> >> >> Working on my task I found a bug at call the method #stopGrid(name),
>> >> >> it produced ClassCastException. I created a ticket[1].
>> >> >>
>> >> >> After it was fixed[2] I saw that nodes which was started in a
>> >> >> separate
>> >> >> JVM
>> >> >> could stay in process of operation system.
>> >> >> It was fixed too, but not sure is it fixed in proper way or not.
>> >> >>
>> >> >> Could someone review it?
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> >> >> [2] https://github.com/apache/ignite/pull/2382
>> >> >>
>> >> >> --
>> >> >> Best Regards, Vyacheslav D.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best Regards, Vyacheslav D.
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards, Vyacheslav D.
>>
>>
>>
>> --
>> Best Regards, Vyacheslav D.



-- 
Best Regards, Vyacheslav D.

Mime
View raw message