ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vyacheslav Daradur <daradu...@gmail.com>
Subject Re: Nodes which started in separate JVM couldn't stop properly (in tests)
Date Fri, 23 Mar 2018 09:58:18 GMT
Dmitry, Nikolay, thanks for your help!

Since the changes are merged we are able to implement this feature in
Compatibility Testing Framework.

It provides us to work flexibly with multi-version cluster and
implement new testing scenarios (I know that Ignite doesn't have such
feature yet).

Does it make sense to create a ticket?


On Wed, Mar 21, 2018 at 8:16 PM, Dmitry Pavlov <dpavlov.spb@gmail.com> wrote:
> Hi Nickolay,
>
> it seems we have lazy consesus here.
>
> Failures:  tests 11 suites 1, all these tests are failed in master.
>
> Could you merge?
>
> Sincerely,
> Dmitriy Pavlov
>
> пт, 16 мар. 2018 г. в 18:29, Nikolay Izhikov <nizhikov@apache.org>:
>
>> Hello, Guys.
>>
>> I'm reviewed changes and it looks good to me.
>> There is a simple reproducer for a bug in test framework, see below.
>>
>> It fails in master and works in branch.
>>
>> I'm planning to merge the fix [1] if Run All will be OK.
>>
>> Please, write to me if you have any objections.
>>
>> [1] https://github.com/apache/ignite/pull/2382
>>
>> ```
>> public class MultiJvmSelfTest extends GridCommonAbstractTest {
>>     @Override protected boolean isMultiJvm() { return true; }
>>
>>     public void testGrid() throws Exception {
>>         final IgniteInternalFuture fut = GridTestUtils.runAsync(new
>> RunnableX() {
>>             @Override public void runx() throws Exception {
>>                 try {
>>                     startGrid(0);
>>                     startGrid(1);
>>                 }
>>                 finally {
>>                     stopGrid(1);
>>                     stopGrid(0);
>>                 }
>>             }
>>         });
>>
>>         try {
>>             fut.get(20_000L);
>>         } finally {
>>             stopAllGrids(true);
>>         }
>>     }
>> }
>> ```
>>
>> В Чт, 15/03/2018 в 15:59 +0000, Dmitry Pavlov пишет:
>> > I see now. Thank you.
>> >
>> > Nikolay, could you please merge this change?
>> >
>> > чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <daradurvs@gmail.com>:
>> >
>> > > In brief:
>> > > Nodes in *separate* JVMs are shutting down by the computing task
>> > > *StopGridTask* which has sent from *local* JVM *synchronously* that
>> > > means *local* node must wait for task's finish.
>> > >
>> > > At the same time when a node in *separate* JVM executes the received
>> > > *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
>> > > FALSE)* which is waiting for all computing task's finish, including
>> > > *StopGridTask* which has invoked it.
>> > >
>> > > We have some kind of deadlock:
>> > > *Local* node is waiting for the computing task's finish which is
>> > > waiting for finish of execution *G.stop* which is waiting for all
>> > > computing tasks finish including *StopGridTask*.
>> > >
>> > > We have not noticed that before because we use only stopAllGrids() in
>> > > out tests which stop local JVM without waiting for nodes in other
>> > > JVMs.
>> > >
>> > >
>> > >
>> > > On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov.spb@gmail.com>
>> > > wrote:
>> > > > Please address comments in PR.
>> > > >
>> > > > I did not fully understood why sync GridStopMessage message was
>> lost, but
>> > > > async will be successfull. Probably we need discuss it briefly.
>> > > >
>> > > > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <daradurvs@gmail.com
>> >:
>> > > > >
>> > > > > Thank you, Dmitry!
>> > > > >
>> > > > > I'll join this review soon.
>> > > > >
>> > > > > On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <
>> dpavlov.spb@gmail.com>
>> > > > > wrote:
>> > > > > > Hi Vyacheslav,
>> > > > > >
>> > > > > > I will take a look, but first of all I am going to review
>> > > > > > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502
 -
>> it is
>> > > > > > impact
>> > > > > > change in testing framework. Hope you also will join to
this
>> review .
>> > > > > >
>> > > > > > Sincerely,
>> > > > > > Dmitiry Pavlov
>> > > > > >
>> > > > > >
>> > > > > > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <
>> daradurvs@gmail.com>:
>> > > > > > >
>> > > > > > > Hi, Dmitry, could you please review it, because you
are one of
>> the
>> > > > > > > most experienced people in the testing framework.
>> > > > > > >
>> > > > > > > Please see comment in Jira, because it is in pretty-format
>> there.
>> > > > > > >
>> > > > > > > On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
>> > > > > > > <daradurvs@gmail.com> wrote:
>> > > > > > > > Hi Igniters!
>> > > > > > > >
>> > > > > > > > I have investigated the issue [1] and found that
stopping
>> node in
>> > > > > > > > separate JVM may stuck thread or leave system
process alive
>> after
>> > > > > > > > test
>> > > > > > > > finished.
>> > > > > > > > The main reason is *StopGridTask* that we send
from node in
>> local
>> > >
>> > > JVM
>> > > > > > > > to node in separate JVM via remote computing.
>> > > > > > > > We send job synchronously to be sure that node
will be
>> stopped, but
>> > > > > > > > job calls synchronously *G.stop(igniteInstanceName,
>> cancel))* with
>> > > > > > > > *cancel = false*, that means node must wait to
compute jobs
>> before
>> > >
>> > > it
>> > > > > > > > goes down what leads to some kind of deadlock.
Using of
>> *cancel =
>> > > > > > > > true* would solve the issue but may break some
tests’ logic,
>> for
>> > >
>> > > this
>> > > > > > > > reason, I've reworked the method’s synchronization
logic [2].
>> > > > > > > >
>> > > > > > > > We have not noticed that before because we use
only
>> > >
>> > > *stopAllGrids()*
>> > > > > > > > in out tests which stop local JVM without waiting
for nodes
>> in
>> > >
>> > > other
>> > > > > > > > JVMs.
>> > > > > > > > I believe this fix should reduce the number of
flaky tests on
>> > > > > > > > TeamCity, especially which fails because of a
cluster from
>> the
>> > > > > > > > previous test has not been stopped properly.
>> > > > > > > >
>> > > > > > > > Ci.tests [3] look a bit better than in master.
>> > > > > > > > Please review prepared PR [2] and share your thoughts.
>> > > > > > > >
>> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> > > > > > > > [2] https://github.com/apache/ignite/pull/2382
>> > > > > > > > [3]
>> https://ci.ignite.apache.org/viewLog.html?buildId=1105939
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
>> > > > > > > > <daradurvs@gmail.com> wrote:
>> > > > > > > > > Hi Igniters,
>> > > > > > > > >
>> > > > > > > > > Working on my task I found a bug at call
the method
>> > >
>> > > #stopGrid(name),
>> > > > > > > > > it produced ClassCastException. I created
a ticket[1].
>> > > > > > > > >
>> > > > > > > > > After it was fixed[2] I saw that nodes which
was started
>> in a
>> > > > > > > > > separate
>> > > > > > > > > JVM
>> > > > > > > > > could stay in process of operation system.
>> > > > > > > > > It was fixed too, but not sure is it fixed
in proper way
>> or not.
>> > > > > > > > >
>> > > > > > > > > Could someone review it?
>> > > > > > > > >
>> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
>> > > > > > > > > [2] https://github.com/apache/ignite/pull/2382
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Best Regards, Vyacheslav D.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Best Regards, Vyacheslav D.
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Best Regards, Vyacheslav D.
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best Regards, Vyacheslav D.
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards, Vyacheslav D.
>> > >



-- 
Best Regards, Vyacheslav D.

Mime
View raw message