ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Pavlov <dpavlov....@gmail.com>
Subject Re: Nodes which started in separate JVM couldn't stop properly (in tests)
Date Wed, 21 Mar 2018 17:16:36 GMT
Hi Nickolay,

it seems we have lazy consesus here.

Failures:  tests 11 suites 1, all these tests are failed in master.

Could you merge?

Sincerely,
Dmitriy Pavlov

пт, 16 мар. 2018 г. в 18:29, Nikolay Izhikov <nizhikov@apache.org>:

> Hello, Guys.
>
> I'm reviewed changes and it looks good to me.
> There is a simple reproducer for a bug in test framework, see below.
>
> It fails in master and works in branch.
>
> I'm planning to merge the fix [1] if Run All will be OK.
>
> Please, write to me if you have any objections.
>
> [1] https://github.com/apache/ignite/pull/2382
>
> ```
> public class MultiJvmSelfTest extends GridCommonAbstractTest {
>     @Override protected boolean isMultiJvm() { return true; }
>
>     public void testGrid() throws Exception {
>         final IgniteInternalFuture fut = GridTestUtils.runAsync(new
> RunnableX() {
>             @Override public void runx() throws Exception {
>                 try {
>                     startGrid(0);
>                     startGrid(1);
>                 }
>                 finally {
>                     stopGrid(1);
>                     stopGrid(0);
>                 }
>             }
>         });
>
>         try {
>             fut.get(20_000L);
>         } finally {
>             stopAllGrids(true);
>         }
>     }
> }
> ```
>
> В Чт, 15/03/2018 в 15:59 +0000, Dmitry Pavlov пишет:
> > I see now. Thank you.
> >
> > Nikolay, could you please merge this change?
> >
> > чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <daradurvs@gmail.com>:
> >
> > > In brief:
> > > Nodes in *separate* JVMs are shutting down by the computing task
> > > *StopGridTask* which has sent from *local* JVM *synchronously* that
> > > means *local* node must wait for task's finish.
> > >
> > > At the same time when a node in *separate* JVM executes the received
> > > *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName,
> > > FALSE)* which is waiting for all computing task's finish, including
> > > *StopGridTask* which has invoked it.
> > >
> > > We have some kind of deadlock:
> > > *Local* node is waiting for the computing task's finish which is
> > > waiting for finish of execution *G.stop* which is waiting for all
> > > computing tasks finish including *StopGridTask*.
> > >
> > > We have not noticed that before because we use only stopAllGrids() in
> > > out tests which stop local JVM without waiting for nodes in other
> > > JVMs.
> > >
> > >
> > >
> > > On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov.spb@gmail.com>
> > > wrote:
> > > > Please address comments in PR.
> > > >
> > > > I did not fully understood why sync GridStopMessage message was
> lost, but
> > > > async will be successfull. Probably we need discuss it briefly.
> > > >
> > > > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <daradurvs@gmail.com
> >:
> > > > >
> > > > > Thank you, Dmitry!
> > > > >
> > > > > I'll join this review soon.
> > > > >
> > > > > On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <
> dpavlov.spb@gmail.com>
> > > > > wrote:
> > > > > > Hi Vyacheslav,
> > > > > >
> > > > > > I will take a look, but first of all I am going to review
> > > > > > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502
 -
> it is
> > > > > > impact
> > > > > > change in testing framework. Hope you also will join to this
> review .
> > > > > >
> > > > > > Sincerely,
> > > > > > Dmitiry Pavlov
> > > > > >
> > > > > >
> > > > > > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <
> daradurvs@gmail.com>:
> > > > > > >
> > > > > > > Hi, Dmitry, could you please review it, because you are
one of
> the
> > > > > > > most experienced people in the testing framework.
> > > > > > >
> > > > > > > Please see comment in Jira, because it is in pretty-format
> there.
> > > > > > >
> > > > > > > On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur
> > > > > > > <daradurvs@gmail.com> wrote:
> > > > > > > > Hi Igniters!
> > > > > > > >
> > > > > > > > I have investigated the issue [1] and found that stopping
> node in
> > > > > > > > separate JVM may stuck thread or leave system process
alive
> after
> > > > > > > > test
> > > > > > > > finished.
> > > > > > > > The main reason is *StopGridTask* that we send from
node in
> local
> > >
> > > JVM
> > > > > > > > to node in separate JVM via remote computing.
> > > > > > > > We send job synchronously to be sure that node will
be
> stopped, but
> > > > > > > > job calls synchronously *G.stop(igniteInstanceName,
> cancel))* with
> > > > > > > > *cancel = false*, that means node must wait to compute
jobs
> before
> > >
> > > it
> > > > > > > > goes down what leads to some kind of deadlock. Using
of
> *cancel =
> > > > > > > > true* would solve the issue but may break some tests’
logic,
> for
> > >
> > > this
> > > > > > > > reason, I've reworked the method’s synchronization
logic [2].
> > > > > > > >
> > > > > > > > We have not noticed that before because we use only
> > >
> > > *stopAllGrids()*
> > > > > > > > in out tests which stop local JVM without waiting
for nodes
> in
> > >
> > > other
> > > > > > > > JVMs.
> > > > > > > > I believe this fix should reduce the number of flaky
tests on
> > > > > > > > TeamCity, especially which fails because of a cluster
from
> the
> > > > > > > > previous test has not been stopped properly.
> > > > > > > >
> > > > > > > > Ci.tests [3] look a bit better than in master.
> > > > > > > > Please review prepared PR [2] and share your thoughts.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > > > > > > > [2] https://github.com/apache/ignite/pull/2382
> > > > > > > > [3]
> https://ci.ignite.apache.org/viewLog.html?buildId=1105939
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur
> > > > > > > > <daradurvs@gmail.com> wrote:
> > > > > > > > > Hi Igniters,
> > > > > > > > >
> > > > > > > > > Working on my task I found a bug at call the
method
> > >
> > > #stopGrid(name),
> > > > > > > > > it produced ClassCastException. I created a ticket[1].
> > > > > > > > >
> > > > > > > > > After it was fixed[2] I saw that nodes which
was started
> in a
> > > > > > > > > separate
> > > > > > > > > JVM
> > > > > > > > > could stay in process of operation system.
> > > > > > > > > It was fixed too, but not sure is it fixed in
proper way
> or not.
> > > > > > > > >
> > > > > > > > > Could someone review it?
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910
> > > > > > > > > [2] https://github.com/apache/ignite/pull/2382
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards, Vyacheslav D.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best Regards, Vyacheslav D.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards, Vyacheslav D.
> > >
> > >
> > >
> > > --
> > > Best Regards, Vyacheslav D.
> > >

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message