hive-issues mailing list archives

From "Adam Szita (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves
Date Fri, 15 Dec 2017 11:13:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292354#comment-16292354 ]

Adam Szita commented on HIVE-18263:
-----------------------------------

Thanks for reviewing [~zsombor.klara], and committing [~pvary]!

This morning, after the change was committed, I restarted ptest2 and have been monitoring
its log.
It started by terminating the 12 slaves it had been using before and simultaneously creating
12 new ones, all according to plan:
{code}
2017-12-15 09:44:41 INFO  [localhost-startStop-1] ExecutionController:85 - Reading configuration from file: /opt/apache-tomcat-7.0.72/conf/cloudhost.properties
2017-12-15 09:44:46 INFO  [localhost-startStop-1] CloudExecutionContextProvider:130 - CloudExecutionContextProvider maxHostsPerCreateRequest = 2
2017-12-15 09:44:46 INFO  [localhost-startStop-1] CloudExecutionContextProvider:421 - Requesting termination of [35.225.33.208, 104.198.248.189, 35.184.205.41, 35.184.14.247, 35.192.52.184, 104.154.229.171, 35.192.216.79, 35.224.189.104, 35.184.147.31, 35.225.218.206, 104.198.217.87, 35.224.37.167]
{code}
One hour later the caretaker thread saw the following IPs in {{mTerminatedHosts}}:
{code}
2017-12-15 10:44:51 INFO  [CloudExecutionContextProvider-BackgroundWorker] CloudExecutionContextProvider:340 - Performing background work
2017-12-15 10:44:51 INFO  [CloudExecutionContextProvider-BackgroundWorker] CloudExecutionContextProvider:345 - Currently tracked terminated hosts: [35.225.33.208, 35.184.205.41, 104.154.229.171, 35.192.216.79, 35.184.147.31, 35.225.218.206, 35.224.37.167]
{code}
...and it did nothing further: no slaves were killed.

All 7 of these IPs can be found in the list of 12 old slaves above (they were killed at startup).
The remaining 5 old IPs are no longer tracked, so without this change it is very likely that
5 slaves would have been killed unnecessarily during test execution.
(E.g. the host with IP 104.198.217.87 is running right now and reuses an IP from the old group,
yet it is not part of {{mTerminatedHosts}}.)
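
The comparison above can be reproduced with a small set difference over the two logged lists. This is only an illustrative sketch: the class {{ReusedIpCheck}} and its method are hypothetical names, not part of ptest2, and the IP lists are copied from the log excerpts above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: ReusedIpCheck is a hypothetical helper, not ptest2 code.
class ReusedIpCheck {
  // The 12 IPs whose termination was requested at startup (from the 09:44:46 log line).
  static final List<String> OLD_SLAVES = Arrays.asList(
      "35.225.33.208", "104.198.248.189", "35.184.205.41", "35.184.14.247",
      "35.192.52.184", "104.154.229.171", "35.192.216.79", "35.224.189.104",
      "35.184.147.31", "35.225.218.206", "104.198.217.87", "35.224.37.167");

  // The 7 IPs still tracked as terminated one hour later (from the 10:44:51 log line).
  static final List<String> TRACKED = Arrays.asList(
      "35.225.33.208", "35.184.205.41", "104.154.229.171", "35.192.216.79",
      "35.184.147.31", "35.225.218.206", "35.224.37.167");

  /** Returns the IPs terminated at startup that are no longer tracked, in their original order. */
  static Set<String> noLongerTracked(List<String> terminatedAtStartup, List<String> trackedNow) {
    Set<String> result = new LinkedHashSet<>(terminatedAtStartup);
    result.removeAll(trackedNow);
    return result;
  }

  public static void main(String[] args) {
    // Prints the 5 old IPs that are absent from the tracked list.
    System.out.println(noLongerTracked(OLD_SLAVES, TRACKED));
  }
}
```

For the lists above this yields 5 IPs, including 104.198.217.87. The logs alone do not show which of the other four are currently in use by new slaves.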

All in all, I believe this is now working as it should.

> Ptest execution are multiple times slower sometimes due to dying executor slaves
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-18263
>                 URL: https://issues.apache.org/jira/browse/HIVE-18263
>             Project: Hive
>          Issue Type: Bug
>          Components: Testing Infrastructure
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>             Fix For: 3.0.0
>
>         Attachments: HIVE-18263.0.patch, HIVE-18263.1.patch
>
>
> The PreCommit-HIVE-Build job has been seen running very long from time to time. Usually it should take about 1.5 hours, but in some cases it took over 4-5 hours.
> Looking at the logs of one such execution, I saw that some commands sent to the test-executing slaves returned 255. Here this typically means that the return code of the remote call is unknown, since hiveptest-server can't reach these slaves anymore.
> The hiveptest-server logs show that some slaves were killed while the job was running normally, and here is why:
> * Hive's ptest-server checks the status of slaves every 60 minutes. It also keeps track of slaves that were terminated.
> ** If such a check finds that a slave that was already killed ([mTerminatedHosts map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93] contains its IP) is still running, it will try to terminate it again.
> * The server also maintains a file on its local FS that contains the IPs of hosts that were used before. (This is probably for resilience reasons.)
> ** This file is read when the tomcat server starts, and if any of the IPs in the file are seen as running slaves, ptest will terminate these first so it can begin with a fresh start.
> ** The IPs of these terminated instances already make their way into {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 minutes and might see a running host that has the same IP as an old slave that was terminated at startup. It will think that this host should be terminated, since its IP is in {{mTerminatedHosts}} and termination was therefore already attempted 60 minutes ago.
> We have to fix this by making sure that when a new slave is created, we check the contents of {{mTerminatedHosts}} and remove its IP from it if it is there.
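
The idea behind the fix described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed patch: the class {{TerminatedHostTracker}} and all of its method names are hypothetical, and only the behaviour (dropping a new slave's IP from the terminated-hosts map) comes from the issue description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: class and method names are NOT taken from ptest2.
class TerminatedHostTracker {
  // IP -> time (ms since epoch) at which termination was requested.
  private final Map<String, Long> terminatedHosts = new ConcurrentHashMap<>();

  /** Records that termination of the host with this IP was requested. */
  void markTerminated(String ip) {
    terminatedHosts.put(ip, System.currentTimeMillis());
  }

  /** The fix: a newly created slave may reuse an old IP, so stop tracking that IP. */
  void onSlaveCreated(String ip) {
    terminatedHosts.remove(ip);
  }

  /** Returns the running IPs that are still tracked as terminated and should be killed again. */
  Set<String> hostsToTerminateAgain(Set<String> runningIps) {
    Set<String> result = new TreeSet<>(runningIps);
    result.retainAll(terminatedHosts.keySet());
    return result;
  }

  public static void main(String[] args) {
    TerminatedHostTracker tracker = new TerminatedHostTracker();
    tracker.markTerminated("104.198.217.87");  // old slave killed at startup
    tracker.onSlaveCreated("104.198.217.87");  // cloud provider reuses the IP for a new slave
    Set<String> running = new HashSet<>(Arrays.asList("104.198.217.87"));
    // Prints an empty set: the new slave is not mistaken for the old one.
    System.out.println(tracker.hostsToTerminateAgain(running));
  }
}
```

Without the {{onSlaveCreated}} call, the periodic check in this sketch would return the reused IP and the new slave would be terminated in the middle of a test run.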



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
