airavata-issues mailing list archives

From "Eroma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures
Date Tue, 13 Nov 2018 20:05:00 GMT
Eroma created AIRAVATA-2943:
-------------------------------

             Summary: Re-queueing and node failures in HPC clusters need to be handled in
gateway middleware as resubmitting failures 
                 Key: AIRAVATA-2943
                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
             Project: Airavata
          Issue Type: Bug
          Components: helix implementation
    Affects Versions: 0.18
         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
            Reporter: Eroma
            Assignee: Dimuthu Upeksha
             Fix For: 0.18


Currently, on clusters (PBS and SLURM), jobs are being re-queued after node failures.
In such scenarios the jobs are executed after re-queueing, but on the gateway side the job
is marked as FAILED at the initial NODE_FAIL.

These types of failures need to be captured as retrying failures instead of being treated
as the end result.
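
One possible shape for this handling is sketched below. It is illustrative only and not taken from the Airavata code base: the GatewayStatus values, the resolve_status helper and the max_node_fail_polls threshold are all hypothetical names. Only the scheduler state strings (PENDING, REQUEUED, RUNNING, COMPLETED, FAILED, NODE_FAIL) are SLURM job state names. The idea is to treat NODE_FAIL as a non-terminal "retrying" status, and to report FAILED only if the job stays in NODE_FAIL for several consecutive polls without being re-queued.

```python
from enum import Enum


class GatewayStatus(Enum):
    """Hypothetical gateway-side job statuses (assumed, not Airavata's own)."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    RETRYING = "RETRYING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# SLURM state names mapped to the assumed gateway statuses.
# NODE_FAIL is deliberately mapped to a non-terminal status.
STATE_MAP = {
    "PENDING": GatewayStatus.QUEUED,
    "REQUEUED": GatewayStatus.QUEUED,
    "RUNNING": GatewayStatus.ACTIVE,
    "COMPLETED": GatewayStatus.COMPLETE,
    "FAILED": GatewayStatus.FAILED,
    "NODE_FAIL": GatewayStatus.RETRYING,
}


def resolve_status(history, max_node_fail_polls=3):
    """Return the gateway status for a job.

    history: scheduler states observed for the job, oldest first.
    max_node_fail_polls: assumed threshold; after this many consecutive
    NODE_FAIL observations with no re-queue, the job is reported as FAILED.
    """
    if not history:
        raise ValueError("history must contain at least one state")

    latest = history[-1]
    status = STATE_MAP.get(latest)
    if status is None:
        raise ValueError(f"unknown scheduler state: {latest}")

    if status is GatewayStatus.RETRYING:
        # Count how many of the most recent observations were NODE_FAIL.
        trailing = 0
        for state in reversed(history):
            if state != "NODE_FAIL":
                break
            trailing += 1
        if trailing >= max_node_fail_polls:
            return GatewayStatus.FAILED

    return status
```

With this sketch, a history of RUNNING, NODE_FAIL, PENDING, RUNNING, COMPLETED ends as COMPLETE on the gateway side instead of stopping at FAILED on the first NODE_FAIL.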



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
