uima-dev mailing list archives

From "Lou DeGenaro (JIRA)" <...@uima.apache.org>
Subject [jira] [Comment Edited] (UIMA-5883) DUCC JobDriver (JD) may cause job to never process all work items if JobProcess (JP) is preempted
Date Fri, 05 Oct 2018 11:35:00 GMT

    [ https://issues.apache.org/jira/browse/UIMA-5883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639693#comment-16639693 ]

Lou DeGenaro edited comment on UIMA-5883 at 10/5/18 11:34 AM:
--------------------------------------------------------------

Modified the code to call a new method, ungetMetaMetaCas, when the requesting JP is known to be
preempted or otherwise being dismissed.

Tested the fix by hacking the code to occasionally fake the requesting process as "down"; the
logs showed that CASes were requeued and that the Job ran successfully to completion.
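
The requeue idea can be sketched as follows. This is a minimal illustration under stated
assumptions, not the actual DUCC code: the class RequeueSketch, the methods get and unget, the
string work items, and the set of discontinued requesters are all hypothetical stand-ins. Only
the behavior (return the CAS to the queue when the requester is being dismissed) comes from the
comment above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

class RequeueSketch {
    // Hypothetical stand-in for the work items (CASes) the JD has obtained from the CR.
    static final Deque<String> pending = new ArrayDeque<>();

    // Hypothetical: hand the next work item to a requester, or return null if none is given out.
    static String get(String requester, Set<String> discontinued) {
        String cas = pending.poll();
        if (cas == null) {
            return null; // nothing left to dispatch
        }
        if (discontinued.contains(requester)) {
            // The requester is being dismissed: put the work item back instead of dropping it.
            unget(cas);
            return null;
        }
        return cas;
    }

    // Hypothetical: return a work item to the front of the queue so it is dispatched next.
    static void unget(String cas) {
        pending.addFirst(cas);
    }

    public static void main(String[] args) {
        pending.add("wi-1");
        pending.add("wi-2");
        Set<String> discontinued = Set.of("jp-preempted");
        String a = get("jp-preempted", discontinued); // requeued, nothing handed out
        String b = get("jp-healthy", discontinued);   // receives the requeued item
        System.out.println(a + " " + b + " " + pending.size()); // prints "null wi-1 1"
    }
}
```

Without the unget call, the item polled for the dismissed requester would be neither queued nor
dispatched, which matches the hang described in the issue.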

 

Example log entries:

05 Oct 2018 06:33:59,264 WARN ActionGet - T[36] engage seqNo=? remote=bluej420.bluej.net.10483.38 node=bluej420.bluej.net pid=10483 text=process discontinued
05 Oct 2018 06:33:59,265 WARN ActionGet - T[36] ungetMetaMetaCas seqNo=? remote=bluej420.bluej.net.10483.38 node=bluej420.bluej.net pid=10483 userKey2056 5 15 7.output
05 Oct 2018 06:34:05,060 WARN ActionGet - T[35] ungetMetaMetaCas seqNo=? remote=bluej420.bluej.net.10483.38 node=bluej420.bluej.net pid=10483 userKey3173 9 15 7.output
05 Oct 2018 06:34:12,899 WARN ActionGet - T[37] ungetMetaMetaCas seqNo=? remote=bluej420.bluej.net.10483.38 node=bluej420.bluej.net pid=10483 userKey4510 13 15 7.output

 

Also fixed a logging error: the "process discontinued" message is supposed to appear in the
log just once, on the first occasion, but the code had it backwards and logged it on every
occasion except the first!
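
A log-once check of this kind can be sketched as below. This is an assumption-laden
illustration, not the DUCC implementation: the class LogOnceSketch, the method shouldLog, and
keying on the remote identifier are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class LogOnceSketch {
    // Hypothetical: remote process identifiers that have already been reported.
    private final Set<String> reported = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given remote is seen.
    boolean shouldLog(String remote) {
        // Set.add returns true only if the element was not already present.
        return reported.add(remote);
    }
}
```

Negating the result (returning !reported.add(remote)) would reproduce the inverted behavior
described above: silent on the first occasion and logging on every later one.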



> DUCC JobDriver (JD) may cause job to never process all work items if JobProcess (JP) is preempted
> -------------------------------------------------------------------------------------------------
>
>                 Key: UIMA-5883
>                 URL: https://issues.apache.org/jira/browse/UIMA-5883
>             Project: UIMA
>          Issue Type: Bug
>          Components: DUCC
>            Reporter: Lou DeGenaro
>            Assignee: Lou DeGenaro
>            Priority: Major
>             Fix For: 2.2.3-Ducc
>
>
> Noticed on the Apache DUCC demo that Job 14493 had work items total=10001, completed=9986,
> dispatch=15, and made no further progress.  Looking in work-item-state.json, we see the 9986
> that have completed and can infer precisely those that did not.  Then, looking in the JD log
> for the work items not yet complete, we see entries similar to:
> 25 Sep 2018 23:28:28,042 WARN ActionGet - T[14] engage seqNo=? remote=uima-ducc-demo-6.8448.25 node=uima-ducc-demo-6 pid=8448 text=process discontinued
> Looking at the code, we see that under this condition the JD has obtained a CAS from the CR
> but chooses not to give it to the requesting JP, since the JD knows that the requester has
> been targeted for termination (e.g. preempted).  But the JD forgets to put the CAS back into
> the queue!  Those CASes therefore never get processed, and the Job hangs forever.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
