hadoop-yarn-dev mailing list archives

From Grandl Robert <rgra...@yahoo.com.INVALID>
Subject save job history log file
Date Mon, 02 Jun 2014 23:51:24 GMT
Hi,

I am running Hadoop 2.4. Some of the jobs fail to save their job history file, and the client throws the following exception:

14/06/02 14:07:50 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED.
Redirecting to job history server
......................
java.io.IOException: Job status not available 
    at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
    at org.apache.hadoop.mapreduce.Job.getStatus(Job.java:329)
    at org.apache.hadoop.mapred.gridmix.JobMonitor$MonitorThread.run(JobMonitor.java:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
.........................

The AMs of the failed jobs throw the following exception at the end of their lifetime, when they need to finalize the job history logging:

2014-05-23 18:36:20,598 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Calling handler for JobFinishedEvent
2014-05-23 18:36:20,599 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1400892230783_0027Job Transitioned from COMMITTING to SUCCEEDED
2014-05-23 18:36:20,614 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: We are finishing cleanly so this is the last retry
2014-05-23 18:36:20,614 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true
2014-05-23 18:36:20,614 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true
2014-05-23 18:36:20,614 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true
2014-05-23 18:36:20,614 INFO [Thread-2693] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true
2014-05-23 18:36:20,614 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services
2014-05-23 18:36:20,618 INFO [Thread-2693] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 2419
2014-05-23 18:36:20,618 INFO [Thread-2693] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing event REDUCE_ATTEMPT_STARTED
2014-05-23 18:36:20,619 ERROR [Thread-2693] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing History Event: org.apache.hadoop.mapreduce.jobhistory.TaskAttemptStartedEvent@1dd4406c
java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-05-23 18:36:20,620 INFO [Thread-2693] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:525)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:335)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1471)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1085)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:554)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:605)
Caused by: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-05-23 18:36:20,620 WARN [Thread-2693] org.apache.hadoop.service.CompositeService: When stopping the service JobHistoryEventHandler : org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:525)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:335)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: All datanodes 10.200.148.33:50010 are bad. Aborting...
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:525)
        at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:335)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1471)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1085)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:554)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:605)
Caused by: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-05-23 18:36:20,633 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to
2014-05-23 18:36:20,648 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: History url is http://xxx:19888/jobhistory/job/job_1400892230783_0027
2014-05-23 18:36:20,682 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Waiting for application to be successfully unregistered.
2014-05-23 18:36:21,683 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:1000 CompletedReds:1000 ContAlloc:2040 ContRel:18 HostLocal:800 RackLocal:201
2014-05-23 18:36:21,685 INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory hdfs://xxx:8020/tmp/hadoop-yarn/staging/hadoop/.staging/job_1400892230783_0027
2014-05-23 18:36:21,687 INFO [Thread-2693] org.apache.hadoop.ipc.Server: Stopping server on 38222
2014-05-23 18:36:21,689 INFO [IPC Server listener on 38222] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 38222
2014-05-23 18:36:21,690 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-05-23 18:36:21,691 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted
2014-05-23 18:36:21,693 INFO [Thread-2693] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
Caused by: java.io.IOException: All datanodes 10.xx.xx.xx:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1128)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
[hadoop@xxx ~]$ INFO [Thread-2693] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Exiting MR AppMaster..GoodBye!
////////////////////////////////

I have tried tuning various settings, such as increasing the replication factor, dfs.client.block.write.retries, and dfs.datanode.handler.count, but none of them helped.
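For reference, the kind of changes I experimented with were made in hdfs-site.xml, roughly like this (the values below are just illustrative examples of what I tried, not recommendations):

```xml
<!-- hdfs-site.xml: illustrative values only -->
<configuration>
  <!-- raise replication so the pipeline has more datanodes to fall back on -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- more retries when setting up the write pipeline (default is 3) -->
  <property>
    <name>dfs.client.block.write.retries</name>
    <value>10</value>
  </property>
  <!-- more datanode RPC handler threads (default is 10) -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>20</value>
  </property>
</configuration>
```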

Is there any way to save the job history log locally instead of pushing it to HDFS?
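As far as I can tell, the properties that control where history files land are mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in mapred-site.xml; since they take Hadoop filesystem URIs, pointing them at a file:/// path might keep the history off HDFS. This is an untested sketch, and the JobHistoryServer would need to see the same path on its own host:

```xml
<!-- mapred-site.xml: untested sketch, local paths are examples -->
<configuration>
  <!-- where the AM moves the in-progress history file on job completion -->
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>file:///var/log/hadoop/mr-history/intermediate</value>
  </property>
  <!-- where the JobHistoryServer archives completed job histories -->
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>file:///var/log/hadoop/mr-history/done</value>
  </property>
</configuration>
```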

I have also tried copying the file that JobHistoryEventHandler.java references via fileMap -> metainfo -> getHistoryFile(), but the file seems to be incomplete at that point.

//////////////////////////////
In addition, the culprit DataNode throws the following exception:

2014-05-28 17:51:38,040 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: xx:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.xx.xx.xx:32900 dst: /10.xx.xx.xx:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:722)
2014-05-28 17:51:38,040 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: xx:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.xx.xx.xx:50443 dst: /10.xx.xx.xx:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:722)
2014-05-28 17:51:38,047 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1758984974-10.200.148.12-1389807587568:blk_1127162839_53635254
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:722)
 
There are also some other interesting errors, although I am not sure they are related.
 
2014-05-28 17:51:37,998 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
        at sun.nio.ch.IOUtil.write(IOUtil.java:60)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1310)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1250)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1167)
        at java.lang.Thread.run(Thread.java:722)
2014-05-28 17:51:37,999 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
        at sun.nio.ch.IOUtil.write(IOUtil.java:60)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1310)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1250)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1167)
        at java.lang.Thread.run(Thread.java:722)
2014-05-28 17:51:37,999 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
2014-05-28 17:51:37,999 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1758984974-10.200.148.12-1389807587568:blk_1127150470_53622876, type=LAST_IN_PIPELINE, downstreams=0:[]
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
        at sun.nio.ch.IOUtil.write(IOUtil.java:60)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1310)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1250)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$P
///////////////////////////////

Thanks,
robert