hadoop-mapreduce-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: division by zero in getLocalPathForWrite()
Date Sun, 13 Jan 2013 16:39:38 GMT
I found this error again, see
https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/345/testReport/org.apache.hadoop.hbase.mapreduce/TestImportExport/testSimpleCase/

2013-01-12 11:53:52,809 WARN  [AsyncDispatcher event handler]
resourcemanager.RMAuditLogger(255): USER=jenkins	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1357991604658_0002 failed 1 times due to AM Container for appattempt_1357991604658_0002_000001 exited with exitCode: -1000 due to: java.lang.ArithmeticException: / by zero
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:368)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
	at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:279)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:851)
.Failing this attempt.. Failing the application.	APPID=application_1357991604658_0002

Here is related code:

        // Keep rolling the wheel till we get a valid path
        Random r = new java.util.Random();
        while (numDirsSearched < numDirs && returnPath == null) {
          long randomPosition = Math.abs(r.nextLong()) % totalAvailable;

My guess is that totalAvailable was 0, meaning dirDF was empty.

Please advise whether that scenario is possible.
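For context, the arithmetic that fails is easy to reproduce in isolation. Below is a minimal, standalone sketch (not Hadoop's actual code; the guard in pickGuarded is a hypothetical fix, not what LocalDirAllocator does) showing that the modulo throws exactly this ArithmeticException when totalAvailable is 0:

```java
import java.util.Random;

public class DivByZeroDemo {
    // Mirrors the arithmetic in AllocatorPerContext.getLocalPathForWrite():
    // long % 0 throws ArithmeticException("/ by zero").
    static long pick(Random r, long totalAvailable) {
        return Math.abs(r.nextLong()) % totalAvailable;
    }

    // Hypothetical guard: fail with a descriptive message instead of
    // surfacing a bare division-by-zero to the caller.
    static long pickGuarded(Random r, long totalAvailable) {
        if (totalAvailable <= 0) {
            throw new IllegalStateException(
                "No local directory has available space (totalAvailable="
                + totalAvailable + ")");
        }
        return Math.abs(r.nextLong()) % totalAvailable;
    }

    public static void main(String[] args) {
        Random r = new Random(42); // seeded so the demo is repeatable
        try {
            pick(r, 0L); // reproduces the failure from the log
        } catch (ArithmeticException e) {
            System.out.println("reproduced: " + e.getMessage());
        }
        System.out.println("guarded pick: " + pickGuarded(r, 100L));
    }
}
```

So if totalAvailable ever reaches that modulo as 0, the stack trace above is exactly what you would see.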

Cheers

On Tue, Oct 30, 2012 at 9:33 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Thanks for the investigation Kihwal.
>
> I will keep an eye on future test failures in TestRowCounter.
>
>
> On Tue, Oct 30, 2012 at 9:29 AM, Kihwal Lee <kihwal@yahoo-inc.com> wrote:
>
>> Ted,
>>
>> I couldn't reproduce it by just running the test case. When you reproduce
>> it, look at the stderr/stdout file somewhere under
>> target/org.apache.hadoop.mapred.MiniMRCluster. Look for the one under the
>> directory whose name contains the app id.
>>
>> I did run into a similar problem and the stderr said:
>> /bin/bash: /bin/java: No such file or directory
>>
>> It was because JAVA_HOME was not set. But in this case the exit code was
>> 127 (shell not being able to locate the command to exec). In the hudson
>> job, the exit code was 1, so I think it's something else.
>>
>> Kihwal
>>
>> On 10/29/12 11:56 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
>>
>> >TestRowCounter still fails:
>> >
>> >https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/244/testReport/junit/org.apache.hadoop.hbase.mapreduce/TestRowCounter/testRowCounterNoColumn/
>> >
>> >but there was no 'divide by zero' exception.
>> >
>> >Cheers
>> >
>> >On Thu, Oct 25, 2012 at 8:04 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >
>> >> I will try 2.0.2-alpha release.
>> >>
>> >> Cheers
>> >>
>> >>
>> >> On Thu, Oct 25, 2012 at 7:54 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >>
>> >>> Thanks for the quick response, Robert.
>> >>> Here is the hadoop version being used:
>> >>>     <hadoop-two.version>2.0.1-alpha</hadoop-two.version>
>> >>>
>> >>> If there is a newer release, I am willing to try that before filing a
>> >>> JIRA.
>> >>>
>> >>>
>> >>> On Thu, Oct 25, 2012 at 7:07 AM, Robert Evans
>> >>><evans@yahoo-inc.com>wrote:
>> >>>
>> >>>> It looks like you are running with an older version of 2.0, even
>> >>>> though it does not really make much of a difference in this case.
>> >>>> The issue shows up when getLocalPathForWrite thinks there is no
>> >>>> space to write to on any of the disks it has configured. This could
>> >>>> be because you do not have any directories configured. I really
>> >>>> don't know for sure exactly what is happening. It might be disk
>> >>>> fail-in-place removing disks for you because of other issues. Either
>> >>>> way, we should file a JIRA against Hadoop to make it so we never get
>> >>>> the / by zero error and provide a better way to handle the possible
>> >>>> causes.
>> >>>>
>> >>>> --Bobby Evans
>> >>>>
>> >>>> On 10/24/12 11:54 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
>> >>>>
>> >>>> >Hi,
>> >>>> >HBase has Jenkins build against hadoop 2.0
>> >>>> >I was checking why TestRowCounter sometimes failed:
>> >>>> >
>> >>>> >https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/231/testReport/org.apache.hadoop.hbase.mapreduce/TestRowCounter/testRowCounterExclusiveColumn/
>> >>>> >
>> >>>> >I think the following could be the cause:
>> >>>> >
>> >>>> >2012-10-22 23:46:32,571 WARN  [AsyncDispatcher event handler]
>> >>>> >resourcemanager.RMAuditLogger(255): USER=jenkins	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1350949562159_0002 failed 1 times due to AM Container for appattempt_1350949562159_0002_000001 exited with exitCode: -1000 due to: java.lang.ArithmeticException: / by zero
>> >>>> >       at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:355)
>> >>>> >       at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
>> >>>> >       at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
>> >>>> >       at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
>> >>>> >       at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForWrite(LocalDirsHandlerService.java:257)
>> >>>> >       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:849)
>> >>>> >
>> >>>> >However, I don't seem to find where in getLocalPathForWrite()
>> >>>> >division by zero could have arisen.
>> >>>> >
>> >>>> >Comment / hint is welcome.
>> >>>> >
>> >>>> >Thanks
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>
>
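As a footnote to Bobby's point in the quoted thread about having no directories configured: on a standalone NodeManager the candidate directories that LocalDirAllocator totals up come from yarn.nodemanager.local-dirs in yarn-site.xml (MiniMRCluster normally sets this up itself; the paths below are placeholders, not a recommendation). An empty or fully-failed list there would plausibly leave totalAvailable at 0:

```xml
<!-- yarn-site.xml: local directories the NodeManager localizes into.
     Paths are illustrative; adjust for your cluster. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/1/yarn/local,/data/2/yarn/local</value>
</property>
```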
