hbase-user mailing list archives

From: Bradford Stephens <bradfordsteph...@gmail.com>
Subject: Re: HBase Failing on Large Loads
Date: Fri, 12 Jun 2009 02:07:38 GMT
OK, so I discovered the ulimit wasn't changed like I thought it was;
I had to fool with PAM in Ubuntu.
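(For anyone else chasing this on Ubuntu: the usual recipe is to raise the
nofile limit in /etc/security/limits.conf and make sure pam_limits is
actually loaded. A sketch of what that looks like; the "hadoop" user name
and the 32768 value are just examples, adjust for your own setup:

  # /etc/security/limits.conf
  hadoop  soft  nofile  32768
  hadoop  hard  nofile  32768

  # /etc/pam.d/common-session -- without this, the limits above are ignored
  session required pam_limits.so
)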

Everything's running a little better, and I cut the data size by 66%.

It took a while, but one of the machines with only 2 cores failed, and
I caught it in the moment. Then 2 other machines failed a few minutes
later in a cascade. I'm thinking that HBase + Hadoop takes up so much
proc time that the machine gradually stops responding to heartbeats...
does that seem rational?
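(If it really is CPU starvation, one stopgap might be a longer lease so the
master doesn't expire busy regionservers so quickly. I'm going from memory
on the 0.19 property name, so check it against hbase-default.xml before
trusting this sketch:

  <!-- conf/hbase-site.xml -->
  <property>
    <name>hbase.master.lease.period</name>
    <value>180000</value>
    <description>Ms a regionserver may go without reporting to the master
      before its lease expires; longer masks load spikes, slower failover.
    </description>
  </property>
)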

Here's the first regionserver log: http://pastebin.com/m96e06fe
I wish I could attach the log of one of the regionservers that failed
a few minutes later, but it's 708MB! Here are some examples from the tail:

 2009-06-11 19:00:18,418 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report
to master for 906196 milliseconds - retrying
2009-06-11 19:00:18,419 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: error getting
store file index size for 944890031/url:
java.io.FileNotFoundException: File does not exist:
hdfs://dttest01:54310/hbase-0.19/joinedcontent/944890031/url/mapfiles/2512503149715575970/index

The HBase Master log is surprisingly quiet...

Overall, I think HBase just isn't happy on a machine with two
single-core procs, and when they start dropping like flies, everything
goes to hell. Do my log files support this?

Cheers,
Bradford


On Wed, Jun 10, 2009 at 4:01 PM, Ryan Rawson<ryanobjc@gmail.com> wrote:
> Hey,
>
> Looks like you have some HDFS issues.
>
> Things I did to make myself stable:
>
> - run HDFS with -Xmx2000m
> - run HDFS with a 2047 xciever limit (goes into hdfs-site.xml or
> hadoop-site.xml)
> - ulimit -n 32k - also important
>
> With this I find that HDFS is very stable, I've imported hundreds of gigs.
>
> You want to make sure the HDFS xciever limit is set in the hadoop/conf
> directory, copied to every node and HDFS restarted.  Also sounds like you
> might have a cluster with multiple versions of hadoop.  Double check that!
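(For the archive: here's roughly how I read those three settings on disk
for an 0.19-era install. File names and layout are the usual defaults, not
anything verified against this cluster, so treat it as a sketch:

  # conf/hadoop-env.sh -- heap for the HDFS daemons, in MB
  export HADOOP_HEAPSIZE=2000

  <!-- conf/hadoop-site.xml (hdfs-site.xml on later releases) -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2047</value>
  </property>

  # shell limit for the user running the daemons, or via limits.conf as above
  ulimit -n 32768
)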
>
> you're close!
> -ryan
>
> On Wed, Jun 10, 2009 at 3:32 PM, Bradford Stephens <bradfordstephens@gmail.com> wrote:
>
>> Thanks so much for all the help, everyone... things are still broken,
>> but maybe we're getting close.
>>
>> All the regionservers were dead by the time the job ended.  I see
>> quite a few error messages like this:
>>
>> (I've put the entirety of the regionserver logs on pastebin:)
>> http://pastebin.com/m2e6f9283
>> http://pastebin.com/mf97bd57
>>
>> 2009-06-10 14:47:54,994 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: unable to process
>> message: MSG_REGION_OPEN:
>> joinedcontent,1DCC1616F7C7B53B69B5536F407A64DF,1244667570521:
>> safeMode=false
>> java.lang.NullPointerException
>>
>> There's also a scattering of messages like this:
>> 2009-06-10 13:49:02,855 WARN
>> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 1 on
>> 60020 took 3267ms appending an edit to HLog; editcount=21570
>>
>> aaand....
>>
>> 2009-06-10 14:03:27,270 INFO
>> org.apache.hadoop.hbase.regionserver.HLog: Closed
>>
>> hdfs://dttest01:54310/hbase-0.19/log_192.168.18.49_1244659862699_60020/hlog.dat.1244667757560,
>> entries=100006. New log writer:
>> /hbase-0.19/log_192.168.18.49_1244659862699_60020/hlog.dat.1244667807249
>> 2009-06-10 14:03:28,160 INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Bad connect
>> ack with firstBadLink 192.168.18.47:50010
>> 2009-06-10 14:03:28,160 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_4831127457964871573_140781
>> 2009-06-10 14:03:34,170 INFO org.apache.hadoop.hdfs.DFSClient:
>> Exception in createBlockOutputStream java.io.IOException: Could not
>> read from stream
>> 2009-06-10 14:03:34,170 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_-6169186743102862627_140796
>> 2009-06-10 14:03:34,485 INFO
>> org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Forced flushing
>> of joinedcontent,1F2F64F59088A3B121CFC66F7FCBA2A9,1244667654435
>> because global memcache limit of 398.7m exceeded; currently 399.0m and
>> flushing till 249.2m
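(Side note: that 398.7m ceiling is a fixed fraction of the regionserver heap,
so with the default ~1GB heap it trips constantly under a bulk load. Two knobs
to look at: the heap itself, and the fraction. I'm not certain of the exact
0.19 property name, so verify it against hbase-default.xml:

  # conf/hbase-env.sh -- regionserver/master heap, in MB
  export HBASE_HEAPSIZE=2000

  <!-- conf/hbase-site.xml: share of the heap all memcaches together may use -->
  <property>
    <name>hbase.regionserver.globalMemcache.upperLimit</name>
    <value>0.4</value>
  </property>
)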
>>
>> Finally, I saw this when I stopped and re-started my cluster:
>>
>> 2009-06-10 15:29:09,494 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(192.168.18.16:50010,
>> storageID=DS-486600617-192.168.18.16-50010-1241838200467,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.IOException: Version Mismatch
>>        at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:81)
>>        at java.lang.Thread.run(Thread.java:619)
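(That Version Mismatch looks like exactly the mixed-Hadoop-versions case Ryan
mentioned: the datanode and whatever is talking to it disagree on the
DataXceiver protocol version. A quick way to audit the cluster, assuming
passwordless ssh and a conf/slaves file listing every node; paths below are
placeholders for wherever Hadoop lives:

  for h in $(cat /usr/local/hadoop/conf/slaves); do
    echo "== $h"
    ssh "$h" '/usr/local/hadoop/bin/hadoop version | head -1'
  done
)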
>>
>>
>> On Wed, Jun 10, 2009 at 2:55 PM, Ryan Rawson<ryanobjc@gmail.com> wrote:
>> > That is a client exception that is a sign of problems on the
>> > regionserver...is it still running? What do the logs look like?
>> >
>> > On Jun 10, 2009 2:51 PM, "Bradford Stephens" <bradfordstephens@gmail.com>
>> > wrote:
>> >
>> > OK, I've tried all the optimizations you've suggested (still running
>> > with a M/R job). Still having problems like this:
>> >
>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>> > contact region server 192.168.18.15:60020 for region
>> > joinedcontent,242FEB3ED9BE0D8EF3856E9C4251464C,1244666594390, row
>> > '291DB5C7440B0A5BDB0C12501308C55B', but failed after 10 attempts.
>> > Exceptions:
>> > java.io.IOException: Call to /192.168.18.15:60020 failed on local
>> > exception: java.io.EOFException
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> > java.net.ConnectException: Call to /192.168.18.15:60020 failed on
>> > connection exception: java.net.ConnectException: Connection refused
>> >
>> > On Wed, Jun 10, 2009 at 12:40 AM, stack<stack@duboce.net> wrote:
>> > > On Tue, Jun 9, 2009 at 11:51 AM, ...
>> >
>>
>
