What are your VM/GC settings? Let's tune that!
On Jun 11, 2009 7:08 PM, "Bradford Stephens" <bradfordstephens@gmail.com>
wrote:
OK, so I discovered the ulimit wasn't changed like I thought it was,
had to fool with PAM in Ubuntu.
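One way to confirm the limit really took effect is to ask a process for the limits it actually sees. The following is only a minimal sketch using Python's standard `resource` module; the helper names and the threshold passed in are hypothetical, and it reports limits for the process running the script, so it would need to be run as the same user and through the same login path as the HBase daemons to say anything about them.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits seen by this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_at_least(minimum):
    """True if the soft open-file limit is at least `minimum`."""
    soft, _hard = open_file_limits()
    # RLIM_INFINITY means "no limit", which satisfies any minimum.
    return soft == resource.RLIM_INFINITY or soft >= minimum


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
    # 32768 is an arbitrary placeholder here, not a value from this thread.
    print("meets placeholder minimum:", limit_is_at_least(32768))
```

If the soft value printed is still the distribution default, the PAM change has not reached that session yet.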
Everything's running a little better, and I cut the data size by 66%.
It took a while, but one of the machines with only 2 cores failed, and
I caught it in the moment. Then 2 other machines failed a few minutes
later in a cascade. I'm thinking that HBase + Hadoop takes up so much
proc time that the machine gradually stops responding to heartbeat....
does that seem rational?
Here's the first regionserver log: http://pastebin.com/m96e06fe
I wish I could attach the log of one of the regionservers that failed
a few minutes later, but it's 708MB! Here are some examples of the tail:
2009-06-11 19:00:18,418 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report
to master for 906196 milliseconds - retrying
2009-06-11 19:00:18,419 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: error getting
store file index size for 944890031/url:
java.io.FileNotFoundException: File does not exist:
hdfs://dttest01:54310/hbase-0.19/joinedcontent/944890031/url/mapfiles/2512503149715575970/index
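Since the full log is too big to share, a small script can pull just the "unable to report to master" delays out of its tail. This is a rough sketch, not anything HBase ships: the function names and the `max_bytes` default are made up for illustration, and the pattern is based only on the WARN line quoted above.

```python
import re

# Matches the WARN message quoted above; \s+ tolerates email line wrapping.
REPORT_DELAY = re.compile(
    r"unable\s+to\s+report\s+to\s+master\s+for\s+(\d+)\s+milliseconds"
)


def report_delays_ms(log_text):
    """Return every reported master-report delay, in milliseconds."""
    return [int(m.group(1)) for m in REPORT_DELAY.finditer(log_text)]


def tail_text(path, max_bytes=1_000_000):
    """Read at most the last `max_bytes` of a file without loading it all."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the end of the file
        size = f.tell()
        f.seek(max(0, size - max_bytes))
        return f.read().decode("utf-8", errors="replace")


sample = (
    "2009-06-11 19:00:18,418 WARN "
    "org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report "
    "to master for 906196 milliseconds - retrying"
)

for ms in report_delays_ms(sample):
    print(f"{ms} ms = {ms / 60000:.1f} min")  # prints "906196 ms = 15.1 min"
```

On the real file, `report_delays_ms(tail_text(path))` would give the same list for the last megabyte or so. A delay of about 15 minutes means the regionserver had been out of contact with the master for far longer than a brief stall.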
The HBase Master log is surprisingly quiet...
Overall, I think HBase just isn't happy on a machine with two
single-core procs, and when they start dropping like flies, everything
goes to hell. Do my log files support this?
Cheers,
Bradford
On Wed, Jun 10, 2009 at 4:01 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> Hey,
>
> Looks like you h...