hbase-user mailing list archives

From Tatsuya Kawano <tatsuy...@snowcocoa.info>
Subject Re: HBase 0.20.1 on Ubuntu 9.04: master fails to start
Date Thu, 29 Oct 2009 22:35:25 GMT
Hi Artyom,

Thanks for the extra info; it is very helpful.

>> So, for problem #3, your master and client logs show that the
>> HBase region server is not responding on port 53169. However, they
>> don't say why it's not responding. You should have a region server
>> log in the logs directory as well, so could you check it for any
>> error messages?

> I checked the regionserver logs, but the regionserver was not started:
>
>> 2009-10-29 13:34:13,754 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Not starting a distinct
>> region server because hbase.cluster.distributed is false

Sorry, I was wrong. You're running HBase in pseudo-distributed mode,
so the region server doesn't run as a separate process; it is embedded
in the same process that runs the HBase master. The master server's
log actually contains the region server's log as well.
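
(For reference, the process layout is controlled by this property in
"conf/hbase-site.xml"; false is the default, and it is what gives you
the embedded region server:)

  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>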


> Full hbase and hadoop logs can be found in my post at Thu, 29 Oct, 07:52

I hadn't noticed those attachments. I've checked them now, and found
the following in "hbase-hbase-master-localhost.log.2009-10-28":

==========================================================================
2009-10-28 15:26:36,888 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 5.0
2009-10-28 15:26:37,114 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:53169, regionname: -ROOT-,,0, startKey: <>}
2009-10-28 15:26:37,135 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:53169, regionname: -ROOT-,,0, startKey: <>} complete
2009-10-28 15:26:37,311 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 127.0.0.1:53169, regionname: .META.,,1, startKey: <>}
2009-10-28 15:26:37,382 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 3 row(s) of meta region {server: 127.0.0.1:53169, regionname: .META.,,1, startKey: <>} complete
2009-10-28 15:26:37,382 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2009-10-28 15:26:58,670 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -1247814523023719701 lease expired
2009-10-28 15:27:01,847 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 53169
2009-10-28 15:27:01,848 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 53169
2009-10-28 15:27:01,848 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 53169: exiting
2009-10-28 15:27:01,960 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-10-28 15:27:02,009 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 53169: exiting
2009-10-28 15:27:02,025 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 53169: exiting
2009-10-28 15:27:02,025 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 53169: exiting
2009-10-28 15:27:02,026 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
2009-10-28 15:27:02,039 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 53169: exiting
2009-10-28 15:27:02,039 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 53169: exiting
2009-10-28 15:27:02,039 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 53169: exiting
2009-10-28 15:27:02,039 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 53169: exiting
2009-10-28 15:27:02,481 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_5009444910783378943_2681 from any node: java.io.IOException: No live nodes contain current block
2009-10-28 15:27:02,504 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: RegionServer:0.cacheFlusher exiting
2009-10-28 15:27:02,505 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: RegionServer:0.majorCompactionChecker exiting
2009-10-28 15:27:02,505 INFO org.apache.hadoop.hbase.regionserver.LogFlusher: RegionServer:0.logFlusher exiting
2009-10-28 15:27:02,505 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: RegionServer:0.compactor exiting
2009-10-28 15:27:02,505 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2009-10-28 15:27:02,702 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_6684004629716332214_2681 from any node: java.io.IOException: No live nodes contain current block
2009-10-28 15:27:03,578 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
2009-10-28 15:27:14,816 WARN org.apache.hadoop.hbase.regionserver.Store: Not in setorg.apache.hadoop.hbase.regionserver.StoreScanner@3cf294
2009-10-28 15:27:14,828 WARN org.apache.hadoop.hbase.regionserver.Store: Not in setorg.apache.hadoop.hbase.regionserver.StoreScanner@1611ef6
2009-10-28 15:27:14,828 WARN org.apache.hadoop.hbase.regionserver.Store: Not in setorg.apache.hadoop.hbase.regionserver.StoreScanner@137c0c7
2009-10-28 15:27:14,828 WARN org.apache.hadoop.hbase.regionserver.Store: Not in setorg.apache.hadoop.hbase.regionserver.StoreScanner@15151a5
2009-10-28 15:27:14,829 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-1247814523023719701, 1) from 127.0.0.1:41489: output error
2009-10-28 15:27:14,830 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 53169 caught: java.nio.channels.ClosedChannelException
	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
	at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)

2009-10-28 15:27:14,830 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 53169: exiting
2009-10-28 15:27:36,888 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 5.0
==========================================================================


Your region server gracefully shut itself down because it detected a
lease expiration on a scanner. Lease expiration means it lost a
heartbeat signal from other parts of the cluster, which indicates a
serious failure in the cluster.

A lease can expire because of a network problem, or because a cluster
process was killed or stalled. Since this occurred while the region
server was busy working through a series of SingleColumnValueFilters
in a low-RAM environment, some other cluster processes may have been
swapped out of RAM and effectively stalled.
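
(If you want a little more headroom while you sort out the memory
pressure, the scanner lease length is configurable. A sketch for
"conf/hbase-site.xml" -- I'm assuming the 0.20 property name and its
60-second default here, so please double-check before relying on it:

  <property>
    <!-- region server / scanner lease, in milliseconds -->
    <name>hbase.regionserver.lease.period</name>
    <value>120000</value>
  </property>

This only masks the symptom, though; the real fix is the memory.)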

(By the way, it's a bit confusing that the HBase master was still
reporting 1 region server available when there wasn't one. I guess
the master got confused because of the pseudo-distributed mode.)


> I'll try to increase the RAM capacity, and then I'll write here about the results

Adding more RAM to the server will definitely help. Also, I have
never tried this and it may be bad advice, but you could decrease the
HBase and Hadoop processes' Java VM heap sizes so that they avoid
being swapped out of RAM and the leases (hopefully) never expire. You
can configure the heap size in "hbase-env.sh" and "hadoop-env.sh",
but do this with caution -- decreasing the heap size too far can
cause an OutOfMemoryError or a serious slowdown of the process, and
you could end up in the same (or an even worse) situation than you
have now.
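
For example, just as a sketch (these variable names are in the stock
0.20-era config files; 1000 MB is the usual default, and the 600
below is an arbitrary illustration, not a recommendation):

  # conf/hbase-env.sh -- maximum heap for the HBase daemons, in MB
  export HBASE_HEAPSIZE=600

  # conf/hadoop-env.sh -- maximum heap for the Hadoop daemons, in MB
  export HADOOP_HEAPSIZE=600

The goal is for the sum of all the daemons' heaps to stay comfortably
below physical RAM, so that nothing gets swapped out.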


Also, I saw you have many column families in your table "channel_products":

-- active
-- channel_category_id
-- channel_id
-- contract_id
-- shop_category_id
-- shop_id
-- shop_product_id
-- created_at
-- updated_at

Try to have fewer. You can group them into a smaller number of column
families so that the region server can access each column faster and
consume less memory. Keep in mind that each column family is stored
in its own set of files on disk, so you can improve the locality of
related columns by grouping them into the same column family.
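
For example, here is a minimal sketch of what the write path could
look like with a single family. (The table name "channel_products_v2"
and the family name "attrs" are made up for illustration; the calls
are from the 0.20 client API.)

  // Create a table with one column family instead of nine.
  HBaseConfiguration conf = new HBaseConfiguration();
  HBaseAdmin admin = new HBaseAdmin(conf);
  HTableDescriptor desc = new HTableDescriptor("channel_products_v2");
  desc.addFamily(new HColumnDescriptor("attrs"));
  admin.createTable(desc);

  // The old per-attribute families become column qualifiers in "attrs".
  HTable table = new HTable(conf, "channel_products_v2");
  Put put = new Put(Bytes.toBytes("some-row-key"));
  put.add(Bytes.toBytes("attrs"), Bytes.toBytes("shop_id"), Bytes.toBytes("42"));
  put.add(Bytes.toBytes("attrs"), Bytes.toBytes("channel_id"), Bytes.toBytes("7"));
  table.put(put);

Your SingleColumnValueFilters would then take Bytes.toBytes("attrs")
as the family and, say, Bytes.toBytes("shop_id") as the qualifier,
instead of one family per attribute with an empty qualifier.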

Good luck,

-- 
Tatsuya Kawano (Mr.)
Tokyo, Japan



On Thu, Oct 29, 2009 at 9:50 PM, Artyom Shvedchikov <shoolc@gmail.com> wrote:
> Dear Tatsuya,
>
>> 1. Delete hadoop data directory
>> 2. bin/hadoop namenode -format
>> 3. bin/start-all.sh
>>    -> the namenode will start immediately and go into service, but the
>> datanode will make a long pause (almost seven minutes) in the middle
>> of its startup.
>>
>> 4. Before the datanode becomes ready, do an HDFS write operation
>> (e.g. "bin/hadoop fs -put conf input"), and the write operations
>> will fail with the following error:
>>
>
> Today I tried restarting Hadoop and HBase, skipping step #1 and step #2.
> First I stopped HBase, then Hadoop; then I started Hadoop, waited for 10
> minutes, and started HBase - it worked. Data was not lost and was still
> available to read, etc. Then I tried to scan the table with 6 000 000 rows
> several times, and HBase hung again with the same exceptions as in my
> previous post (see post at Thu, 29 Oct, 10:06).
>
> hbase(main):006:0> list
> NativeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 127.0.0.1:57613 for region .META.,,1, row '', but failed after 5 attempts.
> Exceptions:
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
>
>     from org/apache/hadoop/hbase/client/HConnectionManager.java:1001:in `getRegionServerWithRetries'
>     from org/apache/hadoop/hbase/client/MetaScanner.java:55:in `metaScan'
>     from org/apache/hadoop/hbase/client/MetaScanner.java:28:in `metaScan'
>     from org/apache/hadoop/hbase/client/HConnectionManager.java:432:in `listTables'
>     from org/apache/hadoop/hbase/client/HBaseAdmin.java:127:in `listTables'
>     from sun/reflect/NativeMethodAccessorImpl.java:-2:in `invoke0'
>     from sun/reflect/NativeMethodAccessorImpl.java:39:in `invoke'
>     from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
>     from java/lang/reflect/Method.java:597:in `invoke'
>     from org/jruby/javasupport/JavaMethod.java:298:in `invokeWithExceptionHandling'
>     from org/jruby/javasupport/JavaMethod.java:259:in `invoke'
>     from org/jruby/java/invokers/InstanceMethodInvoker.java:36:in `call'
>     from org/jruby/runtime/callsite/CachingCallSite.java:70:in `call'
>     from org/jruby/ast/CallNoArgNode.java:61:in `interpret'
>     from org/jruby/ast/ForNode.java:104:in `interpret'
>     from org/jruby/ast/NewlineNode.java:104:in `interpret'
> ... 110 levels...
>     from hadoop/hbase/bin/$_dot_dot_/bin/hirb#start:-1:in `call'
>     from org/jruby/internal/runtime/methods/DynamicMethod.java:226:in `call'
>     from org/jruby/internal/runtime/methods/CompiledMethod.java:211:in `call'
>     from org/jruby/internal/runtime/methods/CompiledMethod.java:71:in `call'
>     from org/jruby/runtime/callsite/CachingCallSite.java:253:in `cacheAndCall'
>     from org/jruby/runtime/callsite/CachingCallSite.java:72:in `call'
>     from hadoop/hbase/bin/$_dot_dot_/bin/hirb.rb:497:in `__file__'
>     from hadoop/hbase/bin/$_dot_dot_/bin/hirb.rb:-1:in `load'
>     from org/jruby/Ruby.java:577:in `runScript'
>     from org/jruby/Ruby.java:480:in `runNormally'
>     from org/jruby/Ruby.java:354:in `runFromMain'
>     from org/jruby/Main.java:229:in `run'
>     from org/jruby/Main.java:110:in `run'
>     from org/jruby/Main.java:94:in `main'
>     from /hadoop/hbase/bin/../bin/hirb.rb:338:in `list'
>     from (hbase):7
>
> hbase(main):007:0> status
> 0 servers, 0 dead, NaN average load
> hbase(main):008:0> exit
>>
>
> Full hbase and hadoop logs can be found in my post at Thu, 29 Oct, 07:52
>
> The main issue for now is that HBase hangs every time I scan the table
> (on the second or third attempt). By the way, this time it was enough to
> restart HBase only, and it became available again for scan/get/put
> operations.
>
> Table structure:
>
> hbase(main):003:0> describe 'channel_products'
> DESCRIPTION (ENABLED => true):
>  {NAME => 'channel_products', FAMILIES => [
>   {NAME => 'active', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'channel_category_id', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'channel_id', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'contract_id', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'created_at', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'shop_category_id', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'shop_id', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'shop_product_id', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>   {NAME => 'updated_at', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> 1 row(s) in 0.0630 seconds
>>
>
> The table contains ~ 6 000 000 rows; each value is a String.
>
> Code to scan the table:
>
> protected void doGet(HttpServletRequest request, HttpServletResponse response)
>     throws ServletException, IOException {
>   Date startDate = new Date();
>   Date finishDate;
>   log(startDate + ": Get activation status started");
>
>   String shop_id = request.getParameter("shop_id");
>   String[] shop_product_ids = request.getParameterValues("shop_product_ids");
>   if (shop_product_ids != null && shop_product_ids.length == 1) {
>     shop_product_ids = shop_product_ids[0].split(",");
>   }
>   String channel_id = request.getParameter("channel_id");
>   String channel_category_id = request.getParameter("channel_category_id");
>
>   String tableName = "channel_products";
>   StringBuffer result = new StringBuffer("<?xml version=\"1.0\"?>");
>
>   if (this.admin.tableExists(tableName)) {
>     result.append("<result>");
>     HTable table = new HTable(this.configuration, tableName);
>     Scan scan = new Scan();
>
>     // All added filters must pass (MUST_PASS_ALL is the FilterList default).
>     FilterList mainFilterList = new FilterList();
>     if (shop_id != null) {
>       mainFilterList.addFilter(new SingleColumnValueFilter(
>           Bytes.toBytes("shop_id"), Bytes.toBytes(""),
>           CompareFilter.CompareOp.EQUAL, Bytes.toBytes(shop_id)));
>     }
>     if (channel_id != null) {
>       mainFilterList.addFilter(new SingleColumnValueFilter(
>           Bytes.toBytes("channel_id"), Bytes.toBytes(""),
>           CompareFilter.CompareOp.EQUAL, Bytes.toBytes(channel_id)));
>     }
>     if (channel_category_id != null) {
>       mainFilterList.addFilter(new SingleColumnValueFilter(
>           Bytes.toBytes("channel_category_id"), Bytes.toBytes(""),
>           CompareFilter.CompareOp.EQUAL, Bytes.toBytes(channel_category_id)));
>     }
>
>     // Any one of the requested product ids may match.
>     if (shop_product_ids != null && shop_product_ids.length > 0) {
>       List<Filter> filterList = new ArrayList<Filter>();
>       for (String shop_product_id : shop_product_ids) {
>         filterList.add(new SingleColumnValueFilter(
>             Bytes.toBytes("shop_product_id"), Bytes.toBytes(""),
>             CompareFilter.CompareOp.EQUAL, Bytes.toBytes(shop_product_id)));
>       }
>       mainFilterList.addFilter(
>           new FilterList(FilterList.Operator.MUST_PASS_ONE, filterList));
>     }
>
>     scan.setFilter(mainFilterList);
>     ResultScanner scanner = null;
>     try {
>       scanner = table.getScanner(scan);
>       for (Result item : scanner) {
>         getItemXml(result, item);
>       }
>     } catch (Exception e) {
>       logError("Error during table scan: ", e);
>       result.append("<error>").append("Error during table scan: " + e).append("</error>");
>     } finally {
>       if (scanner != null) {
>         scanner.close();
>       }
>       result.append("</result>");
>     }
>   } else {
>     result.append("<result>").append("Table " + tableName + " does not exist!").append("</result>");
>   }
>
>   finishDate = new Date();
>   log(finishDate + ": Get activation status finished, duration: "
>       + (finishDate.getTime() - startDate.getTime()) + " ms");
>   response.getOutputStream().print(result.toString());
> }
>
> I checked the regionserver logs, but the regionserver was not started:
>
>> 2009-10-29 13:34:13,754 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Not starting a distinct
>> region server because hbase.cluster.distributed is false
>
> HBase and Hadoop were configured according to the "Getting Started"
> section on *hadoop.org*. They are both started in pseudo-distributed mode.
> Maybe I should set *hbase.cluster.distributed* to true?
> I'll try to increase the RAM capacity, and then I'll write here about the results
> -------------------------------------------------
> Best wishes, Artyom Shvedchikov
>
