hbase-user mailing list archives

From Vidhyashankar Venkataraman <vidhy...@yahoo-inc.com>
Subject Re: More on Large scale Scans...
Date Wed, 28 Jul 2010 18:03:01 GMT
Yeah, I got it in some of them:
2010-07-27 22:53:31,703 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a0190c21e0030
to sun.nio.ch.SelectionKeyImpl@7971f189
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:53:31,937 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats:
Sizes: Total=1847.005MB (1936725152), Free=552.8263MB (579680352), Max=2399.8313MB (2516405504),
Counts: Blocks=1811, Access=1332343, Hit=1811, Miss=1330532, Evictions=3177, Evicted=759144,
Ratios: Hit Ratio=0.13592595933005214%, Miss Ratio=99.86407160758972%, Evicted/Run=238.949951171875
2010-07-27 22:53:31,936 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
next(-8452685834063022439, 30) from 98.136.191.84:57851: output error
2010-07-27 22:53:31,827 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
next(-5892917023728878793, 30) from 98.136.191.84:57817: output error
2010-07-27 22:53:31,998 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
next(-299811806597554323, 30) from 98.136.191.84:57840: output error
2010-07-27 22:53:31,822 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
next(-2595212217395389635, 30) from 98.136.191.84:57829: output error
2010-07-27 22:53:31,718 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 141975ms, ten
times longer than scheduled: 10000
2010-07-27 22:53:32,144 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60020
caught: java.nio.channels.ClosedChannelException
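
For reference, the two timeouts in play in that pattern (the long Sleeper pause followed by the ZooKeeper TIMED OUT) can be read back from the configuration. This is only a sketch, assuming the 0.20.x property names; the fallback values below are just the usual defaults, not what this cluster actually runs, and the authoritative values are whatever hbase-site.xml the region servers were started with:

    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ShowScanTimeouts {
        public static void main(String[] args) {
            HBaseConfiguration conf = new HBaseConfiguration();
            // ZooKeeper session timeout: a GC pause longer than this expires the
            // region server's session and leads to the shutdown seen above.
            int zkTimeout = conf.getInt("zookeeper.session.timeout", 60000);
            // Scanner lease period: must outlive the gap between next() calls
            // while a map task processes each batch of rows.
            long lease = conf.getLong("hbase.regionserver.lease.period", 60000L);
            System.out.println("zookeeper.session.timeout = " + zkTimeout + " ms");
            System.out.println("hbase.regionserver.lease.period = " + lease + " ms");
        }
    }

If a GC pause really is what keeps the server away for 141975ms, raising zookeeper.session.timeout on the region servers (and giving the heap/GC more headroom) is the usual knob to try first.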

Vidhya


On 7/28/10 10:51 AM, "Ted Yu" <yuzhihong@gmail.com> wrote:

For the region servers that went down, do you find the following in their logs?

2010-06-17 22:22:47,414 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
55448ms, ten times longer than scheduled: 3000

On Wed, Jul 28, 2010 at 10:30 AM, Vidhyashankar Venkataraman <vidhyash@yahoo-inc.com> wrote:

> I reran my scans on a 180-node cluster (with 160 RS's/data nodes/task
> trackers) using TableInputFormat with the block cache disabled, and I got the
> following results/observations:
>
>  1.  90% of the job completed in roughly 5 and a half hours, which comes to
> 29 MBps per node (counting 160 RS's).. Is this reasonable? Can I get better
> results?
>
>    2.  The last 10% of the job has still been running for more than 3
> hours.. I am trying to find a common thread in why this is
> happening.
>
>    3. I tried disabling the block cache both in the MR job (I was using
> TableMapReduceUtil: see the code snippet below) and through the hbase shell,
> just to change the table's default options (disable the table, then change the
> family options with alter)..
>
>    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
>    Job job = new Job(conf, "Scan" + "_" + tableName);
>    ..........
>    Scan scan = new Scan();
>    scan.addColumns("bigColumn");
>    scan.setCaching(10);
>    scan.setCacheBlocks(false);   // setting it again; I guess this overrides the conf setting above
>    // First argument is the table name.
>    TableMapReduceUtil.initTableMapperJob(tableName, scan,
>      MapperHB.class, Text.class, NullWritable.class, job);
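>
> (If I read TableInputFormat right, once a Scan object is handed to
> initTableMapperJob it gets serialized into the job configuration and the
> individual SCAN_* properties are ignored, so scan.setCacheBlocks(false) is the
> setting that actually counts. A hedged alternative, just a sketch that has not
> been run here: drive the scan entirely through TableInputFormat properties and
> skip building the Scan by hand, so the cache-blocks flag can only come from one
> place.)
>
>    // Sketch only; assumes the same conf, tableName, MapperHB and column as above.
>    conf.set(TableInputFormat.INPUT_TABLE, tableName);
>    conf.set(TableInputFormat.SCAN_COLUMNS, "bigColumn");
>    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
>    conf.set(TableInputFormat.SCAN_CACHEDROWS, "10");
>    Job job = new Job(conf, "Scan" + "_" + tableName);
>    job.setInputFormatClass(TableInputFormat.class);
>    job.setMapperClass(MapperHB.class);
>    job.setMapOutputKeyClass(Text.class);
>    job.setMapOutputValueClass(NullWritable.class);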
>
> But I still get LruBlockCache stats like the following. Is this okay?
>
> 2010-07-27 14:13:40,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats:
> Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504),
> Counts: Blocks=1914, Access=1304510, Hit=1914, Miss=1302596, Evictions=3300, Evicted=788515,
> Ratios: Hit Ratio=0.1467217574827373%, Miss Ratio=99.8532772064209%, Evicted/Run=238.94393920898438
> 2010-07-27 14:14:40,860 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats:
> Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504),
> Counts: Blocks=1914, Access=1306103, Hit=1914, Miss=1304189, Evictions=3300, Evicted=788515,
> Ratios: Hit Ratio=0.14654280385002494%, Miss Ratio=99.85345602035522%, Evicted/Run=238.94393920898438
> 2010-07-27 14:15:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats:
> Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504),
> Counts: Blocks=1914, Access=1307783, Hit=1914, Miss=1305869, Evictions=3300, Evicted=788515,
> Ratios: Hit Ratio=0.14635456027463078%, Miss Ratio=99.8536467552185%, Evicted/Run=238.94393920898438
> ................
> .................
> 2010-07-27 22:45:40,853 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss
> Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:46:40,854 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss
> Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:47:40,853 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss
> Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:48:40,853 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss
> Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:49:40,876 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1348659, Hit=1914, Miss=1346745,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14191875234246254%, Miss
> Ratio=99.8580813407898%, Evicted/Run=238.94393920898438
> 2010-07-27 22:50:41,140 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1349089, Hit=1914, Miss=1347175,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1418735133484006%, Miss
> Ratio=99.85812902450562%, Evicted/Run=238.94393920898438
>
>
>    4.  A small number of region servers (8 out of 150) crashed (more of a
> graceful shutdown, really) while performing the scan, and I saw these messages
> just before the shutdown started:
>
> 2010-07-27 22:52:57,448 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x2a0190c221002c to sun.nio.ch.SelectionKeyImpl@286e4365
> java.io.IOException: TIMED OUT
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-07-27 22:52:57,454 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss
> Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
> 2010-07-27 22:52:57,454 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
> Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB
> (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572,
> Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss
> Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
> 2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call next(-30527588457879943, 30) from 98.136.191.88:33914:
> output error
> 2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call next(-7358804533352185527, 30) from 98.136.191.88:33963:
> output error
> 2010-07-27 22:52:57,602 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call next(-3402818986992405350, 30) from 98.136.191.88:33927:
> output error
> 2010-07-27 22:52:57,603 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call next(-6635348671997204137, 30) from 98.136.191.88:33916:
> output error
> 2010-07-27 22:52:57,603 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2010-07-27 22:52:57,603 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 21 on 60020 caught: java.nio.channels.ClosedChannelException
> ....................
> .....................
> 2010-07-27 22:52:58,059 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>
> Thanks in advance,
> Vidhya
>
> On 7/26/10 2:43 PM, "Vidhyashankar Venkataraman" <vidhyash@yahoo-inc.com>
> wrote:
>
> I am trying to assess the performance of scans on a 100 TB db on 180 nodes
> running HBase 0.20.5..
>
> I run a sharded scan (each map task scans a specific range; speculative
> execution is turned off so that there is no duplication of tasks) on a fully
> compacted table...
>
> 1 MB block size, block cache enabled.. Max of 2 tasks per node.. Each row
> is 30 KB in size: 1 big column family with just one field..
> The region lease timeout is set to an hour.. And I don't get any socket timeout
> exceptions, so I have not changed the write socket timeout...
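>
> (For concreteness, a sketch of the family settings described above, assuming the
> family is the "bigColumn" used in the scan; this is only an illustration of the
> layout, not the actual DDL used for this table:)
>
>    // Sketch only: one big family, 1 MB HFile block size, block cache on.
>    HTableDescriptor desc = new HTableDescriptor(tableName);
>    HColumnDescriptor family = new HColumnDescriptor("bigColumn");
>    family.setBlocksize(1024 * 1024);     // the 1 MB block size mentioned above
>    family.setBlockCacheEnabled(true);    // block cache enabled for this run
>    desc.addFamily(family);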
>
> I ran experiments on the following cases:
>
>  1.  The client-level cache is set to 1 (the default: I got the number using
> getCaching()): the MR tasks take around 13 hours to finish on average,
> which gives around 13.17 MBps per node. The worst case is 34 hours (to
> finish the entire job)...
>  2.  Client cache set to 20 rows: this is much worse than the previous
> case: we get a super low ~1 MBps per node...
>
>         Question: Should I set the caching to a value such that the block size is a
> multiple of the cache size, or set the cache size to a much lower
> value? (See the rough sketch just below.)
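>
> (A back-of-the-envelope sketch for reference; the ~30 KB row size is from above,
> but the 2 MB-per-RPC target is just an assumption, not a recommendation:)
>
>    int rowSizeBytes = 30 * 1024;               // approximate row size, as above
>    int targetBytesPerNext = 2 * 1024 * 1024;   // assumed payload per next() RPC
>    int caching = Math.max(1, targetBytesPerNext / rowSizeBytes);   // ~68 rows
>    scan.setCaching(caching);
>    // The scanner lease (hbase.regionserver.lease.period) has to outlive the time
>    // a map task spends on one batch of 'caching' rows, or the lease expires
>    // between next() calls.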
>
> I find that these numbers are much less than the ones I get when it's
> running with just a few nodes..
>
> Can you guys help me with this problem?
>
> Thank you
> Vidhya
>
>

