hbase-user mailing list archives

From Nanheng Wu <nanhen...@gmail.com>
Subject Re: Disabling a table taking very long time
Date Tue, 01 Mar 2011 20:20:54 GMT
Sorry J-D, I tried one more thing: I grepped for "compaction" in the
.META. RS log. I saw a lot of compactions, and each seems to take quite a
long time (one took 23 secs). Would this be something worth looking into?
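
For reference, this is roughly the grep I ran against the .META. RS log
(the log path below is just a placeholder; adjust it for your install):

  grep compaction /path/to/hbase-regionserver.log | tail -n 40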

11/03/01 16:44:30 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 9sec
11/03/01 16:46:18 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:46:18 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982189854
11/03/01 16:46:27 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/5138927692276532859;
store size is 11.1m
11/03/01 16:46:27 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 9sec
11/03/01 16:47:08 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:08 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982189946
11/03/01 16:47:18 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/5873146955360312580;
store size is 11.1m
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 10sec
11/03/01 16:47:18 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 0sec
11/03/01 16:47:19 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:20 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190038
11/03/01 16:47:43 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2025458053952320907;
store size is 11.2m
11/03/01 16:47:43 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 23sec
11/03/01 16:47:43 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:43 INFO regionserver.Store: Started compaction of 4
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190222
11/03/01 16:47:46 INFO regionserver.Store: Completed compaction of 4
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/7645102490458560555;
store size is 11.2m
11/03/01 16:47:46 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 3sec
11/03/01 16:47:47 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:47 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190267
11/03/01 16:47:57 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2555741109593674333;
store size is 11.2m
11/03/01 16:47:58 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 10sec
11/03/01 16:47:58 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:47:58 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190359
11/03/01 16:48:08 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2032875459950620676;
store size is 11.2m
11/03/01 16:48:09 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 10sec
11/03/01 16:48:09 INFO regionserver.HRegion: Starting compaction on
region .META.,,1
11/03/01 16:48:09 INFO regionserver.Store: Started compaction of 3
file(s) in info of .META.,,1  into
/hbase/.META./compaction.dir/1028785192, seqid=1298982190451
11/03/01 16:48:22 INFO regionserver.Store: Completed compaction of 3
file(s) in info of .META.,,1; new storefile is
hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/1870868686430006280;
store size is 11.3m
11/03/01 16:48:22 INFO regionserver.HRegion: compaction completed on
region .META.,,1 in 13sec
11/03/01 16:48:22 INFO regionserver.HRegion: Starting compaction on
region .META.,,1



On Tue, Mar 1, 2011 at 11:01 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> That is correct. Also, disabling a table requires scanning and
> modifying the .META. region so if it happens to be slow then that
> operation will be slow too.
>
> J-D
>
> On Tue, Mar 1, 2011 at 10:55 AM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>> I see, so a Get is only affected for regions not in the client's cache.
>> I think that explains why that error seems hard to reproduce. This is
>> how I understood a Get to work: if the client doesn't have the region in
>> its cache it reads the .META. table; if it does, it makes an RPC to
>> the RS directly. Each RS also knows which regions it actually hosts, so
>> if the region is not there it means the client's cache is stale and the
>> client has to read .META. again. Is this correct?
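>>
>> For concreteness, a minimal example of the kind of read I mean (table and
>> row names made up):
>>
>>   echo "get 'myTable', 'someRow'" | bin/hbase shell
>>
>> The first such get from a fresh client pays the .META. lookup; repeating it
>> from the same client should hit the cached region location.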
>>
>> On Tue, Mar 1, 2011 at 10:48 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>> Depends if you have to query .META. first to get the location of the
>>> region that hosts the row.
>>>
>>> J-D
>>>
>>> On Tue, Mar 1, 2011 at 10:45 AM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>> Man, I really appreciate all the help you've provided so far. I guess
>>>> I'll keep digging. Would this meta scan cause Gets or Scans on user
>>>> tables to be very, very slow too? That was actually the problem I
>>>> first noticed: from time to time a client Get or Scan would be
>>>> extremely slow on my tables.
>>>>
>>>> On Tue, Mar 1, 2011 at 10:38 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>> The store file looks fairly new, maybe it was just major compacted?
>>>>> You could easily figure that out from the region server's log.
>>>>>
>>>>> Your problem is definitely .META. region slowness. If scanning 8k
>>>>> rows takes 30 minutes, scanning and updating 260 rows at the same time
>>>>> probably isn't much faster.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Tue, Mar 1, 2011 at 10:34 AM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> It looks like I don't have that many files:
>>>>>>
>>>>>> drwxr-xr-x   - root supergroup          0 2011-02-28 23:21
>>>>>> /hbase/.META./1028785192
>>>>>> -rw-r--r--   3 root supergroup        932 2011-02-03 19:23
>>>>>> /hbase/.META./1028785192/.regioninfo
>>>>>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>>>>>> /hbase/.META./1028785192/info
>>>>>> -rw-r--r--   3 root supergroup   11831330 2011-03-01 17:13
>>>>>> /hbase/.META./1028785192/info/6687606720393313750
>>>>>> drwxr-xr-x   - root supergroup          0 2011-03-01 17:13
>>>>>> /hbase/.META./compaction.dir
>>>>>>
>>>>>> So you think compaction on the .META. table on the RS could be the
>>>>>> problem? While the metaScanner is running I saw virtually no logs
>>>>>> on that server, though; I did see compactions once the meta scan ended
>>>>>> and region closes started to happen. I wonder what the cluster is
>>>>>> doing while the meta scan is going.
>>>>>>
>>>>>> I would definitely upgrade in the near future, it's just that I am
>>>>>> afraid upgrading won't guarantee a fix if I don't understand the
>>>>>> nature of the problem more clearly. We also have another test cluster,
>>>>>> running the same versions of everything on cheaper hardware and it
>>>>>> doesn't have this problem.
>>>>>>
>>>>>> On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>> That long .META. scan might be the culprit, good catch!
>>>>>>>
>>>>>>> If you run this from the hadoop folder:
>>>>>>>
>>>>>>> bin/hadoop dfs -lsr /hbase/.META.
>>>>>>>
>>>>>>> Do you see a ton of files? Let's define "ton" as more than 16 files.
>>>>>>> If so, and I see you have a very high number of regions, then you
>>>>>>> could be hitting this problem:
>>>>>>> https://issues.apache.org/jira/browse/HBASE-3499
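>>>>>>>
>>>>>>> To just count them, something along these lines from the hadoop folder
>>>>>>> should do (the grep is only a rough filter for the info store files):
>>>>>>>
>>>>>>> bin/hadoop dfs -lsr /hbase/.META. | grep -c '/info/'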
>>>>>>>
>>>>>>> In short, every few .META. changes will result in a flush, and it could
>>>>>>> be that the region server hosting it isn't able to keep up with the
>>>>>>> compactions. It's done that way because append isn't supported in your
>>>>>>> hadoop, so that you won't lose too much data if the node dies, which is
>>>>>>> really bad when it happens to .META.
>>>>>>>
>>>>>>> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE
>>>>>>> back to the default value if you're not writing, hence not splitting.
>>>>>>> Running the script attached in that jira would do it (don't forget to
>>>>>>> restart HBase after running it).
>>>>>>>
>>>>>>> Lastly, upgrading to HBase 0.90.1 and a hadoop that supports append
>>>>>>> should be a priority.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>>>> Hi J-D:
>>>>>>>>
>>>>>>>>  I did the scan like you suggested but no splits came up. This kind
>>>>>>>> of makes sense to me, since we write HFiles using a MR job and then
>>>>>>>> pretty much use HBase as read-only, so maybe that's why the regions
>>>>>>>> never got split? I think my slow-disable problem is related to this
>>>>>>>> RegionManager.metaScanner. I grepped for that in the master's log and
>>>>>>>> I saw:
>>>>>>>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>>>>> .META.,,1, startKey: <>}
>>>>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>>>>> .META.,,1, startKey: <>}
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scan of 8296 row(s) of meta region {server: 10.146.14.38:60020,
>>>>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scanning meta region {server: 10.146.14.38:60020, regionname:
>>>>>>>> .META.,,1, startKey: <>}
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner
>>>>>>>> scan of 8298 row(s) of meta region {server: 10.146.14.38:60020,
>>>>>>>> regionname: .META.,,1, startKey: <>} complete
>>>>>>>>
>>>>>>>> It looks like the meta scanner is always running, each time taking
>>>>>>>> 20-30 minutes. Here's what I did:
>>>>>>>>
>>>>>>>> 1) disable table -> at 16:28:31; according to the log above there's a
>>>>>>>> meta scan in progress.
>>>>>>>>
>>>>>>>> 2) The command hangs, with no output in the .META. RS's log. At
>>>>>>>> 11/03/01 16:46:57 the client got:
>>>>>>>> 11/03/01 16:46:58 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.146.14.38:60020
>>>>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.146.14.38:60020
>>>>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.146.14.38:60020
>>>>>>>> This corresponds exactly to when the meta scan completed. The .META. RS
>>>>>>>> started to spew some logs about closing regions.
>>>>>>>>
>>>>>>>> 3) Another meta scan started and the client hung again until it
>>>>>>>> finished, and then the disable command returned. The whole thing took
>>>>>>>> 2688.5930 seconds.
>>>>>>>>
>>>>>>>> What kind of problem does it look like to you, J-D? What does this
>>>>>>>> background task metaScanner do?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>>> I didn't get most of the information I was looking for... but I do
>>>>>>>>> have something to work with. So I asked whether the metaScanner was
>>>>>>>>> making progress or not, and from that log I think I see it does:
>>>>>>>>>
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers: Rowscanned=261, rowsOffline=3
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all regions to be disabled from myTable
>>>>>>>>>
>>>>>>>>> Only 3 rows are offline, whereas we expect all the regions to be
>>>>>>>>> offline. I expect you see those lines a lot? If so, one thing I'd like
>>>>>>>>> to see is the result of this command:
>>>>>>>>>
>>>>>>>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>>>>>>>
>>>>>>>>> It's going to be big. Then grep in the result for the string SPLIT,
>>>>>>>>> and please post back here the lines that match.
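>>>>>>>>>
>>>>>>>>> If it's easier, you can pipe it straight through the shell and grep in
>>>>>>>>> one go, something like:
>>>>>>>>>
>>>>>>>>> echo "scan '.META.', {STARTROW => 'myTable,,', LIMIT => 261}" | bin/hbase shell | grep SPLIT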
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>>>>>> I issued two more disable commands and looked at the .META. RS. The
>>>>>>>>>> last log entry looks like:
>>>>>>>>>>
>>>>>>>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143. New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>>>>>>>
>>>>>>>>>> I issued the command at 00:34:59. The client debug output has
>>>>>>>>>> nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping
>>>>>>>>>> response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19.
>>>>>>>>>> Then all of a sudden it outputs the following:
>>>>>>>>>>
>>>>>>>>>> 0:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server 10.146.14.38:60020, location region name .META.,,1
>>>>>>>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers: Rowscanned=261, rowsOffline=3
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all regions to be disabled from myTable
>>>>>>>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all regions to be disabled from myTable
>>>>>>>>>>
>>>>>>>>>> Here's the full thread dump of the .META. RS from 00:34:59 to 00:47:31:
>>>>>>>>>> http://pastebin.com/Ywwx8afQ
>>>>>>>>>>
>>>>>>>>>> I hope I compiled some useful information here. I am pretty lost at
>>>>>>>>>> this point; I would really appreciate your help!
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>>>>> Ok, so no debug from the disable command? BTW, you should output your
>>>>>>>>>>> GC logs somewhere else, like the .out file or a completely separate file.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>>>>>>>> I see, so I should jstack the .META. RS. I'll do that.
>>>>>>>>>>>>
>>>>>>>>>>>> The master log pretty much looks like this; should I grep for
>>>>>>>>>>>> something specific?
>>>>>>>>>>>>
>>>>>>>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region complete
>>>>>>>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew: 19136K->2112K(19136K), 0.0033350 secs] 378606K->361596K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
>>>>>>>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew: 19068K->2112K(19136K), 0.0029690 secs] 378552K->361630K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
>>>>>>>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew: 19136K->2112K(19136K), 0.0037690 secs] 378654K->361647K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew: 19136K->1810K(19136K), 0.0032600 secs] 378671K->361346K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew: 18834K->2112K(19136K), 0.0029870 secs] 378370K->361661K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew: 19136K->1747K(19136K), 0.0037090 secs] 378685K->361298K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead, average load 760.2222222222222
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>>>>>>> Are you able to tell if that call in the metaScanner is hanging, or if
>>>>>>>>>>>>> it's making multiple calls to the .META. region?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the former, then jstack the region server that hosts .META. and see
>>>>>>>>>>>>> where it's blocked.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the latter, then it means your .META. region is slow? Again, what's
>>>>>>>>>>>>> going on on the RS that hosts .META.?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Finally, what's the master's log like during that time?
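>>>>>>>>>>>>>
>>>>>>>>>>>>> For the jstack, something like this on the .META. RS host should do,
>>>>>>>>>>>>> assuming jps and jstack are on the PATH and there's one region server
>>>>>>>>>>>>> per box:
>>>>>>>>>>>>>
>>>>>>>>>>>>> jstack $(jps | grep HRegionServer | awk '{print $1}') > meta-rs.jstack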
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Thanks so much for your help so far! I sent disable commands on 4
>>>>>>>>>>>>>> rather small tables and they got stuck for a long time again, so I
>>>>>>>>>>>>>> took a jstack of the master. From what I can tell, all disableTable
>>>>>>>>>>>>>> calls are blocked by a meta scanner thread (sample log below). At the
>>>>>>>>>>>>>> moment there were no other requests to the server at all. How should I
>>>>>>>>>>>>>> investigate this further? If it helps, here's some background: I have
>>>>>>>>>>>>>> several datasets, each of them in a separate table. Our data pipeline
>>>>>>>>>>>>>> produces a new version of each dataset every day and only the latest
>>>>>>>>>>>>>> version should be used. This is how the data is loaded: for each
>>>>>>>>>>>>>> dataset, 1) run a MR job that outputs HFiles, 2) call loadTable.rb to
>>>>>>>>>>>>>> create a new table, 3) disable and drop the previous version (see the
>>>>>>>>>>>>>> sketch below). As a result some calls to load table and drop table
>>>>>>>>>>>>>> would overlap. Please let me know if something stands out to you as a
>>>>>>>>>>>>>> potential culprit. Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6
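>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For step 3, the disable/drop is just the usual shell sequence, roughly
>>>>>>>>>>>>>> (table name made up):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> echo "disable 'myDataset_v1'" | bin/hbase shell
>>>>>>>>>>>>>> echo "drop 'myDataset_v1'" | bin/hbase shell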
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>>>>>>>  State: BLOCKED
>>>>>>>>>>>>>>  Blocked count: 3801
>>>>>>>>>>>>>>  Waited count: 72719
>>>>>>>>>>>>>>  Blocked on java.lang.Object@75ac522c
>>>>>>>>>>>>>>  Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>>>>>>>  Stack:
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>>>>>>>    sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>>>>>>>    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>>>    java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>>>>>>>  State: WAITING
>>>>>>>>>>>>>>  Blocked count: 1526058
>>>>>>>>>>>>>>  Waited count: 1488998
>>>>>>>>>>>>>>  Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>>>>>>>  Stack:
>>>>>>>>>>>>>>    java.lang.Object.wait(Native Method)
>>>>>>>>>>>>>>    java.lang.Object.wait(Object.java:485)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>>>>>>>    $Proxy1.get(Unknown Source)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>>>>>>    org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would approach this problem like I generally do with HBase issues:
>>>>>>>>>>>>>>> first check the master log for any weirdness regarding my problem (in
>>>>>>>>>>>>>>> this case, grep for the table name).
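>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example, something along these lines (the log path is a guess,
>>>>>>>>>>>>>>> adjust for your layout):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> grep -i 'mytable' /var/log/hbase/hbase-*-master-*.log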
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I would look at the region server log(s) of the nodes that were
>>>>>>>>>>>>>>> hosting regions from that table. You should see the steps taken to
>>>>>>>>>>>>>>> disable the regions (starting to close, flush, region completely
>>>>>>>>>>>>>>> closed).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you are able to do it while it's taking a very long time to
>>>>>>>>>>>>>>> disable, try to jstack the process that seems to be hanging.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that
>>>>>>>>>>>>>>> almost prevents disabling a table (or re-enabling it) if any region
>>>>>>>>>>>>>>> recently split and the parent wasn't cleaned yet from .META.; that
>>>>>>>>>>>>>>> is fixed in 0.90.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>>>>>>>>>>>>>> I think you are right; maybe in the long run I need to re-architect my
>>>>>>>>>>>>>>>> system so that it doesn't need to create new and delete old tables all
>>>>>>>>>>>>>>>> the time. In the short term I am having a really hard time with the
>>>>>>>>>>>>>>>> disabling function: I ran a disable command on a very small table
>>>>>>>>>>>>>>>> (probably dozens of MBs in size) with no clients using the cluster at
>>>>>>>>>>>>>>>> all, and that took about 1 hour to complete! The weird thing is that
>>>>>>>>>>>>>>>> on the web UI only the region server carrying the .META. table has
>>>>>>>>>>>>>>>> non-zero requests; all other RS have 0 requests the entire time. I
>>>>>>>>>>>>>>>> would think they should get some request to flush the memstore at
>>>>>>>>>>>>>>>> least. I *am* using the same RS nodes for some map reduce jobs at the
>>>>>>>>>>>>>>>> time, and top shows memory usage is almost full on the .META. RS.
>>>>>>>>>>>>>>>> Would you have some idea of what I should investigate?
>>>>>>>>>>>>>>>> Thanks so much.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
