hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: snapshot timeout problem
Date Tue, 22 Jul 2014 15:28:35 GMT
Here is a code snippet from StochasticLoadBalancer
w.r.t. TableSkewCostFunction:

    private static final String TABLE_SKEW_COST_KEY =
        "hbase.master.balancer.stochastic.tableSkewCost";
    private static final float DEFAULT_TABLE_SKEW_COST = 35;

    TableSkewCostFunction(Configuration conf) {
      super(conf);
      this.setMultiplier(conf.getFloat(TABLE_SKEW_COST_KEY, DEFAULT_TABLE_SKEW_COST));
    }

You can try increasing the value of
"hbase.master.balancer.stochastic.tableSkewCost" above its default of 35.

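To illustrate why a larger multiplier might help: as far as I understand, the stochastic balancer scores a candidate layout as a weighted sum of its cost functions, each scaled by its multiplier. The sketch below is not HBase code. The class and method names are hypothetical, and the raw cost values and the "other" multiplier are made up; only the 35 comes from the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not HBase code: total cost as a weighted sum of cost terms. */
public class WeightedCostSketch {

    /** Sums multiplier * rawCost over every named cost term. */
    public static double totalCost(Map<String, Double> multipliers,
                                   Map<String, Double> rawCosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : multipliers.entrySet()) {
            total += e.getValue() * rawCosts.getOrDefault(e.getKey(), 0.0);
        }
        return total;
    }

    /** Fraction of the total cost contributed by one named term. */
    public static double share(String name,
                               Map<String, Double> multipliers,
                               Map<String, Double> rawCosts) {
        double total = totalCost(multipliers, rawCosts);
        if (total == 0.0) {
            return 0.0;
        }
        return multipliers.get(name) * rawCosts.getOrDefault(name, 0.0) / total;
    }

    public static void main(String[] args) {
        // Made-up raw costs; "other" stands in for all remaining cost functions.
        Map<String, Double> rawCosts = new LinkedHashMap<>();
        rawCosts.put("tableSkew", 0.4);
        rawCosts.put("other", 0.1);

        Map<String, Double> multipliers = new LinkedHashMap<>();
        multipliers.put("tableSkew", 35.0);  // default from the snippet above
        multipliers.put("other", 500.0);     // made-up weight for illustration

        System.out.println("share at 35:  " + share("tableSkew", multipliers, rawCosts));

        // Raising the multiplier makes table skew a larger part of the total cost.
        multipliers.put("tableSkew", 350.0);
        System.out.println("share at 350: " + share("tableSkew", multipliers, rawCosts));
    }
}
```

With a higher tableSkewCost, a layout that leaves one table's regions piled on one server should score worse relative to the alternatives, so the balancer would be more inclined to move them. The property would normally go in hbase-site.xml on the master.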

Cheers


On Tue, Jul 22, 2014 at 6:59 AM, Brian Jeltema <brian.jeltema@digitalenvoy.net> wrote:

> I don’t understand the logging output, but I do see a strange pattern.
> I’ll try to summarize.
>
> There are 5 RegionServers, call them rs1 through rs5. There are a total of
> 174 regions for the table in question,
> with 69 in rs1. In the log output I see lines (greatly simplified) like
> the following:
>
>    AssignmentManager: Assigning fooTable, …. to rs2
>    AssignmentManager: Assigning fooTable, …. to rs3
>    AssignmentManager: Assigning fooTable, …. to rs4
>    AssignmentManager: Assigning fooTable, …. to rs5
>
> There are 106 such lines, none logging an assignment to rs1.
>
> I also see 105 lines like:
>
>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs2
>   AssignmentManager: Using pre-existing plan for fooTable … src=rs1 … dest=rs3
>   …
>
> where src=rs1 in every case, and dest=rs1 never occurs.
>
> I don’t see any exceptions or log output that reports a problem.
>
>
> On Jul 22, 2014, at 9:18 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > The load balancer in 0.98 considers many factors when making balancing decisions.
> >
> > Can you take a look at the master log and look for balancer-related lines?
> > That would give you some clue.
> >
> > Cheers
> >
> > On Jul 22, 2014, at 5:03 AM, Brian Jeltema <brian.jeltema@digitalenvoy.net> wrote:
> >
> >> I ran the balancer from hbase shell, but don’t see any change. Is there a way to balance a specific table?
> >>
> >>> bq. One RegionServer has 69 regions
> >>>
> >>> Can you run the load balancer so that your regions are better balanced?
> >>>
> >>> Cheers
> >>>
> >>>
> >>> On Mon, Jul 21, 2014 at 6:56 AM, Brian Jeltema <
> >>> brian.jeltema@digitalenvoy.net> wrote:
> >>>
> >>>> There are 174 regions, not well balanced. One RegionServer has 69 regions.
> >>>> That RegionServer generates a series of log entries (modified and shown below),
> >>>> one for each region, at roughly 1 to 2 second intervals. The timeout period
> >>>> expires when it reaches region 36.
> >>>>
> >>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating references for hfiles
> >>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Adding snapshot references for [hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2] hfiles
> >>>> 2014-07-21 07:49:44,503 regionserver.HRegion: Creating reference for file (1/1) : hdfs://xxx.digitalenvoy.net:8020/apps/hbase/data/data/default/hosts/31e2a098e9e311c4ddcfd3d8da28dfb6/p/3749b6df36c749508fe9c6f54ca425f2
> >>>> 2014-07-21 07:49:45,136 snapshot.FlushSnapshotSubprocedure: ... Flush Snapshotting region hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6. completed.
> >>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Closing region operation on hosts,\x00\x03|\xBF!,1400600029600.31e2a098e9e311c4ddcfd3d8da28dfb6.
> >>>> 2014-07-21 07:49:45,137 DEBUG [rs(xxx.digitalenvoy.net,60020,1405943192177)-snapshot-pool3-thread-1] snapshot.FlushSnapshotSubprocedure: Starting region operation on hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729.
> >>>> 2014-07-21 07:49:45,137 DEBUG [member: 'xxx.digitalenvoy.net,60020,1405943192177' subprocedure-pool1-thread-2] snapshot.RegionServerSnapshotManager: Completed 1/174 local region snapshots.
> >>>> 2014-07-21 07:49:45,137 snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region hosts,\x00\x8A\x90\xD6\x08,1400659179080.a74402fcbd9a96a7c92b250721095729. started...
> >>>> 2014-07-21 07:49:45,137 regionserver.HRegion: Storing region-info for snapshot.
> >>>>
> >>>> On Jul 21, 2014, at 9:21 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >>>>
> >>>>> Can you also tell us more about your table? How many regions on how many
> >>>>> region servers?
> >>>>>
> >>>>>
> >>>>> 2014-07-21 8:23 GMT-04:00 Ted Yu <yuzhihong@gmail.com>:
> >>>>>
> >>>>>> Normally such a timeout is caused by one region server that is slow in
> >>>>>> completing its part of the snapshot procedure.
> >>>>>>
> >>>>>> Have you looked at the region server logs?
> >>>>>> Feel free to pastebin the relevant portion.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> On Jul 21, 2014, at 4:03 AM, Brian Jeltema <brian.jeltema@digitalenvoy.net> wrote:
> >>>>>>
> >>>>>>> I’m running HBase 0.98. I’m trying to snapshot a table, but it’s timing out after 60 seconds.
> >>>>>>> I increased the value of hbase.snapshot.master.timeoutMillis and restarted HBase,
> >>>>>>> but the timeout still happens after 60 seconds. Any suggestions?
> >>>>>>>
> >>>>>>> Brian
> >>
> >
>
>
