hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: Lease does not exist exceptions
Date Thu, 27 Oct 2011 17:21:50 GMT

I'll add something in the docs.


On 10/27/11 3:35 AM, "Lucian Iordache" <lucian.george.iordache@gmail.com>
wrote:

>Yep, it did not work entirely.
>
>I had a job to run on 1000 regions, and the caching was 200. The job crashed
>with a lot of ClosedChannelExceptions + LeaseExceptions.
>
>Set the caching to 10 ==> the same.
>Set the caching to 1 ==> ~600 successfully completed tasks, but still a lot
>of them crashed ==> job crashed
>Set the hbase.rpc.timeout to 240000 (which is the lease timeout on the
>region server) ==> the job completed successfully, without any failed
>attempts.
>
>The problem was that we have some very large regions (2GB), and some of
>them have very little data, which is why it takes more than 60 seconds to
>get even the first row. As Daniel said, the documentation of the region
>server lease timeout and of hbase.rpc.timeout should mention that you need
>to be careful when modifying them, because you can run into problems like
>the ones in our case.
>
>Regards,
>Lucian
>
>On Wed, Oct 26, 2011 at 7:53 PM, Jean-Daniel Cryans
><jdcryans@apache.org> wrote:
>
>> Did you try setting the scanner caching down like I mentioned?
>>
>> J-D
>>
>> On Wed, Oct 26, 2011 at 8:48 AM, Lucian Iordache
>> <lucian.george.iordache@gmail.com> wrote:
>> > Problem solved. It was like I said, the server took more than the
>> > hbase.rpc.timeout to run the call and the client closed the connection.
>> >
>> > Best Regards,
>> > Lucian
>> >
>> > On Tue, Oct 25, 2011 at 11:15 AM, Lucian Iordache <
>> > lucian.george.iordache@gmail.com> wrote:
>> >
>> >> Yes, I will try to see the SocketTimeoutException after switching the
>> >> logging to debug, because, like it says here
>> >> https://issues.apache.org/jira/browse/HBASE-3154 , this is logged on
>> >> debug on the client side.
>> >>
>> >> Regards,
>> >> Lucian
>> >>
>> >>
>> >> On Mon, Oct 24, 2011 at 8:22 PM, Jean-Daniel Cryans
>> >> <jdcryans@apache.org> wrote:
>> >>
>> >>> So you should see the SocketTimeoutException in your *client* logs (in
>> >>> your case, mappers), not LeaseException. At this point, yes, you're
>> >>> going to time out, but if you spend so much time cycling on the server
>> >>> side then you shouldn't set a high caching configuration on your
>> >>> scanner, as IO isn't your bottleneck.
>> >>>
>> >>> J-D
>> >>>
>> >>> On Mon, Oct 24, 2011 at 10:15 AM, Lucian Iordache
>> >>> <lucian.george.iordache@gmail.com> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > The servers have been restarted (I have had this configuration for more
>> >>> > than a month, so this is not the problem).
>> >>> > About the stack traces, they show exactly the same: a lot of
>> >>> > ClosedChannelExceptions and LeaseExceptions.
>> >>> >
>> >>> > But I found something that could be the problem: hbase.rpc.timeout. This
>> >>> > defaults to 60 seconds, and I did not modify it in hbase-site.xml. So it
>> >>> > could happen the following way:
>> >>> > - the mapper makes a scanner.next call to the region server
>> >>> > - the region server needs more than 60 seconds to execute it (I use
>> >>> > multiple filters, and it could take a lot of time)
>> >>> > - the scan client gets the timeout and cuts the connection
>> >>> > - the region server tries to send the results to the client ==>
>> >>> > ClosedChannelException
>> >>> >
>> >>> > I will take a deeper look into it tomorrow. If you have other
>> >>> > suggestions, please let me know!
>> >>> >
>> >>> > Thanks,
>> >>> > Lucian
>> >>> >
>> >>> > On Mon, Oct 24, 2011 at 8:00 PM, Jean-Daniel Cryans
>> >>> > <jdcryans@apache.org> wrote:
>> >>> >
>> >>> >> Did you restart the region servers after changing the config?
>> >>> >>
>> >>> >> Are you sure it's the same exception/stack trace?
>> >>> >>
>> >>> >> J-D
>> >>> >>
>> >>> >> On Mon, Oct 24, 2011 at 8:04 AM, Lucian Iordache
>> >>> >> <lucian.george.iordache@gmail.com> wrote:
>> >>> >> > Hi all,
>> >>> >> >
>> >>> >> > I have exactly the same problem that Eran had.
>> >>> >> > But there is something I don't understand: in my case, I have set
>> >>> >> > the lease time to 240000 (4 minutes). But most of the map tasks that
>> >>> >> > are failing run about 2 minutes. How is it possible to get a
>> >>> >> > LeaseException if the task runs less than the configured time for a
>> >>> >> > lease?
>> >>> >> >
>> >>> >> > Regards,
>> >>> >> > Lucian Iordache
>> >>> >> >
>> >>> >> > On Fri, Oct 21, 2011 at 12:34 AM, Eran Kutner <eran@gigya.com> wrote:
>> >>> >> >
>> >>> >> >> Perfect! Thanks.
>> >>> >> >>
>> >>> >> >> -eran
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Thu, Oct 20, 2011 at 23:27, Jean-Daniel Cryans
>> >>> >> >> <jdcryans@apache.org> wrote:
>> >>> >> >>
>> >>> >> >> > hbase.regionserver.lease.period
>> >>> >> >> >
>> >>> >> >> > Set it bigger than 60000.
>> >>> >> >> >
>> >>> >> >> > J-D
>> >>> >> >> >
>> >>> >> >> > On Thu, Oct 20, 2011 at 2:23 PM, Eran Kutner <eran@gigya.com> wrote:
>> >>> >> >> > >
>> >>> >> >> > > Thanks J-D!
>> >>> >> >> > > Since my main table is expected to continue growing, I guess that
>> >>> >> >> > > at some point even setting the cache size to 1 will not be enough.
>> >>> >> >> > > Is there a way to configure the lease timeout?
>> >>> >> >> > >
>> >>> >> >> > > -eran
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> > > On Thu, Oct 20, 2011 at 23:16, Jean-Daniel Cryans
>> >>> >> >> > > <jdcryans@apache.org> wrote:
>> >>> >> >> > >
>> >>> >> >> > > > On Wed, Oct 19, 2011 at 12:51 PM, Eran Kutner <eran@gigya.com>
>> >>> >> >> > > > wrote:
>> >>> >> >> > > >
>> >>> >> >> > > > > Hi J-D,
>> >>> >> >> > > > > Thanks for the detailed explanation.
>> >>> >> >> > > > > So if I understand correctly, the lease we're talking about is
>> >>> >> >> > > > > a scanner lease and the timeout is between two scanner calls,
>> >>> >> >> > > > > correct? I think that makes sense, because I now realize that
>> >>> >> >> > > > > the jobs that fail (some jobs continued to fail even after
>> >>> >> >> > > > > reducing the number of map tasks as Stack suggested) use
>> >>> >> >> > > > > filters to fetch relatively few rows out of a very large
>> >>> >> >> > > > > table, so they could be spending a lot of time on the region
>> >>> >> >> > > > > server scanning rows until it reached my setCaching value,
>> >>> >> >> > > > > which was 1000. Setting the caching value to 1 seems to allow
>> >>> >> >> > > > > these jobs to complete.
>> >>> >> >> > > > > I think it has to be the above, since my rows are small, with
>> >>> >> >> > > > > just a few columns, and processing them is very quick.
>> >>> >> >> > > > >
>> >>> >> >> > > >
>> >>> >> >> > > > Excellent!
>> >>> >> >> > > >
>> >>> >> >> > > >
>> >>> >> >> > > > >
>> >>> >> >> > > > > However, there are still a couple of things I don't
>> >>> >> >> > > > > understand:
>> >>> >> >> > > > > 1. What is the difference between setCaching and setBatch?
>> >>> >> >> > > > >
>> >>> >> >> > > >
>> >>> >> >> > > > * setBatch: Set the maximum number of values to return for each
>> >>> >> >> > > > call to next()
>> >>> >> >> > > >
>> >>> >> >> > > > VS
>> >>> >> >> > > >
>> >>> >> >> > > > * setCaching: Set the number of rows for caching that will be
>> >>> >> >> > > > passed to scanners.
>> >>> >> >> > > >
>> >>> >> >> > > > The former is useful if you have rows with millions of columns
>> >>> >> >> > > > and you could setBatch to get only 1000 of them at a time. You
>> >>> >> >> > > > could call that intra-row scanning.
>> >>> >> >> > > >
>> >>> >> >> > > >
>> >>> >> >> > > > > 2. Examining the region server logs more closely than I did
>> >>> >> >> > > > > yesterday, I see a lot of ClosedChannelExceptions in addition
>> >>> >> >> > > > > to the expired leases (but no UnknownScannerException). Is
>> >>> >> >> > > > > that expected? You can see an excerpt of the log from one of
>> >>> >> >> > > > > the region servers here: http://pastebin.com/NLcZTzsY
>> >>> >> >> > > >
>> >>> >> >> > > >
>> >>> >> >> > > > It means that when the server got to process that client request
>> >>> >> >> > > > and started reading from the socket, the client was already
>> >>> >> >> > > > gone. Killing a client does that (or killing a MR that scans),
>> >>> >> >> > > > so does SocketTimeoutException. This should probably go in the
>> >>> >> >> > > > book. We should also print something nicer :)
>> >>> >> >> > > >
>> >>> >> >> > > > J-D
>> >>> >> >> > > >
>> >>> >> >> >
>> >>> >> >>
>> >>> >> >
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Numai bine,
>> >> Lucian
>> >>
>> >
>> >
>> >
>> > --
>> > Numai bine,
>> > Lucian
>> >
>>
>
>
>
>-- 
>Numai bine,
>Lucian

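For readers landing on this thread: the fix Lucian describes comes down to two timeouts, the region server scanner lease (hbase.regionserver.lease.period) and the client RPC timeout (hbase.rpc.timeout). Below is a minimal hbase-site.xml sketch, assuming the 240000 ms value quoted in the thread; that value is simply what was used there, not a recommendation. As J-D's question in the thread implies, the region servers would need a restart after changing the lease period. The thread also says it is the scan client that times out, so hbase.rpc.timeout presumably has to be visible to the job's client configuration as well.

```xml
<!-- hbase-site.xml fragment (sketch). Property names are the ones named in
     the thread; the 240000 ms values are the ones Lucian reported using. -->
<configuration>
  <property>
    <!-- Scanner lease timeout on the region server, in milliseconds -->
    <name>hbase.regionserver.lease.period</name>
    <value>240000</value>
  </property>
  <property>
    <!-- RPC timeout; the thread says this defaults to 60 seconds -->
    <name>hbase.rpc.timeout</name>
    <value>240000</value>
  </property>
</configuration>
```

Lowering the scanner caching, as J-D suggests, is the other lever; in the thread it helped Eran but was not sufficient on its own for Lucian's job.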


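J-D's setBatch/setCaching distinction can also be illustrated with a toy model. This is plain Java, not HBase code: the names ScanRpcModel, resultsForRow and nextCalls are made up for illustration. It assumes that a row with batching enabled is split into ceil(cells / batch) Results and that caching counts Results per next() round trip. It ignores filters, the scanner open/close calls and any size limits, so treat it as a rough sketch only.

```java
import java.util.List;

/** Toy model (hypothetical, not part of HBase) of how batch and caching shape a scan. */
class ScanRpcModel {

    /** Results produced for one row; batch <= 0 means the whole row fits in one Result. */
    static long resultsForRow(long cellsInRow, int batch) {
        if (cellsInRow <= 0) {
            return 0;
        }
        if (batch <= 0) {
            return 1;
        }
        return (cellsInRow + batch - 1) / batch; // ceiling division
    }

    /** Data-carrying next() round trips, assuming `caching` Results are shipped per trip. */
    static long nextCalls(List<Long> cellsPerRow, int batch, int caching) {
        long results = 0;
        for (long cells : cellsPerRow) {
            results += resultsForRow(cells, batch);
        }
        return (results + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        // One wide row (2500 cells) and two narrow rows (10 cells each).
        List<Long> rows = List.of(2500L, 10L, 10L);
        System.out.println("no batch, caching 1:      " + nextCalls(rows, -1, 1));
        System.out.println("batch 1000, caching 1:    " + nextCalls(rows, 1000, 1));
        System.out.println("batch 1000, caching 200:  " + nextCalls(rows, 1000, 200));
    }
}
```

In this model, batching only changes how a wide row is cut up, while caching changes how many round trips are made. What the model leaves out is exactly the problem from the thread: with selective filters, the server may scan for a long time before it has collected enough Results to answer a single next() call.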