hbase-user mailing list archives

From Alexandre Normand <alexandre.norm...@gmail.com>
Subject Re: Seeking advice on skipped/lost data during data migration from and to a hbase table
Date Tue, 07 Feb 2017 21:41:46 GMT
I agree we should upgrade to get this fix, and I've opened a Cloudera support
case to get a patched release. I'm hoping we can also get more confidence that
this is indeed the cause, but I'll pursue that through Cloudera support.

Thanks for the help!

On Tue, Feb 7, 2017 at 12:45 PM Ted Yu <yuzhihong@gmail.com> wrote:

> There is not much in the log which would indicate the trigger of this bug.
>
> From the information presented in this thread, it is highly likely that
> once you deploy a build with the fix from HBASE-15378, you will get
> consistent results from your map tasks.
>
> I suggest you arrange an upgrade of your cluster soon.
>
> Cheers
>
> On Tue, Feb 7, 2017 at 10:34 AM, Alexandre Normand <
> alexandre.normand@gmail.com> wrote:
>
> > Thanks for the correction, Sean.
> >
> > I'm thinking of trying to reproduce the problem on a non-production
> > cluster using the same migration job I mentioned in my original post
> > (we have data similar to production on a non-prod cluster), but I'm
> > not sure how to validate that what we're experiencing is related to that
> > bug. Ideally, we'd have some hint in the scanner client or region server
> > logs, but I haven't seen anything from looking at HBASE-13090.
> >
> > Did I miss something that could be useful?
> >
> >
> >
> > On Tue, Feb 7, 2017 at 10:09 AM Sean Busbey <busbey@apache.org> wrote:
> >
> > > HBASE-15378 says that it was caused by HBASE-13090, I think.
> > >
> > > That issue is present in CDH5.5.4:
> > >
> > >
> > > http://archive.cloudera.com/cdh5/cdh/5/hbase-1.0.0-cdh5.5.4.releasenotes.html
> > >
> > > (Search the page for HBASE-13090.)
> > >
> > > On Tue, Feb 7, 2017 at 11:51 AM, Alexandre Normand
> > > <alexandre.normand@gmail.com> wrote:
> > > > Reporting back with some results.
> > > >
> > > > We ran several RowCounters and each one gave us the same count back. It
> > > > could be because RowCounter is much more lightweight than our migration
> > > > job (which reads every cell and turns around to write an equivalent
> > > > version into another table), but it's hard to tell.
> > > >
> > > > Taking a step back, it looks like the bug described in HBASE-15378 was
> > > > introduced in 1.1.0, which wouldn't affect us since we're still on
> > > > 1.0.0-cdh5.5.4.
> > > >
> > > > I guess that puts us back to square one. Any other ideas?
> > > >
> > > > On Sun, Feb 5, 2017 at 1:10 PM Alexandre Normand <
> > > > alexandre.normand@gmail.com> wrote:
> > > >
> > > >> That's a good suggestion. I'll give that a try.
> > > >>
> > > >> Thanks again!
> > > >>
> > > >> On Sun, Feb 5, 2017 at 1:07 PM Ted Yu <yuzhihong@gmail.com> wrote:
> > > >>
> > > >> You can run rowcounter on the source tables multiple times.
> > > >>
> > > >> With region servers under load, you would observe inconsistent results
> > > >> from different runs.
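
For illustration, here is a minimal driver for repeating that count, assuming
the stock org.apache.hadoop.hbase.mapreduce.RowCounter shipped with HBase 1.0
(the CLI equivalent is: hbase org.apache.hadoop.hbase.mapreduce.RowCounter
<table>); the table name "source_table" below is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.mapreduce.Job;

public class RepeatedRowCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Run the same count several times; totals that diverge while the region
    // servers are under load would support the scanner-bug theory.
    for (int run = 1; run <= 3; run++) {
      // "source_table" is a placeholder for the real source table name.
      Job job = RowCounter.createSubmittableJob(conf, new String[] { "source_table" });
      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("RowCounter run " + run + " failed");
      }
      // The per-run row total shows up in the job's ROWS counter, printed with
      // the rest of the job counters by waitForCompletion(true).
    }
  }
}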
> > > >>
> > > >> On Sun, Feb 5, 2017 at 12:54 PM, Alexandre Normand <
> > > >> alexandre.normand@gmail.com> wrote:
> > > >>
> > > >> > Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't in the
> > > >> > fixed versions, so this might be related. It is somewhat reassuring to
> > > >> > think that this would be missed data on the scan/source side, because
> > > >> > that would mean our other ingest/write workloads wouldn't be affected.
> > > >> >
> > > >> > From reading the JIRA description, it sounds like it would be difficult
> > > >> > to confirm that we've been affected by this bug. Am I right?
> > > >> >
> > > >> > On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhihong@gmail.com> wrote:
> > > >> >
> > > >> > > Which release of HBase are you using?
> > > >> > >
> > > >> > > To be specific, does the release have HBASE-15378?
> > > >> > >
> > > >> > > Cheers
> > > >> > >
> > > >> > > On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <
> > > >> > > alexandre.normand@gmail.com> wrote:
> > > >> > >
> > > >> > > > We're migrating data from a previous iteration of a table to a new
> > > >> > > > one, and this process involved an MR job that scans data from the
> > > >> > > > source table and writes the equivalent data to the new table. The
> > > >> > > > source table has 6000+ regions and it frequently splits because
> > > >> > > > we're still ingesting time series data into it. We used buffered
> > > >> > > > writing on the other end when writing to the new table, and we have
> > > >> > > > a YARN resource pool to limit the concurrent writing.
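
As a rough illustration only (the actual migration job isn't shown in this
thread), the scan side of such a job could be wired up roughly like this,
assuming HBase 1.0's TableMapReduceUtil; "source_table", "table-migration" and
CopyMapper are placeholder names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MigrationJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "table-migration");
    job.setJarByClass(MigrationJobDriver.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per scanner RPC
    scan.setCacheBlocks(false);  // don't churn the block cache during a full scan

    // One map task per region of the source table; the mapper itself writes to
    // the destination table (see the CopyMapper sketch further down), so the
    // job produces no MR output of its own.
    TableMapReduceUtil.initTableMapperJob(
        "source_table", scan, CopyMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}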
> > > >> > > >
> > > >> > > > First, I should say that this job took a long time but still mostly
> > > >> > > > worked. However, we've built a mechanism to compare data fetched from
> > > >> > > > each of the tables and found that some rows (0.02%) are missing from
> > > >> > > > the destination. We've ruled out a few things already:
> > > >> > > >
> > > >> > > > * A functional bug in the job that would have resulted in skipping
> > > >> > > > that 0.02% of the rows.
> > > >> > > > * The possibility that the data didn't exist when the migration job
> > > >> > > > initially ran.
> > > >> > > >
> > > >> > > > At a high level, the suspects could be:
> > > >> > > >
> > > >> > > > * The source table splitting could have resulted in some input keys
> > > >> > > > not being read. However, since an HBase split is defined by a
> > > >> > > > startKey/endKey, this would not be expected unless there was a bug in
> > > >> > > > there somehow.
> > > >> > > > * The writing/flushing losing a batch. Since we're buffering writes
> > > >> > > > and flushing everything in the cleanup of the map tasks, we would
> > > >> > > > expect write failures to cause task failures/retries and therefore
> > > >> > > > not to be a problem in the end. Given that this flush is synchronous
> > > >> > > > and, according to our understanding, completes once the data is in
> > > >> > > > the WAL and memstore, this also seems unlikely unless there's a bug
> > > >> > > > (a rough sketch of this write path follows below).
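
A minimal sketch of that write path, assuming (as described above) a
BufferedMutator from the HBase 1.0 client API that is flushed synchronously in
the mapper's cleanup(); "dest_table" and CopyMapper are placeholder names, not
the actual job:

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

public class CopyMapper extends TableMapper<NullWritable, NullWritable> {
  private Connection connection;
  private BufferedMutator mutator;

  @Override
  protected void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    // Buffered writes: puts accumulate client-side and are sent in batches.
    mutator = connection.getBufferedMutator(TableName.valueOf("dest_table"));
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    // Re-emit every cell of the source row as an equivalent Put on the new table.
    Put put = new Put(key.copyBytes());
    for (Cell cell : value.rawCells()) {
      put.add(cell);
    }
    mutator.mutate(put);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Synchronous flush of anything still buffered; an exception here fails the
    // task, which is expected to trigger a retry of the whole map task.
    mutator.flush();
    mutator.close();
    connection.close();
  }
}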
> > > >> > > >
> > > >> > > > I should add that we've extracted a sample of 1% of the source rows
> > > >> > > > (doing all of them is really time-consuming because of the size of
> > > >> > > > the data) and found that missing data often appears in clusters of
> > > >> > > > source HBase row keys. This doesn't really help point to a problem on
> > > >> > > > the scan side of things or the write side of things (since a failure
> > > >> > > > in either would result in a similar output), but we thought it was
> > > >> > > > interesting. That said, we do have a few missing keys that aren't
> > > >> > > > clustered. This could be because we've only run the comparison for 1%
> > > >> > > > of the data, or it could be that whatever is causing this can affect
> > > >> > > > very isolated cases.
> > > >> > > >
> > > >> > > > We're now trying to understand how this could have happened, both to
> > > >> > > > understand how it could impact other jobs/applications and to have
> > > >> > > > more confidence before we write a modified version of the migration
> > > >> > > > job to re-migrate the skipped/missing data.
> > > >> > > >
> > > >> > > > Any ideas or advice would be much appreciated.
> > > >> > > >
> > > >> > > > Thanks!
> > > >> > > >
> > > >> > > > --
> > > >> > > > Alex
> > > >> > > >
> > > >> > >
> > > >> > --
> > > >> > Alex
> > > >> >
> > > >>
> > > >> --
> > > >> Alex
> > > >>
> > > > --
> > > > Alex
> > >
> > --
> > Alex
> >
>
-- 
Alex
