hbase-user mailing list archives

From Peter Marron <petermar...@discover.com>
Subject Avoiding duplicate writes
Date Thu, 11 Jan 2018 10:16:49 GMT

We have a problem when writing large numbers of records to HBase.
We do not specify timestamps explicitly, so it can happen that multiple records are written
in the same millisecond.
Unfortunately, when records are written with the same timestamp, the later writes are treated
as updates of the earlier records and not as separate records, which is what we want.
So we want to be able to guarantee that records are not treated as overwrites (unless we explicitly
make them so).
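
To make the problem concrete, here is a minimal sketch of the behaviour as I understand it.
It is plain Java and does not use the HBase API: VersionedCellModel is a hypothetical stand-in
for a single (rowkey, column family, column qualifier) coordinate, with versions keyed only
by timestamp, and the timestamp value is arbitrary.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical model of one (rowkey, family, qualifier) coordinate.
// This is NOT HBase code; it only mimics "versions are keyed by timestamp".
final class VersionedCellModel {
    // timestamp -> value, newest timestamp first
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    // A write with an existing timestamp replaces that version.
    void put(long timestampMillis, String value) {
        versions.put(timestampMillis, value);
    }

    int versionCount() {
        return versions.size();
    }

    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedCellModel cell = new VersionedCellModel();
        long sameMillisecond = 1515665809000L; // arbitrary example timestamp
        cell.put(sameMillisecond, "first");
        cell.put(sameMillisecond, "second");
        System.out.println(cell.versionCount()); // prints 1
        System.out.println(cell.latest());       // prints second
    }
}
```

Two writes go in, but only one version comes out, which is the loss we are trying to avoid.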

As I understand it there are (at least) two different ways to proceed.

The first approach is to increase the resolution of the timestamp.
So we could use something like java.lang.System.nanoTime().
However, although this seems to ameliorate the problem, it seems to introduce other problems.
Also, ideally we would like something that guarantees that we don't lose writes, rather than
just making lost writes less likely.
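
One of the problems with nanoTime() is that its values are measured from an arbitrary origin,
so they are not wall-clock times. A different technique that does give a guarantee, at least
within one client process, is to keep millisecond timestamps but never hand out the same value
twice. The following is only a sketch under that assumption: MonotonicTimestamps is a
hypothetical helper, not part of HBase, and it does nothing to coordinate several clients
or JVMs.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical client-side helper (not an HBase class).
// Hands out strictly increasing timestamps within a single instance.
final class MonotonicTimestamps {
    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);
    private final LongSupplier clock;

    MonotonicTimestamps(LongSupplier clock) {
        this.clock = clock;
    }

    // Returns the current clock value, or the previous result + 1 if the
    // clock has not advanced past it yet.
    long next() {
        return last.updateAndGet(prev -> Math.max(prev + 1, clock.getAsLong()));
    }

    public static void main(String[] args) {
        MonotonicTimestamps timestamps = new MonotonicTimestamps(System::currentTimeMillis);
        long a = timestamps.next();
        long b = timestamps.next();
        System.out.println(b > a); // prints true
    }
}
```

The value returned by next() would then be passed as the explicit timestamp of each write.
The cost is that under a sustained burst the assigned timestamps run ahead of the real clock.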

The second approach is to write a prePut co-processor.
In the prePut I can do a read using the same rowkey, column family and column qualifier and
omit the timestamp.
As I understand it, this will return the most recent version and hence the latest timestamp.
Then I can update the timestamp that I am going to write, if necessary, to make sure that
the timestamp is always unique.
In this way I can guarantee that none of my writes are accidentally turned into updates.
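
The timestamp adjustment I have in mind can be sketched as follows. This is plain Java that
models the logic only: it is not a coprocessor and uses no HBase classes. UniqueTimestampAssigner
and its in-memory map are hypothetical stand-ins for "read the latest timestamp for this
rowkey, column family and column qualifier". Because only the latest timestamp is known, the
sketch always moves the new write past it, rather than merely checking for equality.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of the "read latest, then bump" rule (not HBase code).
final class UniqueTimestampAssigner {
    // (rowkey, family, qualifier) -> latest timestamp written so far.
    // Stands in for the read that the prePut hook would perform.
    private final ConcurrentMap<List<String>, Long> latestByCell = new ConcurrentHashMap<>();

    // Returns the timestamp to use for this write: the proposed one if it is
    // newer than everything seen for the cell, otherwise latest + 1.
    long assign(String row, String family, String qualifier, long proposed) {
        List<String> cellKey = Arrays.asList(row, family, qualifier);
        return latestByCell.merge(cellKey, proposed,
                (latest, p) -> Math.max(p, latest + 1));
    }

    public static void main(String[] args) {
        UniqueTimestampAssigner assigner = new UniqueTimestampAssigner();
        System.out.println(assigner.assign("row1", "cf", "q", 100L)); // prints 100
        System.out.println(assigner.assign("row1", "cf", "q", 100L)); // prints 101
        System.out.println(assigner.assign("row2", "cf", "q", 100L)); // prints 100
    }
}
```

In the sketch the map's merge() makes the read-and-bump step atomic per cell; whether the real
prePut hook gives an equivalent guarantee is exactly what I am unsure about.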

However, this approach seems to be expensive.
I have to do a read before each write, and although (I believe) the read will be served by the
same region server, it's still going to slow things down a lot.
Also I am assuming that the prePut co-processor is executed inside a record lock so that I
don't have to worry about synchronization.
Is this true?

Is there a better way?

Maybe there is some implementation of this already that I can pick up?

Maybe there is some way that I can implement this more efficiently?

It seems to me that this might be better handled at compaction.
Shouldn't there be some way that I can mark writes with some sort of special timestamp value
that means that this write should never be considered as an update but always as a separate
record?

Any advice gratefully received.

Peter Marron
