hbase-user mailing list archives

From Robert Hamilton <rhamil...@whalesharkmedia.com>
Subject Flume EventSerializer vs hbase coprocessor
Date Mon, 01 Apr 2013 21:31:09 GMT
I have a calculation that I'm doing in a custom AsyncHbaseEventSerializer. I want to do the
calculation in real time, but it looks like it could be done either here or in a coprocessor.
I'm just doing it in the serializer for now because the code is simpler that way, and data
will only ever come in through Flume anyway.

But is this good practice?  I would welcome any advice or guidance.

A simplified version of the calculation: 

Every row has a groupID and a data timestamp field; each groupID represents a distinct group
of rows and the timestamp distinguishes between individual rows in the group. We can assume
the combination is always unique. So I construct the rowkey as concatenated groupID, '.' ,
and reverse timestamp.
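
A minimal sketch of that rowkey construction, under assumptions not stated in the post: the timestamp is a non-negative epoch value in milliseconds, "reverse" means Long.MAX_VALUE minus the timestamp, and the groupID itself contains no '.' character. RowKeys and rowKey are hypothetical names, not part of Flume or HBase.

```java
// Hypothetical helper for illustration; not a Flume or HBase class.
final class RowKeys {
    private RowKeys() {}

    // Builds groupID + '.' + reverse timestamp.
    // The reversed value is zero-padded to 19 digits (the width of
    // Long.MAX_VALUE) so that lexicographic key order matches numeric order.
    static String rowKey(String groupId, long timestampMillis) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        long reversed = Long.MAX_VALUE - timestampMillis;
        return groupId + "." + String.format("%019d", reversed);
    }
}
```

With this scheme a newer event gets a lexicographically smaller key, so within one groupID the most recent row sorts first.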

The task: for each row to be inserted into HBase, find the latest row already inserted with
the same groupID (based on the timestamp part of the key), and insert another column holding
the difference between the new row's time and that of the previous record.

For each row the serializer sees, it looks up the previous row using a scan and takes the
first row the scan returns (that's why I'm using the reverse timestamp). It then computes the
difference and adds it to the list of PutRequests.
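
The lookup described above can be modeled in memory. This is only a sketch: a TreeMap stands in for the sorted HBase table, ceilingEntry stands in for "start a scan at the group prefix and take the first row", and dt is assumed to be the new timestamp minus the latest stored one. DeltaSketch, rowKey and insert are hypothetical names. The model is single-threaded and does not address concurrent writers.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a sorted table: row key -> event timestamp.
// Hypothetical class for illustration; it does not call HBase or Flume.
final class DeltaSketch {
    private final TreeMap<String, Long> table = new TreeMap<>();

    // groupID + '.' + zero-padded reverse timestamp (assumed key layout).
    static String rowKey(String groupId, long ts) {
        return groupId + "." + String.format("%019d", Long.MAX_VALUE - ts);
    }

    // Stores the row and returns its dt, or null when the group has no row yet.
    Long insert(String groupId, long ts) {
        String prefix = groupId + ".";
        // First key at or after the prefix: the newest row of this group, if any.
        Map.Entry<String, Long> first = table.ceilingEntry(prefix);
        Long dt = null;
        if (first != null && first.getKey().startsWith(prefix)) {
            dt = ts - first.getValue();
        }
        table.put(rowKey(groupId, ts), ts);
        return dt;
    }
}
```

Inserting timestamp 123400 and then 123456 for group gggg yields null for the first row and 56 for the second, assuming events arrive in timestamp order.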

Example: input data with two rows looks like this:

gggg,123456, 'hello'
gggg,123400, 'there'

The result in HBase would look like this:

Row: gggg.123456
        cf:v = 'hello'
        cf:dt = null             <-- no previous row, so dt is null

Row: gggg.123400
        cf:v = 'there'
        cf:dt = 56               <-- dt is 56 ms, from 123456 - 123400

As shown, I've calculated the dt field from the previous record. The dt=56 means the two
events were logged 56 ms apart.

Is this a common practice, or am I crazy to be doing this in the serializer? Are there performance
or reliability issues that I should be considering?

