hbase-user mailing list archives

From Lars George <l...@worldlingo.com>
Subject Re: Low-cost and fast HTable.exists(...)?
Date Mon, 10 Nov 2008 08:31:22 GMT
Thanks Jonathan & Michael,

No, I do not need incrementing columns. All I need is to increment a 
counter of how many documents (cells) I have in a HBase table. The 
initial count comes from a MR job counting them all and setting up the 
counters in a memcachedb instance. Then when someone inserts a new 
document - as opposed to replacing an existing one - I want to increase 
that counter.

Overall it seems like I have to go with the get() == null method, since 
everything else seems like too much work for what I need. I just 
thought it would be easy to save some resources. Even a normal get() 
that returned only a boolean flag from the server, instead of the value, 
would have seemed a little bit better.
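
A minimal sketch of the get() == null approach described above. InMemoryTable and DocumentCounter are hypothetical stand-ins (not the HBase or memcachedb APIs), used only so the example is self-contained; an AtomicLong plays the role of the external counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Counts documents by bumping a counter only when a cell is new. */
class DocumentCounter {
    private final InMemoryTable table;
    private final AtomicLong counter; // stand-in for the external counter

    DocumentCounter(InMemoryTable table, long initialCount) {
        this.table = table;
        this.counter = new AtomicLong(initialCount);
    }

    /** Inserts or replaces a cell; increments the counter only on insert. */
    void store(String row, String column, byte[] value) {
        if (table.get(row, column) == null) {
            counter.incrementAndGet();
        }
        table.put(row, column, value);
    }

    long count() {
        return counter.get();
    }

    public static void main(String[] args) {
        DocumentCounter docs = new DocumentCounter(new InMemoryTable(), 0);
        docs.store("row1", "doc:body", new byte[] {1});
        docs.store("row1", "doc:body", new byte[] {2}); // replacement, not counted
        System.out.println(docs.count());
    }
}

/** Hypothetical in-memory table; get() returns null for a missing cell. */
class InMemoryTable {
    private final Map<String, byte[]> cells = new HashMap<>();

    byte[] get(String row, String column) {
        return cells.get(row + "/" + column);
    }

    void put(String row, String column, byte[] value) {
        cells.put(row + "/" + column, value);
    }
}
```

Note that the check and the put are two separate steps, so two writers inserting the same new cell at the same time could both increment the counter.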

Thanks again,
Lars


Jonathan Gray wrote:
> Unfortunately, presence in a bloom filter does not guarantee its existence.  So if you
> need to be absolutely certain, the bloom filter can only tell you whether it definitely does
> not exist (or that it probably exists, with the false positive error rate based on the size
> and parameters of the bloom).  If it is found in the bloom, you would still need to hit the
> actual record to be sure.  Bloom filters make things faster when you miss or if you can tolerate
> false-positives.
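
The point about bloom filters can be sketched as follows. This is an illustrative toy filter, not HBase's implementation; the class name, the hash choice (seeded FNV-1a with double hashing), and the exists() helper are all assumptions made for the example. A "no" from the filter is final, while a "yes" still has to be confirmed against the real data.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Toy Bloom filter: no false negatives, possible false positives. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means "definitely absent"; true means "probably present". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int index(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x9E3779B9) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    /** Filter first; on a hit, confirm against the actual data. */
    static boolean exists(SimpleBloomFilter filter, Set<String> table, String key) {
        if (!filter.mightContain(key)) {
            return false; // definite miss, no lookup needed
        }
        return table.contains(key); // a hit may be a false positive
    }

    public static void main(String[] args) {
        Set<String> table = new HashSet<>(); // stand-in for the real records
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        table.add("row1/doc:body");
        filter.add("row1/doc:body");
        System.out.println(exists(filter, table, "row1/doc:body"));
        System.out.println(exists(filter, table, "row2/doc:body"));
    }
}
```

If most lookups are hits, as in the counting use case, almost every check falls through to the real lookup and the filter saves little.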
>
> You can of course save the expense of carrying the data through the network back to the
> client, as an optimization.
>
> Perhaps we need to get more at the real use case you're talking about.  You want to insert
> each new column in a row/family with an incrementing column name?  Do you just need the ordering
> or do you need the indexes as well?  With your current design, do you start at 0 and keep
> checking until you miss?  If so, you cannot tolerate false-positives and expect more hits
> than misses; bloom filters will not help.
>
> I have done some work with incrementing ids here:  https://issues.apache.org/jira/browse/HBASE-803.
>
> Take a look at the code for incrementFamilyColumn(); it should not be too difficult to modify
> it to get the functionality you want.  This implementation would make the work transparent
> to your client and much faster, as things will be processed in the regionserver local to the
> row; your latency/network cost would then only be for the put.
>
> JG
>
> -----Original Message-----
> From: stack [mailto:stack@duboce.net] 
> Sent: Sunday, November 09, 2008 1:50 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Low-cost and fast HTable.exists(...)?
>
> Lars George wrote:
>   
>> Hi,
>>
>> I was wondering if there is a low cost (as in memory) and fast way to 
>> check if a certain cell already exists? I need to insert a cell, but 
>> based on if it was there before or not increase a counter (as in total 
>> number of entries in a table). 
>>     
>
> Does the count of elements have to be up-to-date?  Why not just scan the 
> table every hour or so to get a count?  (Scans are fast in 0.19.0. Seven 
> times faster than they were in 0.17.x and probably 100 times faster than 
> what they are in 0.1.3 -- smile).
>
>   
>> I see that HTable.get(...) returns the byte array, means there are 
>> memory, reading and network streaming involved. 
>>     
>
> Yes.
>
>   
>> So if I do a
>>
>> if (table.get(row, col) == null) { incr(counter); }
>> table.put(...);
>>
>> this seems like a waste of resources and may not be as fast as a true
>>
>> if (!table.exists(row, col))  { incr(counter); }
>> table.put(...)
>>
>>     
> It's tough.  Ideal would be a bloom filter on the column.  You'd check 
> for presence of a Cell in bloom filter.  It'd come back yes/no.  Would 
> be an in-memory test but would involve a network trip (Maybe have a 
> client-side bloomfilter too?  So, if exists, would save the network trip?).
>
> The hard part about bloom filter though is that you would have to specify 
> exact coordinates as in exact row/column/timestamp.  The row/column part 
> is easy but the timestamp less-so.  When you insert, you probably do not 
> specify a timestamp letting the system set the timestamp to now.  If you 
> then want to test existence in a bloomfilter, how are you going to do it if 
> you don't have the exact timestamp?  So, you end up using the hbase 
> get(row, column) because it will return the latest insert if no 
> timestamp specified.
>
> Otherwise, looks like you would be happy with a bloomfilter that just 
> recorded the row and column and not timestamp.  That'd work.   I think 
> this is how bloomfilters work now in latest hbase.  We need to check.  
> They used to be row/column/timestamp (They are broken till we release 
> 0.19.0 though -- in about a month).
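
The timestamp problem can be illustrated with any exact-match membership structure. The sketch below uses a plain HashSet as a stand-in for a bloom filter; the key formats and the MembershipKeys class are hypothetical and only meant to show why a probe needs the same coordinates that were used on insert.

```java
import java.util.HashSet;
import java.util.Set;

/** Compares membership keys with and without the timestamp component. */
class MembershipKeys {
    static String withTimestamp(String row, String column, long timestamp) {
        return row + "/" + column + "/" + timestamp;
    }

    static String withoutTimestamp(String row, String column) {
        return row + "/" + column;
    }

    public static void main(String[] args) {
        // The server assigns "now" on insert; the client never sees this value.
        long assigned = System.currentTimeMillis();

        Set<String> keyedWithTimestamp = new HashSet<>();
        Set<String> keyedByRowColumn = new HashSet<>();
        keyedWithTimestamp.add(withTimestamp("row1", "doc:body", assigned));
        keyedByRowColumn.add(withoutTimestamp("row1", "doc:body"));

        // A later probe can only guess the timestamp, so the exact key differs.
        long guess = assigned + 1;
        System.out.println(keyedWithTimestamp.contains(withTimestamp("row1", "doc:body", guess)));

        // Keyed on row/column only, the probe needs nothing it does not know.
        System.out.println(keyedByRowColumn.contains(withoutTimestamp("row1", "doc:body")));
    }
}
```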
>   
>> It looks like this is easily doable since get() also delegates to the 
>> region servers.
>>
>>     
>
>   
>> Am I missing something? Assuming HTable is sort of a Set 
>> implementation I am confused as to why this check is missing.
>>
>>     
> Well, it's not that straight-forward.  The only place to check presence 
> of a column is by actually asking hbase and letting it check its 
> memcache and then all of its storefiles.  This is the only way to see if a 
> row/column combination exists.  There is no short-circuit, say, a Set 
> that holds all row/column combinations because it could be massive if a 
> row had millions of columns (nothing to prevent this happening).
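
A deliberately simplified model of the lookup order described above: the in-memory map first, then each store file. The LayeredLookup class and the use of plain maps are assumptions for illustration; timestamps, deletes, and on-disk formats are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified existence check across an in-memory map and store files. */
class LayeredLookup {
    static boolean exists(Map<String, byte[]> memcache,
                          List<Map<String, byte[]>> storeFiles,
                          String key) {
        if (memcache.containsKey(key)) {
            return true; // found in memory, no files consulted
        }
        for (Map<String, byte[]> file : storeFiles) {
            if (file.containsKey(key)) {
                return true;
            }
        }
        return false; // a miss has to consult every layer
    }

    public static void main(String[] args) {
        Map<String, byte[]> memcache = new HashMap<>();
        Map<String, byte[]> olderFile = new HashMap<>();
        olderFile.put("row1/doc:body", new byte[] {1});
        List<Map<String, byte[]>> storeFiles = new ArrayList<>();
        storeFiles.add(olderFile);

        System.out.println(exists(memcache, storeFiles, "row1/doc:body"));
        System.out.println(exists(memcache, storeFiles, "row2/doc:body"));
    }
}
```

In this model a miss is the expensive case, since every layer is checked before the answer is known.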
>
> Is the lookup taking too long?  In 0.19.0, the speeds are all up.  There 
> is a cache of file blocks maintained in the server.  If you can hit the 
> cache, then you can see lookup rates double and even quadruple.
>
> Would a bloom filter in your client help?
>
> St.Ack
>
>   
