hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: HBase, Hive, Hive over HBase or Pig over HBase
Date Thu, 27 Oct 2011 21:02:25 GMT

It would look something like this:

http://hbase.apache.org/book.html#mapreduce.example.summary

...except your output would go to an RDBMS instead of HBase.
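
To make the shape of that job concrete, here is a minimal sketch of the same summarize-then-write flow. It is not the code from the HBase book and it does not use the HBase or Hadoop APIs: the scanned cells are a hypothetical in-memory list standing in for a table scan, the map and reduce phases are plain Python functions, and sqlite3 stands in for whatever RDBMS you would actually target. The table name `summary` and all function names are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical stand-in for cells scanned from an HBase table:
# (row key, column qualifier, integer value).
SCANNED_CELLS = [
    ("user1", "clicks", 3),
    ("user2", "clicks", 5),
    ("user1", "clicks", 4),
    ("user2", "views", 10),
]


def map_phase(cells):
    """Emit ((row_key, qualifier), value) pairs, one per scanned cell."""
    for row_key, qualifier, value in cells:
        yield (row_key, qualifier), value


def reduce_phase(pairs):
    """Sum the values for each key, as a summing reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def write_summaries(conn, totals):
    """Write the totals to a relational table instead of back into HBase."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summary ("
        "row_key TEXT, metric TEXT, total INTEGER, "
        "PRIMARY KEY (row_key, metric))"
    )
    # INSERT OR REPLACE is SQLite syntax; other databases spell upserts differently.
    conn.executemany(
        "INSERT OR REPLACE INTO summary (row_key, metric, total) VALUES (?, ?, ?)",
        [(key[0], key[1], total) for key, total in totals.items()],
    )
    conn.commit()


def run_summary_job(cells, conn):
    """Run map, reduce, and the relational write in sequence."""
    totals = reduce_phase(map_phase(cells))
    write_summaries(conn, totals)
    return totals


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_summary_job(SCANNED_CELLS, conn)
    query = "SELECT row_key, metric, total FROM summary ORDER BY row_key, metric"
    for row in conn.execute(query):
        print(row)
```

Once the summaries are in a relational table, the ad hoc joins and aggregations can be plain SQL against that table rather than new MapReduce jobs.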


On 10/27/11 2:51 PM, "viva v" <vivamailers@gmail.com> wrote:

>Thanks Doug.
>
>30 million is the size to start with; the growth rate is about 1 million per
>week.
>
>You mention HBase being used to generate summaries into an RDBMS. I am not
>quite sure I understood this approach very well.
>How would you generate the summaries from raw HBase data and update them into
>an RDBMS? Would we need to accomplish this using a MapReduce job, maybe?
>
>Could you please point me to an example use case scenario that has taken
>this approach?
>
>Thanks
>Vivek
>
>On Thu, Oct 27, 2011 at 1:27 AM, Doug Meil
><doug.meil@explorysmedical.com> wrote:
>
>>
>> re: "30 million records."
>>
>> We're obviously pro-HBase on this dist-list but one of the challenges of
>> HBase (and Hadoop in general) is that the architecture can tend to be
>> overkill on smaller datasets.  That doesn't mean you shouldn't try HBase,
>> but expectations should be tempered.
>>
>>
>> Especially with your requirements #5 and #6, RDBMS are actually pretty
>> good at that for smaller volumes, which is why HBase tends to be used to
>> generate summaries into RDBMSs for further slicing and dicing.
>>
>> If you had an arrival rate of 30 million a day or something, then it would
>> be a different story.
>>
>>
>> On 10/26/11 3:31 PM, "viva v" <vivamailers@gmail.com> wrote:
>>
>> >Hi,
>> >
>> >I am working on a use case that has the following characteristics.
>> >1) Data volume is on the order of 30 million records
>> >2) Data schema is known and fixed (for the application we are building)
>> >3) Data is NOT multi-format. A single key will have integer data for
>> >different aspects of that key
>> >4) Data will be incrementally updated (some column values will be updated
>> >at different points in time)
>> >5) There is a need to support ad hoc querying of data (queries are not
>> >known ahead of time), without writing MapReduce jobs
>> >6) Queries are likely to have a lot of joins and aggregations
>> >
>> >Could you please help me with suggestions on whether I should use
>> >1) Hive
>> >2) HBase
>> >3) Hive over HBase
>> >4) Pig over HBase
>> >
>> >Thanks
>> >Vivek
>>
>>


