hbase-user mailing list archives

From Miguel Costa <miguel-co...@telecom.pt>
Subject RE: HBase design schema
Date Mon, 04 Apr 2011 16:37:14 GMT
Ted thanks for your help.

 

I considered the last option that you mentioned, "pushing one of your
dimensions into the key".

 

With that I can get results for that single dimension, for example with the
key Time+Site+Referrer.

But what if I now want the top Keywords (where top can be by any metric) for
that key? Should I have another table with the key
Time+Site+Referrer+Keyword?
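
To make the key question concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes row keys are stored in sorted order, and it shows that a key of Time+Site+Referrer+Keyword can also answer Time+Site+Referrer queries through a prefix scan. The separator, the helper names and the sample values are made up for illustration.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator that never occurs inside a dimension value


def make_key(*parts):
    """Join dimension values into one sortable row key."""
    return SEP.join(parts)


def prefix_scan(sorted_keys, *prefix_parts):
    """Return every key whose leading dimensions equal prefix_parts."""
    prefix = make_key(*prefix_parts) + SEP
    start = bisect_left(sorted_keys, prefix)  # seek to the first candidate
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the matching run has ended
        matches.append(key)
    return matches


# Hypothetical rows keyed by Time+Site+Referrer+Keyword.
keys = sorted([
    make_key("20110404", "site1", "google", "hbase"),
    make_key("20110404", "site1", "google", "hadoop"),
    make_key("20110404", "site1", "bing", "hbase"),
    make_key("20110405", "site1", "google", "hive"),
])

google_rows = prefix_scan(keys, "20110404", "site1", "google")
keywords = [key.split(SEP)[3] for key in google_rows]
```

Under that assumption the longer key makes a separate Time+Site+Referrer table unnecessary for reads, at the cost of scanning more rows per query.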

And if I have 30 more dimensions and I want to cross all of them, the
number of tables will grow exponentially (each dimension times the number of
available dimensions to cross). And this can go several levels deep, for
example to level 5: Time+Site+Referrer+Keyword+Dim4+Dim5.

And updating all those tables will probably take a lot of time.
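
As a rough illustration of that growth, the sketch below counts how many tables would be needed if every combination of up to 5 out of 30 dimensions got its own table. It is only an estimate under that assumption; the function name and the `ordered` flag (which counts each key ordering as a separate table) are hypothetical.

```python
from math import comb, perm


def tables_needed(num_dimensions, max_depth, ordered=False):
    """Count one table per dimension combination up to max_depth levels."""
    count = perm if ordered else comb
    return sum(count(num_dimensions, depth)
               for depth in range(1, max_depth + 1))


# One table per unordered set of dimensions, up to 5 levels deep.
unordered = tables_needed(30, 5)

# One table per distinct key ordering, up to 5 levels deep.
ordered = tables_needed(30, 5, ordered=True)
```

With 30 dimensions and 5 levels this already gives 174,436 tables for unordered combinations, and far more if every key ordering counts.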

 

In Hive it is possible to run these queries if I have these dimensions as
columns, but the problem is that I need results within 3 seconds.

 

Another option I thought of was to have the crossed dimensions as
column families. For example, key: Time+Site+Referrer and column family
Keyword: MyKeyword, where the value would be the metrics that I need,
separated by "\t".
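
A minimal sketch of this layout, modeled with plain Python dictionaries rather than an HBase client. HBase column families are declared when the table is created, while column qualifiers can be added freely per row, so the sketch stores each keyword as a qualifier inside a single Keyword family. The metric names and sample numbers are invented.

```python
from collections import namedtuple

# Hypothetical metrics packed into one cell value, separated by "\t".
Metrics = namedtuple("Metrics", ["visits", "pageviews"])


def encode(metrics):
    """Pack the metrics into a tab-separated cell value."""
    return "\t".join(str(value) for value in metrics)


def decode(cell_value):
    """Unpack a tab-separated cell value back into Metrics."""
    return Metrics(*(int(value) for value in cell_value.split("\t")))


# One row (key Time+Site+Referrer) modeled as {family: {qualifier: value}}.
row = {
    "Keyword": {
        "hbase": encode(Metrics(120, 300)),
        "hadoop": encode(Metrics(80, 410)),
        "hive": encode(Metrics(95, 150)),
    }
}


def top_keywords(row, metric, n):
    """Rank the keyword qualifiers of one row by the chosen metric."""
    cells = {kw: decode(value) for kw, value in row["Keyword"].items()}
    ranked = sorted(cells, key=lambda kw: getattr(cells[kw], metric),
                    reverse=True)
    return ranked[:n]


by_visits = top_keywords(row, "visits", 2)
by_pageviews = top_keywords(row, "pageviews", 2)
```

The ranking happens on the client after reading a single row, so the cost depends on how many distinct keywords one row accumulates.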


 

What do you think is the best approach?

 

Thanks,

 




  


Miguel Costa


DTS - Sapo Technology Department 
Web Analytics 
Tm: +351 92 672 60 54
miguel.costa@telecom.pt

 

 

 

 

From: Ted Dunning [mailto:tdunning@maprtech.com] 
Sent: Monday, 4 April 2011 17:25
To: user@hbase.apache.org
Cc: Miguel Costa
Subject: Re: HBase design schema

 

 

Miguel,

 

One option is to use the simplest design and use the key you have.  Scanning
for a particular period of time will give you all the data in that time
period which you can reduce in any way that you like.
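
A small sketch of this scan-then-reduce approach, using an in-memory list in place of an HBase table. The row layout (one record per Site+Time key with referrer, keyword and visits columns) and all sample values are assumptions made for the example.

```python
from collections import Counter

# Hypothetical rows keyed by (site, time); the columns are invented.
rows = [
    (("site1", "2011-04-04T09:00"), {"referrer": "google", "keyword": "hbase", "visits": 5}),
    (("site1", "2011-04-04T10:00"), {"referrer": "bing", "keyword": "hbase", "visits": 2}),
    (("site1", "2011-04-04T11:00"), {"referrer": "google", "keyword": "hbase", "visits": 4}),
    (("site1", "2011-04-04T12:00"), {"referrer": "google", "keyword": "hive", "visits": 7}),
    (("site1", "2011-04-05T09:00"), {"referrer": "bing", "keyword": "hbase", "visits": 20}),
    (("site2", "2011-04-04T09:30"), {"referrer": "bing", "keyword": "hbase", "visits": 9}),
]
rows.sort(key=lambda row: row[0])


def scan(rows, site, start, stop):
    """Yield the columns of rows for one site with start <= time < stop.

    A real range scan would seek to the start key; this linear filter
    only models which rows such a scan returns.
    """
    for (row_site, time), columns in rows:
        if row_site == site and start <= time < stop:
            yield columns


def top_referrers(rows, site, start, stop, keyword, n):
    """Reduce the scanned rows to the top n referrers for one keyword."""
    totals = Counter()
    for columns in scan(rows, site, start, stop):
        if columns["keyword"] == keyword:
            totals[columns["referrer"]] += columns["visits"]
    return totals.most_common(n)


result = top_referrers(rows, "site1", "2011-04-04T00:00",
                       "2011-04-05T00:00", "hbase", 2)
```

Any other crossing of dimensions only changes the reduce step, not the table.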

 

If that becomes too inefficient, a common trick is to build a secondary file
that contains aggregated data at lower time resolution.
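
One way to picture such a secondary aggregate is the sketch below, which rolls hourly counts up into daily counts. The key layout (site, hour, referrer) and the numbers are hypothetical.

```python
from collections import defaultdict


def roll_up_to_days(hourly):
    """Sum counts keyed by (site, 'YYYY-MM-DDTHH', referrer) into days."""
    daily = defaultdict(int)
    for (site, hour, referrer), visits in hourly.items():
        day = hour[:10]  # keep only the 'YYYY-MM-DD' part
        daily[(site, day, referrer)] += visits
    return dict(daily)


# Hypothetical hourly counts.
hourly = {
    ("site1", "2011-04-04T09", "google"): 5,
    ("site1", "2011-04-04T10", "google"): 4,
    ("site1", "2011-04-04T10", "bing"): 2,
    ("site1", "2011-04-05T09", "google"): 1,
}

daily = roll_up_to_days(hourly)
```

Queries over long periods can then read the daily data and fall back to the detailed table only for partial days.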

 

Another trick is to copy your original table, pushing one of your dimensions
into the key.  That will help by preventing you from scanning through data
you don't care about.  The space consumed is not so far off what an index in
a conventional database would consume.
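
The copy step could look roughly like the sketch below, again with an in-memory list standing in for the tables. It assumes the original rows are keyed by (site, time) and carry a referrer column; the function names and sample rows are invented.

```python
from bisect import bisect_left


def rekey_by_referrer(rows):
    """Copy rows keyed (site, time) into rows keyed (site, referrer, time)."""
    copied = [((site, columns["referrer"], time), columns)
              for (site, time), columns in rows]
    copied.sort(key=lambda row: row[0])
    return copied


def scan_prefix(table, site, referrer):
    """Read only the contiguous run of rows for one (site, referrer)."""
    keys = [key for key, _ in table]
    index = bisect_left(keys, (site, referrer, ""))  # seek to the run start
    matches = []
    while index < len(table) and table[index][0][:2] == (site, referrer):
        matches.append(table[index])
        index += 1
    return matches


# Hypothetical original table keyed by (site, time).
original = [
    (("site1", "09:00"), {"referrer": "google", "visits": 5}),
    (("site1", "10:00"), {"referrer": "bing", "visits": 2}),
    (("site1", "11:00"), {"referrer": "google", "visits": 4}),
    (("site1", "12:00"), {"referrer": "yahoo", "visits": 1}),
]

by_referrer = rekey_by_referrer(original)
google_rows = scan_prefix(by_referrer, "site1", "google")
```

In the copy, a query for one referrer touches 2 rows instead of all 4, which is the saving described above.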

 

In general, it is important to keep in mind that HBase doesn't have
conventional relational indexes so lots of the design considerations that
motivate star schemas don't really apply.

On Mon, Apr 4, 2011 at 9:12 AM, Miguel Costa <miguel-costa@telecom.pt>
wrote:

Hi,

 

I need some help with a schema design on HBase.

 

I have 5 dimensions (Time, Site, Referrer, Keyword, Country).

My row key is Site+Time.

 

Now I want to answer questions like: what is the top Referrer by Keyword
for a site over a period of time?

Basically I want to cross all the dimensions that I have. And what if I have
30 dimensions?

 

What is the best schema design?

 

Please let me know if this isn't the right mailing list.

 

Thank you for your time.

 

Miguel

 

 

 

