hbase-user mailing list archives

From Aleks Laz <al-userhb...@none.at>
Subject Re: Newbie Question about 37TB binary storage on HBase
Date Thu, 27 Nov 2014 23:32:03 GMT
Hi Wilm.

On 27-11-2014 23:41, Wilm Schumacher wrote:
> Hi Aleks ;),
> On 27.11.2014 at 22:27, Aleks Laz wrote:
>> Our application is a nginx/php-fpm/postgresql Setup.
>> The target design is nginx + proxy features / php-fpm / $DB / 
>> $Storage.
>> .) Can I mix HDFS /HBase for binary data storage and data analyzing?
> yes. hbase is perfect for that. For storage it will work (with the
> "MOB-extension") and with map reduce you can do whatever data analyzing
> you want. I assume you do some image processing with the data?!?!

What's the plan about the "MOB-extension"?

 From a development point of view I can build HBase with the 
but from a sysadmin point of view a 'package' (jar, zip, deb, rpm, ...) is 
easier to maintain.

Currently there are no plans to analyse the images, but who knows what 
the future brings.

We need to do some "accesslog" analysis, like piwik or awffull do.
Maybe elasticsearch is a better tool for that?

>> .) What is the preferred way to use HBase with PHP?
> The native client lib is in java. This is the best way to go. But if 
> you
> need only basic access from the php application, then thrift or rest
> would be a good choice.
> http://wiki.apache.org/hadoop/Hbase/ThriftApi
> http://wiki.apache.org/hadoop/Hbase/Stargate

Stargate is a cool name ;-)
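
For illustration, here is a minimal sketch of what "basic access" over the 
REST gateway could look like. It is written in Python only to keep it short; 
the same HTTP calls can be made from PHP. The gateway address 
(localhost:8080), the table name "cams", the column "img:data" and the row 
key are assumptions made up for this example, not part of any real setup. 
The sketch only builds the requests; nothing is sent until urlopen is called.

```python
import urllib.parse
import urllib.request

# Assumption: an HBase REST gateway is reachable at this address.
REST_BASE = "http://localhost:8080"


def cell_url(table, row_key, column, base=REST_BASE):
    """Build the cell path /<table>/<row>/<family:qualifier>."""
    parts = [urllib.parse.quote(p, safe=":") for p in (table, row_key, column)]
    return base.rstrip("/") + "/" + "/".join(parts)


def build_put(table, row_key, column, payload):
    """Prepare a request that stores raw bytes in a single cell."""
    return urllib.request.Request(
        cell_url(table, row_key, column),
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def build_get(table, row_key, column):
    """Prepare a request that reads a single cell back as raw bytes."""
    return urllib.request.Request(
        cell_url(table, row_key, column),
        headers={"Accept": "application/octet-stream"},
    )


def send(request):
    """Send a prepared request and return the response body as bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With a running gateway and an existing table, 
send(build_put("cams", "cam42-20141127-233203", "img:data", image_bytes)) 
would store one image, and send(build_get(...)) with the same key would 
fetch it again.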

> There are language bindings for both
>> .) How difficult is it to use HBase with PHP?
> Depending on what you are trying to do. If you just do a little
> fetching, updating, inserting etc. it's pretty easy. More complicated
> stuff I would do in java and expose through a custom API from a java
> service.
>> .) What's a good solution for the 37 TB or the upcoming ~120 TB to
>> distribute?
>>    [ ] N servers with one 37 TB mountpoint per server?
>>    [ ] N servers with x TB mountpoints per server?
>>    [ ] other:
> that's "not your business". hbase/hadoop does the trick for you. hbase
> distributes the data, replicates it etc.. You will only talk to the 
> master.

Well, but at the end of the day I will need physical storage distributed 
over x servers.

My question is: do I need to make sure that every server has enough storage 
for the whole data set?

As far as I understand, a hadoop client sees a 'filesystem' with 37 TB or 
120 TB, but from the server point of view, how should I plan the storage 
setup for the datanodes?

As from the link below hadoophbase-capacity-planning and


Here are the recommended specifications for DataNode/TaskTrackers in a 
balanced Hadoop cluster:

     12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) 

What happens when a datanode has 20 TB but the whole 2-node hadoop/HBase 
cluster has 40 TB?

I see I'm still new to the hadoop/HBase concepts.
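
A rough sketch of the arithmetic behind this question, in Python. It 
assumes the HDFS default replication factor of 3 and that HDFS never puts 
two replicas of the same block on one datanode. The 25% headroom and the 
24 TB per node (12 disks of 2 TB, within the range quoted above) are 
assumptions picked for the example, and the estimate ignores reserved 
non-DFS space and HBase's own temporary space, so treat the numbers as an 
order of magnitude only.

```python
import math


def raw_tb_needed(data_tb, replication=3, headroom=0.25):
    """Raw disk space needed to hold data_tb with replicas and free headroom."""
    return data_tb * replication / (1.0 - headroom)


def datanodes_needed(data_tb, node_tb, replication=3, headroom=0.25):
    """Number of datanodes of node_tb raw capacity each (rounded up)."""
    return math.ceil(raw_tb_needed(data_tb, replication, headroom) / node_tb)


def usable_tb(node_capacities_tb, replication=3):
    """Rough usable space for equally sized nodes.

    A block gets at most one replica per datanode, so a cluster with
    fewer nodes than the replication factor stores fewer copies.
    """
    effective_replicas = min(replication, len(node_capacities_tb))
    return sum(node_capacities_tb) / effective_replicas


print(datanodes_needed(37, 24))    # today's 37 TB
print(datanodes_needed(120, 24))   # the upcoming ~120 TB
print(usable_tb([20, 20]))         # two datanodes with 20 TB each
```

Under these assumptions no single server has to hold the whole data set: 
the 37 TB end up as about 148 TB of raw disk spread over 7 such nodes. The 
two-node cluster with 40 TB of raw disk can hold roughly 20 TB of data, 
because every block is stored once on each node.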

>> .) Is HBase a good value for $Storage?
> yes ;)
>> .) Is HBase a good value for $DB?
>>     The DB size is smaller than 1 GB; I would use HBase just for the HA
>>     features of Hadoop.
> well, the official documentation says:
> »First, make sure you have enough data. If you have hundreds of 
> millions
> or billions of rows, then HBase is a good candidate. If you only have a
> few thousand/million rows, then using a traditional RDBMS might be a
> better choice ...«

Okay, so for this I will stay on postgresql with pgbouncer.

> In my experience at around 1-10 million rows RDBMS are not really
> useable anymore. But I only used small/cheap hardware ... and don't 
> like
> RDBMS ;).


> Well, you will have at least 40 million rows ... and the platform is
> growing. I think SQL isn't a choice anymore. And as you have heavy read
> and only a few writes hbase is a good fit.

?! Why "40 million rows"? Do you mean the file tables?
The DB only holds some data such as user accounts, the id for a directory 
and so on.

>> .) Due to the fact that HBase is a file-system I could use
>>       /cams , for binary data
>>       /DB   , for DB storage
>>       /logs , for log storage
>>     but is this wise. On the 'disk' they are different RAIDs.
> hbase is a data store. This was probably copy pasted from the original
> hadoop question ;).


>> .) Should I plan a dedicated Network+Card for the 'cluster
>>    communication' as for the most other cluster software?
>>    From what I have read it looks not necessary but from security 
>> point
>>    of view, yes.
> http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
> Cloudera employees say that it wouldn't harm if you have to push a lot
> of data to the cluster.

Okay, so it is like other cluster setups.

>> .) Maybe the communication between the components (hadoop, zk, ...)
>>    could be set up with TLS?
> hbase is built on top of hadoop/hdfs. This is in the "hadoop domain".
> hadoop can encrypt the transported data by TLS, can encrypt the data on
> the disc, you can use kerberos auth (but this stuff I never did) etc.
> etc.. So the answer is yes.


> Last remark: You seem kind of bound to PHP. The hadoop world is written
> in java. Of course there are a lot of ways to do stuff in other
> languages, over interfaces etc. But the java api is the most powerful
> and sometimes there is no other way than to use it directly.

Currently, yes, php is the main language.
I don't know a good solution for php that is similar to hadoop. Does anyone 
else know one?

I will take a look at


to get some ideas for a working solution.

> Best wishes,
> Wilm

Thanks for your feedback.
I will dig deeper into this topic and start to set up the components step 
by step.

BR Aleks
