hbase-user mailing list archives

From Aleks Laz <al-userhb...@none.at>
Subject Re: Newbie Question about 37TB binary storage on HBase
Date Fri, 28 Nov 2014 00:32:36 GMT
Dear Wilm and Ted,

Thanks for your input and ideas.

I will step back now and learn more about big data and big storage so
that I can discuss this further.

Cheers, Aleks

On 28.11.2014 at 01:20, Wilm Schumacher wrote:
> On 28.11.2014 at 00:32, Aleks Laz wrote:
>> What's the plan for the "MOB extension"?
> https://issues.apache.org/jira/browse/HBASE-11339
> 
>> From a development point of view I can build HBase with the
>> "MOB extension" myself, but from a sysadmin point of view a
>> 'package' (jar, zip, deb, rpm, ...) is much easier to maintain.
> that's true :/
> 
>> We need to do some "access log" analysis, like Piwik or AWFFull.
> I see. Well, this is of course possible, too.
> 
>> Maybe Elasticsearch is a better tool for that?
> I used Elasticsearch for full-text search. Works very well :D. Loved
> it. But I have never used it as a primary database, and I don't see
> an advantage to using ES here.
> 
>> As far as I have understood, the Hadoop client sees a 'filesystem'
>> with 37 TB or 120 TB, but from the server point of view, how should
>> I plan the storage/server setup for the datanodes?
> Now I get your question. If you have a replication factor of 3 (so
> every piece of data is held three times by the cluster), then the
> aggregated storage has to be at least 3 times the 120 TB (plus
> buffer, operating system, etc.). So you could use 360 nodes with
> 1 TB each, or 3 nodes with 120 TB each.
> 
>> What happens when a datanode has 20 TB but the whole two-node
>> Hadoop/HBase cluster has 40?
> Well, if it is in a cluster with enough 20 TB nodes, nothing. HBase
> distributes the data over the nodes.
> 
>> ?! Why "40 million rows"? Do you mean the file tables?
>> The DB only contains some data like user accounts, IDs for
>> directories, and so on.
> If you use HBase as the primary storage, every file would be a row.
> Think of a "blob" in an RDBMS. 40 million files => 40 million rows.
> 
> Assume you create an access log for the 40 million files, every file
> is accessed 100 times, and every access is a row in a separate
> "access log" table => 4 billion rows ;).
> 
>> Currently, yes, PHP is the main language.
>> I don't know of a good solution for PHP similar to Hadoop; does
>> anyone else know one?
> Well, the basic stuff can be done via Thrift/REST with a native PHP
> binding. It depends on what you are trying to do. If it's just CRUD
> plus some scanning and filtering, Thrift/REST should be enough. But
> as you said ... who knows what the future brings. If you want to do
> the fancy stuff, you should use Java and deliver the data to your
> PHP application.
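> 
> For example, over the REST gateway ("Stargate") even a plain HTTP
> client can fetch a single cell; the host, port, table and column
> below are just assumptions matching the sketch above:
> 
>   import java.io.ByteArrayOutputStream;
>   import java.io.InputStream;
>   import java.net.HttpURLConnection;
>   import java.net.URL;
> 
>   public class RestFetch {
>       public static void main(String[] args) throws Exception {
>           // gateway started with: hbase rest start -p 8080
>           URL url = new URL("http://localhost:8080/files/file-00000001/f:data");
>           HttpURLConnection con = (HttpURLConnection) url.openConnection();
>           // ask for the raw cell value instead of the JSON/XML wrapping
>           con.setRequestProperty("Accept", "application/octet-stream");
>           ByteArrayOutputStream out = new ByteArrayOutputStream();
>           try (InputStream in = con.getInputStream()) {
>               byte[] buf = new byte[8192];
>               int n;
>               while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
>           }
>           System.out.println("fetched " + out.size() + " bytes");
>       }
>   }
> 
> A PHP client would do the same with curl; the URL scheme is
> /<table>/<row>/<family>:<qualifier>.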
> 
> Just for completeness: there is HiveQL, too. This is a kind of "SQL
> for Hadoop". There is a Hive client for PHP (as it is delivered by
> Thrift):
> https://cwiki.apache.org/confluence/display/Hive/HiveClient
> 
> Another fitting option for your access log could be Cassandra.
> Cassandra is good at write performance, which is why it is often
> used for logging. Cassandra has an "SQL-like" language called CQL.
> From PHP this works almost like a normal RDBMS: prepared statements
> and all that stuff.
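> 
> A minimal sketch of such an access log with the DataStax Java driver
> (keyspace, table and columns are made up; the PHP drivers look very
> similar):
> 
>   import com.datastax.driver.core.Cluster;
>   import com.datastax.driver.core.PreparedStatement;
>   import com.datastax.driver.core.Session;
>   import java.util.Date;
> 
>   public class LogAccess {
>       public static void main(String[] args) {
>           Cluster cluster = Cluster.builder()
>                   .addContactPoint("127.0.0.1").build();
>           Session session = cluster.connect("logs"); // assumed keyspace
>           PreparedStatement ps = session.prepare(
>               "INSERT INTO access_log (file_id, ts, client_ip)"
>               + " VALUES (?, ?, ?)");
>           session.execute(ps.bind("file-00000001", new Date(), "203.0.113.7"));
>           cluster.close();
>       }
>   }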
> 
> But I think this is the wrong way around. You should select a
> technology first and then choose the language/interfaces etc. And
> if you choose HBase (where Java is a good choice) and you use nginx
> (where PHP is a good choice), the only remaining task is to deliver
> the data from A to B and back.
> 
> Best wishes,
> 
> Wilm
