hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Best way to write to multiple tables in one map-only job
Date Tue, 04 Oct 2011 21:49:02 GMT
From the code I gave the link to:

https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.java#L102

Hope this helps,

J-D

On Tue, Oct 4, 2011 at 7:20 AM, Christopher Dorner
<christopher.dorner@gmail.com> wrote:
> Thank you for the hint.
>
> What about autoflush then? Is that also something I can set using the config
> on job setup? Or does it only work with an HTable instance? Somehow I can't
> really find the right information :)
>
> Regards,
> Christopher
>
> On 03.10.2011 19:20, Jean-Daniel Cryans wrote:
>>
>> Options a) and b) are the same, since MultiTableOutputFormat internally
>> uses multiple HTables. See for yourself:
>>
>>
>> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.java
>>
>> Also, you can set the write buffer by setting
>> hbase.client.write.buffer on the configuration that you pass in the
>> job setup.
>>
>> Using HTablePool in a single-threaded application doesn't offer anything
>> more than storage for your HTables.
>>
>> Hope that helps,
>>
>> J-D
>>
>> On Sat, Oct 1, 2011 at 4:05 AM, Christopher Dorner
>> <christopher.dorner@gmail.com>  wrote:
>>>
>>> Hello,
>>>
>>> I am building an RDF store using HBase and experimenting with different
>>> index tables and schema designs.
>>>
>>> For the input, I have a file where each line is an RDF triple in N3 format.
>>>
>>> I need to write to multiple tables since I need to build several index
>>> tables. To reduce IO and avoid reading the file several times, I want to
>>> do that in one map-only job. Later the file will contain a few million
>>> triples.
>>>
>>> I am experimenting in pseudo-distributed mode so far but will be able to
>>> run it on our cluster soon.
>>> Storing the data in the tables does not need to be speed-optimized at any
>>> cost, but I want to do it as simply and quickly as possible.
>>>
>>>
>>> What is the best way to write to more than one table in one map task?
>>>
>>> a)
>>> I can either use "MultiTableOutputFormat.class" and write in map() using:
>>> Put put = new Put(key);
>>> put.add(kv);
>>> context.write(tableName, put);
>>>
>>> Can I write to e.g. 6 tables in this way by creating a new Put for each
>>> table?
>>>
>>> But how can I turn off autoFlush and set writeBufferSize in this case?
>>> I think autoflush is not a good fit when putting lots of values.
>>>
>>>
>>> b)
>>> I can use an instance of HTable in the Mapper class. Then I can set
>>> autoFlush and writeBufferSize and write to the table using:
>>> HTable table = new HTable(config, tableName);
>>> table.put(put);
>>>
>>> But it is recommended to use only one instance of HTable, so I would need
>>> to do
>>> table = new HTable(config, tableName);
>>> for each table I want to write to. Is that still fine with 6 tables?
>>> I stumbled upon HTablePool. Is it meant for these scenarios?
>>>
>>>
>>> Thank You and Regards,
>>> Christopher
>>>
>
>
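
The point made in the thread, that an output format writing to several tables
is equivalent to holding one HTable per table in the mapper, can be illustrated
with a small plain-Java model. This is only a sketch under stated assumptions:
it does not use the HBase API, every class name in it is hypothetical, the
table names "spo" and "pos" are made-up index tables, and the buffer limit
counts puts, whereas hbase.client.write.buffer is a size in bytes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only: these classes are hypothetical and are NOT HBase APIs.

/** Stands in for one table handle with autoflush off and a client-side buffer. */
class BufferedTableModel {
    private final int bufferLimit;                         // number of puts held before a flush
    private final List<String> buffer = new ArrayList<>(); // puts not yet sent
    private final List<String> stored = new ArrayList<>(); // puts that reached the "server"
    private int flushes = 0;

    BufferedTableModel(int bufferLimit) {
        this.bufferLimit = bufferLimit;
    }

    void put(String row) {
        buffer.add(row);
        if (buffer.size() >= bufferLimit) {
            flush();
        }
    }

    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        stored.addAll(buffer);
        buffer.clear();
        flushes++;
    }

    int storedCount() { return stored.size(); }
    int flushCount() { return flushes; }
}

/** Stands in for a multi-table writer: one lazily created handle per table name. */
class MultiTableWriterModel {
    private final Map<String, BufferedTableModel> tables = new HashMap<>();
    private final int bufferLimit;

    MultiTableWriterModel(int bufferLimit) {
        this.bufferLimit = bufferLimit;
    }

    void write(String tableName, String row) {
        // The first write to a table name creates its handle; later writes reuse it.
        tables.computeIfAbsent(tableName, n -> new BufferedTableModel(bufferLimit)).put(row);
    }

    void close() {
        // Flush whatever is still buffered, for every table that was opened.
        for (BufferedTableModel t : tables.values()) {
            t.flush();
        }
    }

    BufferedTableModel table(String tableName) { return tables.get(tableName); }
}

class MultiTableWriterDemo {
    public static void main(String[] args) {
        MultiTableWriterModel out = new MultiTableWriterModel(2);
        out.write("spo", "row1");
        out.write("pos", "row1");
        out.write("spo", "row2"); // fills the "spo" buffer, which triggers a flush
        System.out.println(out.table("spo").storedCount()); // prints 2
        System.out.println(out.table("pos").storedCount()); // prints 0
        out.close();
        System.out.println(out.table("pos").storedCount()); // prints 1
    }
}
```

In this model, each table name gets its own buffer, so writing to six tables
simply means six handles, and nothing buffered is lost as long as close() runs
at the end of the task.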
