spark-user mailing list archives

From: Ted Yu <yuzhih...@gmail.com>
Subject: Re: Bulk-load to HBase
Date: Fri, 19 Sep 2014 21:29:51 GMT
Please see http://hbase.apache.org/book.html#completebulkload

LoadIncrementalHFiles has a main() method.
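
Since LoadIncrementalHFiles has a main() and implements Hadoop's Tool interface, it can presumably be driven from your own code through ToolRunner. A minimal sketch follows, not tested against a cluster. It assumes HBase client jars of the 0.98/1.x era on the classpath (where the class lives in org.apache.hadoop.hbase.mapreduce) and that the tool takes the HFile directory and the table name as its two arguments, as the book section above describes. BulkLoadSketch, bulkLoad, hfileDir and tableName are hypothetical names used only for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.util.ToolRunner

object BulkLoadSketch {
  // Runs the same tool the completebulkload command invokes.
  // hfileDir: directory of HFiles written by HFileOutputFormat (hypothetical name).
  // tableName: target HBase table, which must already exist (hypothetical name).
  // Returns the tool's exit code; 0 is assumed to mean success.
  def bulkLoad(conf: Configuration, hfileDir: String, tableName: String): Int =
    ToolRunner.run(conf, new LoadIncrementalHFiles(conf), Array(hfileDir, tableName))

  def main(args: Array[String]): Unit = {
    // Picks up hbase-site.xml from the classpath.
    val conf = HBaseConfiguration.create()
    sys.exit(bulkLoad(conf, args(0), args(1)))
  }
}
```

From a Spark driver, bulkLoad(conf, hfileDir, tableName) would be called after the job that writes the HFiles has finished.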


On Fri, Sep 19, 2014 at 5:41 AM, Aniket Bhatnagar <
aniket.bhatnagar@gmail.com> wrote:

> Agreed that the bulk import would be faster. In my case, I wasn't
> expecting a lot of data to be uploaded to HBase, and I also didn't want to
> take on the pain of importing generated HFiles into HBase. Is there a way to
> invoke the HBase HFile import batch script programmatically?
>
> On 19 September 2014 17:58, innowireless TaeYun Kim <
> taeyun.kim@innowireless.co.kr> wrote:
>
>> In fact, it seems that Put can be used by HFileOutputFormat, so Put
>> object itself may not be the problem.
>>
>> The problem is that TableOutputFormat uses the Put object in the normal
>> way (it goes through the normal write path), while HFileOutputFormat uses
>> it to build the HFile directly.
>>
>>
>>
>> *From:* innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
>> *Sent:* Friday, September 19, 2014 9:20 PM
>>
>> *To:* user@spark.apache.org
>> *Subject:* RE: Bulk-load to HBase
>>
>>
>>
>> Thank you for the example code.
>>
>>
>>
>> Currently I use foreachPartition() + Put(), but your example code can be
>> used to clean up my code.
>>
>>
>>
>> BTW, since the data uploaded by Put() goes through normal HBase write
>> path, it can be slow.
>>
>> So, it would be nice if bulk-load could be used, since it bypasses the
>> write path.
>>
>>
>>
>> Thanks.
>>
>>
>>
>> *From:* Aniket Bhatnagar [mailto:aniket.bhatnagar@gmail.com]
>> *Sent:* Friday, September 19, 2014 9:01 PM
>> *To:* innowireless TaeYun Kim
>> *Cc:* user
>> *Subject:* Re: Bulk-load to HBase
>>
>>
>>
>> I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat
>> instead of HFileOutputFormat. But, hopefully this should help you:
>>
>>
>>
>> val hbaseZookeeperQuorum =
>> s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
>>
>> val conf = HBaseConfiguration.create()
>>
>> conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
>>
>> conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)
>>
>> conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)
>>
>> conf.setClass("mapreduce.outputformat.class",
>> classOf[TableOutputFormat[Object]], classOf[OutputFormat[Object, Writable]])
>>
>> conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
>>
>>
>>
>> // Some RDD that contains row key, column qualifier and data
>> val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ...
>>
>>
>>
>> val putRDD = rddToSave.map(tuple => {
>>
>>     val (rowKey, column, data) = tuple
>>
>>     val put: Put = new Put(rowKey)
>>
>>     put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)
>>
>>
>>
>>     (new ImmutableBytesWritable(rowKey), put)
>>
>> })
>>
>>
>>
>> putRDD.saveAsNewAPIHadoopDataset(conf)
>>
>>
>>
>>
>>
>> On 19 September 2014 16:52, innowireless TaeYun Kim <
>> taeyun.kim@innowireless.co.kr> wrote:
>>
>> Hi,
>>
>>
>>
>> Sorry, I just found saveAsNewAPIHadoopDataset.
>>
>> Then, can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is
>> there any example code for that?
>>
>>
>>
>> Thanks.
>>
>>
>>
>> *From:* innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
>> *Sent:* Friday, September 19, 2014 8:18 PM
>> *To:* user@spark.apache.org
>> *Subject:* RE: Bulk-load to HBase
>>
>>
>>
>> Hi,
>>
>>
>>
>> After reading several documents, it seems that saveAsHadoopDataset cannot
>> use HFileOutputFormat.
>>
>> That is because the saveAsHadoopDataset method uses JobConf, so it belongs
>> to the old Hadoop API, while HFileOutputFormat is a member of the mapreduce
>> package, which is for the new Hadoop API.
>>
>>
>>
>> Am I right?
>>
>> If so, is there another method to bulk-load to HBase from RDD?
>>
>>
>>
>> Thanks.
>>
>>
>>
>> *From:* innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
>> *Sent:* Friday, September 19, 2014 7:17 PM
>> *To:* user@spark.apache.org
>> *Subject:* Bulk-load to HBase
>>
>>
>>
>> Hi,
>>
>>
>>
>> Is there a way to bulk-load to HBase from RDD?
>>
>> HBase offers the HFileOutputFormat class for bulk loading from a MapReduce
>> job, but I cannot figure out how to use it with saveAsHadoopDataset.
>>
>>
>>
>> Thanks.
>>
>>
>>
>
>
