spark-user mailing list archives

From Soumitra Kumar <kumar.soumi...@gmail.com>
Subject Re: Bulk-load to HBase
Date Sat, 20 Sep 2014 04:44:01 GMT
I successfully did this once.

Map the RDD to RDD[(ImmutableBytesWritable, KeyValue)], then:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
val job = new Job(conf, "CEF2HFile")
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val table = new HTable(conf, "output")
HFileOutputFormat.configureIncrementalLoad(job, table)
// rdd: the RDD[(ImmutableBytesWritable, KeyValue)] from above, sorted by row key;
// the value class is KeyValue (not Put) to match the RDD, and job.getConfiguration
// carries the settings made by configureIncrementalLoad.
rdd.saveAsNewAPIHadoopFile("hdfs://localhost.localdomain:8020/user/cloudera/spark",
  classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat],
  job.getConfiguration)

Then I run
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/cloudera/spark output
to load the HFiles into HBase.

----- Original Message -----
From: "Ted Yu" <yuzhihong@gmail.com>
To: "Aniket Bhatnagar" <aniket.bhatnagar@gmail.com>
Cc: "innowireless TaeYun Kim" <taeyun.kim@innowireless.co.kr>, "user" <user@spark.apache.org>
Sent: Friday, September 19, 2014 2:29:51 PM
Subject: Re: Bulk-load to HBase


Please see http://hbase.apache.org/book.html#completebulkload 
LoadIncrementalHFiles has a main() method. 
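
For reference, LoadIncrementalHFiles can also be driven directly from code rather than through its main(); a minimal sketch, reusing the HFile directory and table name that appear elsewhere in this thread:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()
// doBulkLoad moves the generated HFiles into the regions of the target table.
new LoadIncrementalHFiles(conf).doBulkLoad(
  new Path("/user/cloudera/spark"), new HTable(conf, "output"))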


On Fri, Sep 19, 2014 at 5:41 AM, Aniket Bhatnagar <aniket.bhatnagar@gmail.com> wrote:
Agreed that the bulk import would be faster. In my case, I wasn't expecting a lot of data to be uploaded to HBase, and I also didn't want to take the pain of importing the generated HFiles into HBase. Is there a way to invoke HBase's HFile bulk-import script programmatically?

On 19 September 2014 17:58, innowireless TaeYun Kim <taeyun.kim@innowireless.co.kr> wrote:
In fact, it seems that Put can be used with HFileOutputFormat, so the Put object itself may not be the problem.

The problem is that TableOutputFormat uses the Put object in the normal way (it goes through the normal write path), while HFileOutputFormat uses it to build the HFile directly.

From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
Sent: Friday, September 19, 2014 9:20 PM
To: user@spark.apache.org
Subject: RE: Bulk-load to HBase
Thank you for the example code.

Currently I use foreachPartition() + Put(), but your example code can be used to clean up mine.
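
For reference, a minimal sketch of that foreachPartition() + Put() pattern; the RDD, table and column names here are hypothetical stand-ins:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val cf = Bytes.toBytes("f")   // hypothetical column family
val qual = Bytes.toBytes("q") // hypothetical qualifier

// rdd: RDD[(Array[Byte], Array[Byte])] of (row key, value) -- hypothetical
rdd.foreachPartition { partition =>
  // HTable and Configuration are not serializable, so create them
  // inside each partition rather than on the driver.
  val conf = HBaseConfiguration.create()
  val table = new HTable(conf, "output")
  partition.foreach { case (rowKey, value) =>
    val put = new Put(rowKey)
    put.add(cf, qual, value)
    table.put(put) // goes through the normal (WAL + memstore) write path
  }
  table.close() // flushes the client-side write buffer
}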



BTW, since data uploaded with Put() goes through the normal HBase write path, it can be slow.

So it would be nice if bulk-load could be used, since it bypasses that write path.

Thanks.

From: Aniket Bhatnagar [mailto:aniket.bhatnagar@gmail.com]
Sent: Friday, September 19, 2014 9:01 PM
To: innowireless TaeYun Kim
Cc: user
Subject: Re: Bulk-load to HBase
I have been using saveAsNewAPIHadoopDataset, but with TableOutputFormat instead of HFileOutputFormat. Hopefully this helps you anyway:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.OutputFormat
import org.apache.spark.rdd.RDD

// zookeeperHost, zookeeperPort, zookeeperHbasePath, tableName and
// COLUMN_FAMILY_RAW_DATA_BYTES are defined elsewhere in my code.
val hbaseZookeeperQuorum = s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)
conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)
conf.setClass("mapreduce.outputformat.class", classOf[TableOutputFormat[Object]],
  classOf[OutputFormat[Object, Writable]])
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

// Some RDD that contains row key, column qualifier and data
val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ...

val putRDD = rddToSave.map(tuple => {
  val (rowKey, column, data) = tuple
  val put: Put = new Put(rowKey)
  put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)
  (new ImmutableBytesWritable(rowKey), put)
})

putRDD.saveAsNewAPIHadoopDataset(conf)

On 19 September 2014 16:52, innowireless TaeYun Kim <taeyun.kim@innowireless.co.kr> wrote:

Hi,

Sorry, I just found saveAsNewAPIHadoopDataset. Can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there any example code for that?

Thanks.
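
One way this can work (a sketch, not taken from the thread): HFileOutputFormat is a FileOutputFormat, so the Configuration passed to saveAsNewAPIHadoopDataset must already carry the output format, the key/value classes, and the output directory; saveAsNewAPIHadoopFile fills those in from its arguments, which is why it is the shorter route. The RDD name hfileRdd below is hypothetical:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val conf = HBaseConfiguration.create()
val job = new Job(conf)
// configureIncrementalLoad sets the output format and key/value classes
// (plus per-family compression/bloom settings) on the job.
HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "output"))
FileOutputFormat.setOutputPath(job, new Path("/user/cloudera/spark"))
// hfileRdd: RDD[(ImmutableBytesWritable, KeyValue)], sorted by row key
hfileRdd.saveAsNewAPIHadoopDataset(job.getConfiguration)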

From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
Sent: Friday, September 19, 2014 8:18 PM
To: user@spark.apache.org
Subject: RE: Bulk-load to HBase
Hi,

After reading several documents, it seems that saveAsHadoopDataset cannot use HFileOutputFormat. That's because the saveAsHadoopDataset method takes a JobConf and so belongs to the old Hadoop API, while HFileOutputFormat is in the mapreduce package, which is for the new Hadoop API.

Am I right?

If so, is there another method to bulk-load to HBase from an RDD?

Thanks.
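
That reading matches the Spark 1.x API; for reference:

// org.apache.spark.rdd.PairRDDFunctions (Spark 1.x):
//   def saveAsHadoopDataset(conf: org.apache.hadoop.mapred.JobConf): Unit
//   def saveAsNewAPIHadoopDataset(conf: org.apache.hadoop.conf.Configuration): Unit
// HFileOutputFormat lives in org.apache.hadoop.hbase.mapreduce (the new API),
// so only the "NewAPI" save methods can work with it.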

From: innowireless TaeYun Kim [mailto:taeyun.kim@innowireless.co.kr]
Sent: Friday, September 19, 2014 7:17 PM
To: user@spark.apache.org
Subject: Bulk-load to HBase
Hi,

Is there a way to bulk-load to HBase from an RDD?

HBase offers the HFileOutputFormat class for bulk loading from a MapReduce job, but I cannot figure out how to use it with saveAsHadoopDataset.

Thanks.
