Thank you for the example code.


Currently I use foreachPartition() + Put(), but your example code can be used to clean up my code.


BTW, since the data uploaded by Put() goes through the normal HBase write path, it can be slow.

So, it would be nice if bulk-load could be used, since it bypasses the write path.
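
For reference, here is a rough, untested sketch of what a bulk-load from an RDD might look like. It assumes an HBase 0.9x/1.x-era client on Hadoop 2, where HFileOutputFormat.configureIncrementalLoad(Job, HTable) and LoadIncrementalHFiles.doBulkLoad(Path, HTable) are available; newer HBase releases use HFileOutputFormat2 with different signatures. The names BulkLoadSketch, bulkLoad, bytesOrdering, family and hfileDir are made up for illustration. The idea is to write sorted KeyValues as HFiles with saveAsNewAPIHadoopFile and then hand the files to the region servers.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object BulkLoadSketch {
  // HFiles must be written in ascending (row key, qualifier) order,
  // so compare the raw bytes the same way HBase does.
  implicit val bytesOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
  }

  // rows holds (row key, column qualifier, value); a single column family is assumed.
  def bulkLoad(rows: RDD[(Array[Byte], Array[Byte], Array[Byte])],
               tableName: String,
               family: Array[Byte],
               hfileDir: String): Unit = {
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, tableName)
    try {
      // The Job is only used to carry the configuration that
      // configureIncrementalLoad prepares (compression, output format, ...).
      val job = Job.getInstance(conf)
      job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setMapOutputValueClass(classOf[KeyValue])
      HFileOutputFormat.configureIncrementalLoad(job, table)

      // Sort on the raw byte arrays first, then wrap them in HBase types.
      val cells = rows
        .map { case (rowKey, qualifier, value) => ((rowKey, qualifier), value) }
        .sortByKey()
        .map { case ((rowKey, qualifier), value) =>
          (new ImmutableBytesWritable(rowKey),
            new KeyValue(rowKey, family, qualifier, value))
        }

      // Write the HFiles through the new Hadoop API.
      cells.saveAsNewAPIHadoopFile(
        hfileDir,
        classOf[ImmutableBytesWritable],
        classOf[KeyValue],
        classOf[HFileOutputFormat],
        job.getConfiguration)

      // Move the generated HFiles into the table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(hfileDir), table)
    } finally {
      table.close()
    }
  }
}
```

Sorting before wrapping the keys keeps the shuffle on plain byte arrays, which avoids having to serialize ImmutableBytesWritable and KeyValue between stages. hfileDir should be a path that does not exist yet, since the output format refuses to overwrite an existing directory.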




I have been using saveAsNewAPIHadoopDataset, though with TableOutputFormat rather than HFileOutputFormat. Hopefully this still helps:


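import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.OutputFormat
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// zookeeperHost, zookeeperPort, zookeeperHbasePath and tableName are defined elsewhere
val hbaseZookeeperQuorum = s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
conf.set(TableOutputFormat.QUORUM_ADDRESS, hbaseZookeeperQuorum)
conf.set(TableOutputFormat.QUORUM_PORT, zookeeperPort.toString)
conf.setClass("mapreduce.outputformat.class", classOf[TableOutputFormat[Object]], classOf[OutputFormat[Object, Writable]])
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

val rddToSave: RDD[(Array[Byte], Array[Byte], Array[Byte])] = ... // Some RDD that contains row key, column qualifier and data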


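// Turn each (row key, column qualifier, data) tuple into a Put keyed by its row
val putRDD = rddToSave.map(tuple => {
    val (rowKey, column, data) = tuple
    val put: Put = new Put(rowKey)
    put.add(COLUMN_FAMILY_RAW_DATA_BYTES, column, data)

    (new ImmutableBytesWritable(rowKey), put)
})

// Write the Puts through TableOutputFormat using the configuration above
putRDD.saveAsNewAPIHadoopDataset(conf)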






Sorry, I just found saveAsNewAPIHadoopDataset.

Then can I use HFileOutputFormat with saveAsNewAPIHadoopDataset? Is there any example code for that?




After reading several documents, it seems that saveAsHadoopDataset cannot use HFileOutputFormat.

That is because the saveAsHadoopDataset method takes a JobConf, so it belongs to the old Hadoop API (the mapred package), while HFileOutputFormat is a member of the mapreduce package, which is the new Hadoop API.


Am I right?

If so, is there another way to bulk-load into HBase from an RDD?




Is there a way to bulk-load into HBase from an RDD?

HBase offers the HFileOutputFormat class for bulk loading from a MapReduce job, but I cannot figure out how to use it with saveAsHadoopDataset.