spark-user mailing list archives

From Benjamin Kim <>
Subject RE: write data into HBase via spark
Date Sat, 07 Dec 2013 05:32:00 GMT
Hi Philip/Hao,
I was wondering if there is a simple working example out there that I can just run and see
work; then I can customize it for our needs. Unfortunately, this explanation still confuses
me a little.
Here is a little about the environment we are working with. We have Cloudera's CDH 4.4.0 installed,
which comes with HBase 0.94.6, and we get data streamed in using Flume-NG 1.4.0. All of this
is managed with Cloudera Manager 4.7.2, which we use to set up and configure these services.
If you need any more information, or if you are able to help, I would be glad to provide whatever you need.
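[Editor's note: a Maven build against the CDH 4.4.0 stack described above would typically pull HBase from Cloudera's repository. The snippet below is only a sketch: the `0.94.6-cdh4.4.0` version string and the repository URL are assumptions based on Cloudera's usual conventions, so verify them against your cluster's actual parcel versions.]

```xml
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <!-- assumed CDH 4.4.0 version string; verify against your cluster -->
    <version>0.94.6-cdh4.4.0</version>
  </dependency>
</dependencies>
```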

Date: Fri, 6 Dec 2013 18:07:08 -0700
Subject: Re: write data into HBase via spark



    Thank you for the detailed response!  (even if delayed!)


    I'm curious to know what version of HBase you added to your pom.





    On 11/14/2013 10:38 AM, Hao REN wrote:

      Hi, Philip.

        Basically, we need PairRDDFunctions.saveAsHadoopDataset
        to do the job; since HBase is not a filesystem, saveAsHadoopFile
        doesn't help.

        def saveAsHadoopDataset(conf: JobConf): Unit

        This function takes a JobConf parameter, which should be
        configured. Essentially, you need to set the output format and
        the name of the output table.

        // step 1: JobConf setup:

        // Note: the mapred package is used, instead of the mapreduce
        // package which contains the new Hadoop APIs.
        import org.apache.hadoop.hbase.HBaseConfiguration
        import org.apache.hadoop.hbase.mapred.TableOutputFormat
        import org.apache.hadoop.hbase.client._
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable
        import org.apache.hadoop.hbase.util.Bytes
        import org.apache.hadoop.mapred.JobConf

        val conf = HBaseConfiguration.create()

        // general HBase settings
        conf.set("hbase.rootdir",
            "hdfs://" + nameNodeURL + ":" + hdfsPort + "/hbase")
        conf.setBoolean("hbase.cluster.distributed", true)
        conf.set("hbase.zookeeper.quorum", hostname)
        conf.setInt("hbase.client.scanner.caching", 10000)
        // ... some other settings

        val jobConfig: JobConf = new JobConf(conf, this.getClass)

        // Note: TableOutputFormat from the mapred package is deprecated,
        // but it is the one to use here because JobConf is an old Hadoop API.
        jobConfig.setOutputFormat(classOf[TableOutputFormat])
        jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)


        // step 2: give your mapping:

        // The last thing to do is map your local data schema to
        // the HBase one.
        // Say our HBase schema is as below:
        // row    cf:col_1    cf:col_2

        // And in Spark you have an RDD of triples, like (1, 2, 3),
        // (4, 5, 6), ...

        // So you should map RDD[(Int, Int, Int)] to
        // RDD[(ImmutableBytesWritable, Put)], where the Put carries the mapping.

        // You can define a conversion function, for example:

        def convert(triple: (Int, Int, Int)) = {
          val p = new Put(Bytes.toBytes(triple._1))
          p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), Bytes.toBytes(triple._2))
          p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_2"), Bytes.toBytes(triple._3))
          (new ImmutableBytesWritable, p)
        }

        // Suppose you have an RDD[(Int, Int, Int)] called localData;
        // then writing the data to HBase can be done by:

        localData.map(convert).saveAsHadoopDataset(jobConfig)


        Voilà. That's all you need. Hopefully, this simple example
          could help.
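[Editor's note: for a single runnable starting point, the steps above can be combined into one small Spark job. This is only a sketch under assumptions: the application name, ZooKeeper quorum (`localhost`), table name (`spark_test`), and column family (`cf`) are placeholders to replace with your own, and the target table with that column family must already exist in HBase.]

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // brings in the implicit PairRDDFunctions

object HBaseWriteExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "HBaseWriteExample")

    // Placeholder cluster settings and table name -- adjust to your environment.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    val jobConfig = new JobConf(conf, getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "spark_test")

    // Map each triple to (row key, Put) for the columns cf:col_1 and cf:col_2.
    def convert(t: (Int, Int, Int)): (ImmutableBytesWritable, Put) = {
      val p = new Put(Bytes.toBytes(t._1))
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), Bytes.toBytes(t._2))
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_2"), Bytes.toBytes(t._3))
      (new ImmutableBytesWritable, p)
    }

    val localData = sc.parallelize(Seq((1, 2, 3), (4, 5, 6)))
    localData.map(convert).saveAsHadoopDataset(jobConfig)

    sc.stop()
  }
}
```

Writing through `TableOutputFormat` goes via the region servers' normal write path, so it needs no running MapReduce cluster, only a reachable HBase.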






        2013/11/13 Philip Ogren <>



              If you have worked out the code and turned it into an
              example that you can share, then please do!  This task is
              in my queue of things to do, so any helpful details that
              you uncovered would be most appreciated.






                  On 11/13/2013 5:30 AM, Hao REN wrote:

                    Ok, I worked it out.

                      The following thread helps a lot.





                      2013/11/12 Hao REN <>

                          Could someone show me a simple
                            example of how to write data into HBase
                            via Spark?

                             I have checked the HBaseTest example;
                               it's only for reading from HBase.

                            Thank you.





          REN Hao

          Data Engineer @ ClaraVista

          Paris, France

          Tel:  +33 06 14 54 57 24