spark-user mailing list archives

From Rohit Rai <ro...@tuplejump.com>
Subject Re: Spark integration with HDFS and Cassandra simultaneously
Date Mon, 28 Oct 2013 06:53:55 GMT
Gary,

As Patrick suggests, you can read from HDFS to create an RDD and then output
that RDD to C*.

On writing to C*, look at the Cassandra example here -
https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala

Of interest are lines 104 to 127, which show how to transform an RDD into C*
mutations.
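
For reference, here is a minimal end-to-end sketch of that pattern, modeled
on that example. It assumes a Thrift-era Cassandra setup; the keyspace
("casDemo"), column family ("WordCount"), host, port, and HDFS path are
placeholders you would replace with your own:

import java.nio.ByteBuffer
import java.util.{ArrayList => JArrayList, List => JList}

import org.apache.cassandra.hadoop.{ColumnFamilyOutputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object HdfsToCassandra {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "HdfsToCassandra")

    // Tell the output format which cluster, keyspace and column family to
    // write to (all values here are placeholders).
    val job = new Job()
    job.setOutputFormatClass(classOf[ColumnFamilyOutputFormat])
    ConfigHelper.setOutputInitialAddress(job.getConfiguration, "localhost")
    ConfigHelper.setOutputRpcPort(job.getConfiguration, "9160")
    ConfigHelper.setOutputColumnFamily(job.getConfiguration, "casDemo", "WordCount")
    ConfigHelper.setOutputPartitioner(job.getConfiguration, "Murmur3Partitioner")

    // Read from HDFS and aggregate...
    val counts = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // ...then turn each result into the (row key, mutations) pairs that
    // ColumnFamilyOutputFormat expects, as in lines 104-127 of the example.
    counts.map { case (word, count) =>
      val col = new Column()
      col.setName(ByteBufferUtil.bytes("count"))
      col.setValue(ByteBufferUtil.bytes(count.toLong))
      col.setTimestamp(System.currentTimeMillis)

      val mutation = new Mutation()
      mutation.setColumn_or_supercolumn(new ColumnOrSuperColumn())
      mutation.column_or_supercolumn.setColumn(col)

      val mutations: JList[Mutation] = new JArrayList[Mutation]()
      mutations.add(mutation)
      (ByteBufferUtil.bytes(word), mutations)
    }.saveAsNewAPIHadoopFile(
      "casDemo",
      classOf[ByteBuffer],
      classOf[JList[Mutation]],
      classOf[ColumnFamilyOutputFormat],
      job.getConfiguration)
  }
}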

<shameless_plug>
If you would like your analytics team to be able to do the transforms without
worrying about understanding mutations and the like, I'll again suggest taking
a look at Calliope, in which you can provide the transforms as implicits in
the shell so they don't even need to know about them.

You can additionally provide the C* config as predefined variables, so all
the analytics folks need to know is that they are writing to C*.
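
To make that concrete, here is a rough sketch of the shape of the thing - the
names (CasWriter, saveToCas, casConf) are made up for illustration and are not
Calliope's actual API:

import java.nio.ByteBuffer
import java.util.{List => JList}

import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat
import org.apache.cassandra.thrift.Mutation
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Hypothetical helper, preloaded in the shell by whoever owns the cluster
// config; the implicit transform turns a domain record into the (row key,
// mutations) pair the output format needs.
class CasWriter[T](rdd: RDD[T]) {
  def saveToCas(keyspace: String, casConf: Configuration)
               (implicit toMutations: T => (ByteBuffer, JList[Mutation])) {
    rdd.map(toMutations).saveAsNewAPIHadoopFile(
      keyspace,
      classOf[ByteBuffer],
      classOf[JList[Mutation]],
      classOf[ColumnFamilyOutputFormat],
      casConf)
  }
}

object CasImplicits {
  implicit def rddToCasWriter[T](rdd: RDD[T]): CasWriter[T] = new CasWriter(rdd)
}

// With CasImplicits, the row transform, and a predefined casConf all loaded
// at shell startup, the analysts' side is just:
//   wordCounts.saveToCas("casDemo", casConf)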

Of course, you can already do all that without Calliope too; it just makes
your work easier. ;)

If you want to use Calliope, you can read about writing with it here -
http://tuplejump.github.io/calliope/show-me-the-code.html

And if you really don't want to sign up for the early access release, you can
get the G.A. release, along with the source and instructions to get the
binaries, from here -
https://github.com/tuplejump/calliope-release

</shameless_plug>

Regards,
Rohit
founder @ tuplejump




On Sun, Oct 27, 2013 at 10:44 AM, Patrick Wendell <pwendell@gmail.com> wrote:

> Hey Rohit,
>
> A single SparkContext can be used to read and write data in different
> formats and storage systems, including HDFS and Cassandra. For instance, you
> could do this:
>
> rdd1 = sc.textFile(XXX)  // Some text file in HDFS
> rdd1.saveAsNewAPIHadoopFile(.., classOf[ColumnFamilyOutputFormat], ...)  // Save
> into a Cassandra column family (see the Cassandra example)
>
> This is a common pattern when using Spark for ETL between different
> storage systems.
>
> - Patrick
>
>
> On Sat, Oct 26, 2013 at 7:31 PM, Gary Malouf <malouf.gary@gmail.com> wrote:
>
>> Hi Rohit,
>>
>> We are big users of the Spark Shell - it is used by our analytics team
>> for the same purposes that Hive used to be.  I guess the SparkContext
>> provided at startup would have to be tied to either HDFS or Cassandra - I
>> take it we would then manually create a second context?
>>
>> Thanks,
>>
>> Gary
>>
>>
>> On Sat, Oct 26, 2013 at 1:07 PM, Rohit Rai <rohit@tuplejump.com> wrote:
>>
>>> Hello Gary,
>>>
>>> This is very easy to do. You can read your data from HDFS using
>>> FileInputFormat, transform it into the desired rows, and write to
>>> Cassandra using ColumnFamilyOutputFormat.
>>>
>>> Our library Calliope (Apache licensed),
>>> http://tuplejump.github.io/calliope/, can make the task of writing to C*
>>> easier.
>>>
>>>
>>> In case you don't want to convert the data to rows and would rather keep
>>> it as files in Cassandra, our lightweight, Cassandra-backed,
>>> HDFS-compatible filesystem SnackFS can help you. SnackFS will be part of
>>> the next Calliope release later this month, but we can provide you access
>>> if you would like to try it out.
>>>
>>> Feel free to mail me directly in case you need any assistance.
>>>
>>>
>>> Regards,
>>> Rohit
>>> founder @ tuplejump
>>>
>>>
>>>
>>>
>>> On Sat, Oct 26, 2013 at 5:45 AM, Gary Malouf <malouf.gary@gmail.com> wrote:
>>>
>>>> We have a use case in which much of our raw data is stored in HDFS
>>>> today.  We'd like to write our Spark jobs such that they read/aggregate
>>>> data from HDFS and can output to our Cassandra cluster.
>>>>
>>>> Is there any way of doing this in Spark 0.7.3?
>>>>


-- 

____________________________
www.tuplejump.com
*The Data Engineering Platform*
