spark-user mailing list archives

From: Rohit Rai <ro...@tuplejump.com>
Subject: Re: Spark integration with HDFS and Cassandra simultaneously
Date: Mon, 28 Oct 2013 17:58:25 GMT
Hello Thunder,

We don't use the hive branch underneath the current Calliope release, as it
focuses on Spark and Cassandra integration. In the next EA release, coming
later this month, we plan to bring in the cas-handler to support Shark on
Cassandra.

Regards,
Rohit


On Mon, Oct 28, 2013 at 9:53 PM, Thunder Stumpges <
thunder.stumpges@gmail.com> wrote:

> This is great. I've been following this thread quietly, very interested!
>
> We are using Cassandra with CQL3 and composite primary keys (v2.0.1)
> with good success from our application servers. We also have
> Hadoop/Hive, but haven't been able to get Spark into production yet,
> given how busy we have been.
>
> Just Friday I found https://github.com/milliondreams/hive.git as a
> current connector for C* with Hadoop. Rohit, it looks like you're
> active on that project as well. Does Calliope use this library
> underneath?
>
> Thanks, great group here. Very excited to use Spark and Spark
> Streaming in the very near future!
>
> -Thunder
>
>
>
> On Sun, Oct 27, 2013 at 11:53 PM, Rohit Rai <rohit@tuplejump.com> wrote:
> > Gary,
> >
> > As Patrick suggests, you can read from HDFS to create an RDD and output
> > the RDD to C*.
> >
> > On writing to C*, look at the Cassandra example here -
> >
> > https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala
> >
> > Of interest will be lines 104 to 127, which show how to transform an RDD
> > into C* mutations.
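> >
> > For reference, here is a minimal sketch of that transform (the counts RDD
> > and the column name are made up for illustration; the Thrift classes are
> > the ones the example uses):
> >
> >     import java.util
> >     import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
> >     import org.apache.cassandra.utils.ByteBufferUtil
> >
> >     // counts: RDD[(String, Int)], e.g. word counts to be written to C*
> >     val mutationsRdd = counts.map { case (word, count) =>
> >       val col = new Column()
> >       col.setName(ByteBufferUtil.bytes("count"))
> >       col.setValue(ByteBufferUtil.bytes(count))
> >       col.setTimestamp(System.currentTimeMillis)
> >
> >       // Wrap the column in a mutation; each C* row takes a list of mutations
> >       val cosc = new ColumnOrSuperColumn()
> >       cosc.setColumn(col)
> >       val mutation = new Mutation()
> >       mutation.setColumn_or_supercolumn(cosc)
> >
> >       // (row key, mutation list) is the shape ColumnFamilyOutputFormat expects
> >       val mutations: util.List[Mutation] = util.Arrays.asList(mutation)
> >       (ByteBufferUtil.bytes(word), mutations)
> >     }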
> >
> > <shameless_plug>
> > If you would like your analytics team to be able to do the transforms and
> > not worry about understanding mutations and stuff, I'll again suggest
> > taking a look at Calliope, in which you can provide the transforms as
> > implicits in the shell so they don't even need to know about them.
> >
> > You can additionally provide the Cassandra config as predefined variables,
> > so all the analytics guys need to know is that they are writing to C*.
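> >
> > To make that concrete, here is a generic sketch of both ideas (this is not
> > Calliope's actual API; CasOps, saveToCas, and the column layout are made up):
> >
> >     import java.nio.ByteBuffer
> >     import java.util
> >     import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat
> >     import org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}
> >     import org.apache.cassandra.utils.ByteBufferUtil
> >     import org.apache.hadoop.conf.Configuration
> >     import org.apache.spark.SparkContext._
> >     import org.apache.spark.rdd.RDD
> >
> >     // Predefined in the shell init, along with a ready-made Cassandra conf,
> >     // so analysts only ever type rdd.saveToCas(conf)
> >     class CasOps(rdd: RDD[(String, Int)]) {
> >       def saveToCas(conf: Configuration) {
> >         rdd.map { case (k, v) =>
> >           val col = new Column()
> >           col.setName(ByteBufferUtil.bytes("value"))
> >           col.setValue(ByteBufferUtil.bytes(v))
> >           col.setTimestamp(System.currentTimeMillis)
> >           val cosc = new ColumnOrSuperColumn()
> >           cosc.setColumn(col)
> >           val m = new Mutation()
> >           m.setColumn_or_supercolumn(cosc)
> >           (ByteBufferUtil.bytes(k), util.Arrays.asList(m): util.List[Mutation])
> >         }.saveAsNewAPIHadoopFile("casout", classOf[ByteBuffer],
> >           classOf[util.List[Mutation]], classOf[ColumnFamilyOutputFormat], conf)
> >       }
> >     }
> >     implicit def toCasOps(rdd: RDD[(String, Int)]) = new CasOps(rdd)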
> >
> > Of course, you can do all that without Calliope too; it just makes your
> > work easier. ;)
> >
> > If you want to use Calliope, you can read about writing with it here -
> > http://tuplejump.github.io/calliope/show-me-the-code.html
> >
> > And if you really don't want to sign up for the early access release, you
> > can get the G.A. release, along with source and instructions to get the
> > binaries, from here -
> > https://github.com/tuplejump/calliope-release
> >
> > </shameless_plug>
> >
> > Regards,
> > Rohit
> > founder @ tuplejump
> >
> >
> >
> >
> > On Sun, Oct 27, 2013 at 10:44 AM, Patrick Wendell <pwendell@gmail.com>
> > wrote:
> >>
> >> Hey Rohit,
> >>
> >> A single SparkContext can be used to read and write files of different
> >> formats, including HDFS or Cassandra. For instance, you could do this:
> >>
> >> rdd1 = sc.textFile(XXX)  // Some text file in HDFS
> >> rdd1.saveAsHadoopFile(.., classOf[ColumnFamilyOutputFormat], ...)
> >>   // Save into Cassandra (see the Cassandra example)
> >>
> >> This is a common pattern when using Spark for ETL between different
> >> storage systems.
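> >>
> >> A minimal sketch of that pipeline, for reference (the keyspace/column
> >> family names are made up, and toMutations stands in for whatever transform
> >> builds the (key, mutation list) pairs):
> >>
> >>     import java.nio.ByteBuffer
> >>     import java.util
> >>     import org.apache.cassandra.hadoop.{ColumnFamilyOutputFormat, ConfigHelper}
> >>     import org.apache.cassandra.thrift.Mutation
> >>     import org.apache.hadoop.mapreduce.Job
> >>     import org.apache.spark.SparkContext._
> >>
> >>     val job = new Job()
> >>     ConfigHelper.setOutputInitialAddress(job.getConfiguration, "localhost")
> >>     ConfigHelper.setOutputRpcPort(job.getConfiguration, "9160")
> >>     ConfigHelper.setOutputColumnFamily(job.getConfiguration, "myKeyspace", "myCF")
> >>     ConfigHelper.setOutputPartitioner(job.getConfiguration, "Murmur3Partitioner")
> >>
> >>     val rdd1 = sc.textFile("hdfs:///some/input")  // some text file in HDFS
> >>     // toMutations (hypothetical): String => (ByteBuffer, util.List[Mutation])
> >>     val casRdd = rdd1.map(toMutations)
> >>     casRdd.saveAsNewAPIHadoopFile("casout", classOf[ByteBuffer],
> >>       classOf[util.List[Mutation]], classOf[ColumnFamilyOutputFormat],
> >>       job.getConfiguration)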
> >>
> >> - Patrick
> >>
> >>
> >> On Sat, Oct 26, 2013 at 7:31 PM, Gary Malouf <malouf.gary@gmail.com>
> >> wrote:
> >>>
> >>> Hi Rohit,
> >>>
> >>> We are big users of the Spark Shell - it is used by our analytics team
> >>> for the same purposes that Hive used to be.  The SparkContext provided
> >>> at startup would, I guess, have to be tied to either HDFS or Cassandra -
> >>> I take it we would then manually create a second context?
> >>>
> >>> Thanks,
> >>>
> >>> Gary
> >>>
> >>>
> >>> On Sat, Oct 26, 2013 at 1:07 PM, Rohit Rai <rohit@tuplejump.com> wrote:
> >>>>
> >>>> Hello Gary,
> >>>>
> >>>> This is very easy to do. You can read your data from HDFS using
> >>>> FileInputFormat, transform it into the desired rows, and write to
> >>>> Cassandra using ColumnFamilyOutputFormat.
> >>>>
> >>>> Our library Calliope (Apache licensed),
> >>>> http://tuplejump.github.io/calliope/, can make the task of writing to
> >>>> C* easier.
> >>>>
> >>>>
> >>>> In case you don't want to convert the data to rows and would rather
> >>>> keep it as files in Cassandra, our lightweight, Cassandra-backed,
> >>>> HDFS-compatible filesystem, SnackFS, can help you. SnackFS will be part
> >>>> of the next Calliope release later this month, but we can provide you
> >>>> access if you would like to try it out.
> >>>>
> >>>> Feel free to mail me directly in case you need any assistance.
> >>>>
> >>>>
> >>>> Regards,
> >>>> Rohit
> >>>> founder @ tuplejump
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Oct 26, 2013 at 5:45 AM, Gary Malouf <malouf.gary@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> We have a use case in which much of our raw data is stored in HDFS
> >>>>> today. We'd like to write our Spark jobs such that they read/aggregate
> >>>>> data from HDFS and can output to our Cassandra cluster.
> >>>>>
> >>>>> Is there any way of doing this in spark 0.7.3?



-- 

____________________________
www.tuplejump.com
*The Data Engineering Platform*
