spark-user mailing list archives

From Rohit Rai <ro...@tuplejump.com>
Subject Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.
Date Thu, 26 Jun 2014 09:03:12 GMT
Hi Gerard,

Which versions of Spark, Hadoop, Cassandra, and Calliope are you using?
We never built Calliope against Hadoop 2, as our clients either don't use
Hadoop in their deployments or use it only as the infrastructure component
for Spark, in which case the Hadoop 1 vs. Hadoop 2 distinction doesn't
matter to them.

I know of at least one case where a user built Calliope against 2.0 and was
using it happily. If you need assistance with it, we are here to help. Feel
free to reach out to me directly and we can work out a solution for you.
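For anyone attempting that build, here is a hedged sketch of what the sbt
side might look like: exclude the Hadoop 1 core artifact that cassandra-all
pulls in transitively, and pin a CDH 4.4 Hadoop client instead. The Calliope
coordinates and versions below are illustrative assumptions, not tested
values:

```scala
// build.sbt sketch (untested; adjust coordinates/versions to your setup)
libraryDependencies ++= Seq(
  ("com.tuplejump" %% "calliope" % "0.9.x")          // hypothetical coordinates
    .exclude("org.apache.hadoop", "hadoop-core"),    // drop the Hadoop 1 dep
  "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.4.0"
)
```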

Regards,
Rohit


Founder & CEO, Tuplejump, Inc.
____________________________
www.tuplejump.com
The Data Engineering Platform


On Thu, Jun 26, 2014 at 12:44 AM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Thanks Nick.
>
> We used the CassandraOutputFormat through Calliope. The Calliope API makes
> the CassandraOutputFormat quite accessible and is cool to work with. It
> worked fine at the prototype level, but we hit Hadoop version conflicts
> when we put it in our Spark environment (using our Spark assembly compiled
> against CDH 4.4). The conflict seems to be at the cassandra-all lib level,
> which is compiled against a different Hadoop version (v1).
>
> We could not get round that issue. (Any pointers in that direction?)
>
> That's why I'm trying the direct CQLSSTableWriter way but it looks blocked
> as well.
>
>  -kr, Gerard.
>
>
>
>
> On Wed, Jun 25, 2014 at 8:57 PM, Nick Pentreath <nick.pentreath@gmail.com>
> wrote:
>
>> Can you not use a Cassandra OutputFormat? It seems they have a
>> BulkOutputFormat. An example of using it with Hadoop is here:
>> http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html
>>
>> Using it with Spark will be similar to the examples:
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala
>> and
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraCQLTest.scala
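>> To bridge those examples to the bulk loader, a rough, untested sketch of
>> the Spark side could look like the following. The node address, keyspace,
>> and table names are placeholders, and the ConfigHelper setup mirrors the
>> linked CassandraTest example:

```scala
import java.nio.ByteBuffer
import java.util.{List => JList}
import org.apache.hadoop.mapreduce.Job
import org.apache.cassandra.hadoop.{BulkOutputFormat, ConfigHelper}
import org.apache.cassandra.thrift.Mutation

val job = new Job()
job.setOutputFormatClass(classOf[BulkOutputFormat])
val conf = job.getConfiguration
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1") // a Cassandra node
ConfigHelper.setOutputRpcPort(conf, "9160")
ConfigHelper.setOutputColumnFamily(conf, "customer", "rawts")
ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner")

// rdd holds (row key, mutations) pairs, as in the CassandraTest example
val rdd: org.apache.spark.rdd.RDD[(ByteBuffer, JList[Mutation])] = ???
rdd.saveAsNewAPIHadoopFile(
  "casoutput", // path is unused by this output format but required by the API
  classOf[ByteBuffer], classOf[JList[Mutation]],
  classOf[BulkOutputFormat], conf)
```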
>>
>>
>> On Wed, Jun 25, 2014 at 8:44 PM, Gerard Maas <gerard.maas@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> (My excuses for the cross-post from SO)
>>>
>>> I'm trying to create Cassandra SSTables from the results of a batch
>>> computation in Spark. Ideally, each partition should create the SSTable
>>> for the data it holds, in order to parallelize the process as much as
>>> possible (and probably even stream it to the Cassandra ring as well).
>>>
>>> After the initial hurdles with the CQLSSTableWriter (like requiring the
>>> yaml file), I'm now confronted with this issue:
>>>
>>>
>>>
>>> java.lang.RuntimeException: Attempting to load already loaded column family customer.rawts
>>>     at org.apache.cassandra.config.Schema.load(Schema.java:347)
>>>     at org.apache.cassandra.config.Schema.load(Schema.java:112)
>>>     at org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:336)
>>>
>>> I'm creating a writer on each parallel partition like this:
>>>
>>>
>>>
>>> def store(rdd: RDD[Message]) = {
>>>   rdd.foreachPartition { msgIterator =>
>>>     val writer = CQLSSTableWriter.builder()
>>>       .inDirectory("/tmp/cass")
>>>       .forTable(schema)
>>>       .using(insertSttmt)
>>>       .build()
>>>     msgIterator.foreach(msg => {...})
>>>   }
>>> }
>>>
>>> And if I'm reading the exception correctly, I can only create one writer
>>> per table in one JVM. Digging a bit further into the code, it looks like
>>> the Schema.load(...) singleton enforces that limitation.
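>>> A hedged sketch of one possible workaround (untested): hold the writer
>>> in a Scala object, so that the builder, and hence Schema.load, runs at
>>> most once per executor JVM; an object's lazy val is initialized exactly
>>> once per JVM, and thread-safely. schema and insertSttmt are the same
>>> values as in the snippet above:

```scala
// Untested sketch: one CQLSSTableWriter per executor JVM.
object SSTableWriterHolder {
  lazy val writer = CQLSSTableWriter.builder()
    .inDirectory("/tmp/cass")
    .forTable(schema)      // same CREATE TABLE schema as above
    .using(insertSttmt)    // same INSERT statement as above
    .build()
}
```

>>> The trade-off: assuming the writer itself is not thread-safe, all tasks
>>> on an executor would have to synchronize on this single instance, which
>>> serializes the writes per executor.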
>>>
>>> I guess writes to the writer will not be thread-safe, and even if they
>>> were, the contention that multiple threads would create by having all
>>> parallel tasks try to dump a few GB of data to disk at the same time
>>> would defeat the purpose of using SSTables for bulk upload anyway.
>>>
>>> So, are there ways to use the CQLSSTableWriter concurrently?
>>>
>>> If not, what is the next best option to load batch data at high
>>> throughput in Cassandra?
>>>
>>> Will the upcoming Spark-Cassandra integration help with this? (i.e.,
>>> should I just sit back and relax, and the problem will solve itself?)
>>>
>>> Thanks,
>>>
>>> Gerard.
>>>
>>
>>
>
