spark-user mailing list archives

From Kan Zhang <kzh...@apache.org>
Subject Re: pyspark and cassandra
Date Wed, 10 Sep 2014 19:35:18 GMT
Thanks for the clarification, Yadid. By "Hadoop jobs," I meant Spark jobs
that use Hadoop inputformats (as shown in the cassandra_inputformat.py
 example).
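
For reference, here is a minimal sketch of that approach, modeled on
cassandra_inputformat.py. The helper names cassandra_input_conf and
read_cassandra are hypothetical, and the property keys, input format class,
and converter classes are taken from my recollection of the Spark 1.1
example, so please check them against your Spark and Cassandra versions.

```python
def cassandra_input_conf(host, keyspace, column_family, port="9160"):
    """Build the Hadoop job configuration for Cassandra's CQL input format.

    The keys below are assumed to match those used in the Spark 1.1 example.
    """
    return {
        "cassandra.input.thrift.address": host,
        "cassandra.input.thrift.port": str(port),
        "cassandra.input.keyspace": keyspace,
        "cassandra.input.columnfamily": column_family,
        "cassandra.input.partitioner.class": "Murmur3Partitioner",
        "cassandra.input.page.row.size": "3",
    }


def read_cassandra(sc, host, keyspace, column_family):
    """Return an RDD of (key, value) pairs read through a Hadoop InputFormat.

    `sc` is an existing SparkContext. No Hadoop cluster is involved; Spark
    only reuses the InputFormat class to talk to Cassandra.
    """
    return sc.newAPIHadoopRDD(
        "org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat",
        "java.util.Map",
        "java.util.Map",
        # Converters from the Spark examples jar (assumed class names);
        # you may need to write your own for other schemas.
        keyConverter="org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter",
        conf=cassandra_input_conf(host, keyspace, column_family),
    )
```

With a running SparkContext, and with the Spark examples jar and the
Cassandra classes on the classpath, this would be called as
read_cassandra(sc, "localhost", "test", "users").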

Another possibility in the future: once Spark SQL supports Cassandra as a
data source, PySpark will be able to access Cassandra through it.

On Wed, Sep 10, 2014 at 11:37 AM, yadid ayzenberg <ayzen1@gmail.com> wrote:

>
> You do not actually need to use Hadoop to read from Cassandra. The Hadoop
> InputFormat is a standard way for Hadoop jobs to read data from various
> sources. Spark can use input formats as well.
> The storage level has nothing to do with the source of the data, be it
> Cassandra or a file system such as HDFS. By using DISK_ONLY you are telling
> Spark to cache the RDDs on disk only (and not in memory).
>
> On Wed, Sep 10, 2014 at 11:31 AM, Oleg Ruchovets <oruchovets@gmail.com>
> wrote:
>
>> Hi,
>>   I am trying to evaluate different options for Spark + Cassandra and I
>> have a couple of additional questions.
>>   My aim is to use Cassandra only, without Hadoop:
>>   1) Is it possible to use only Cassandra as the input/output for
>> PySpark?
>>   2) If I use Spark (Java, Scala), is it possible to use only
>> Cassandra for input/output, without Hadoop?
>>   3) I know there are a couple of storage-level strategies. If my
>> data set is quite big and I do not have enough memory to process it, can
>> I use the DISK_ONLY option without Hadoop (having only Cassandra)?
>>
>> Thanks
>> Oleg
>>
>> On Wed, Sep 3, 2014 at 3:08 AM, Kan Zhang <kzhang@apache.org> wrote:
>>
>>> In Spark 1.1, it is possible to read from Cassandra using Hadoop jobs.
>>> See examples/src/main/python/cassandra_inputformat.py for an example.
>>> You may need to write your own key/value converters.
>>>
>>>
>>> On Tue, Sep 2, 2014 at 11:10 AM, Oleg Ruchovets <oruchovets@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>    Is it possible to use Cassandra as input data for PySpark? I found an
>>>> example for Java:
>>>> http://java.dzone.com/articles/sparkcassandra-stack-perform?page=0,0
>>>> and I am looking for something similar for Python.
>>>>
>>>> Thanks
>>>> Oleg.
>>>>
>>>
>>>
>>
>
