spark-user mailing list archives

From Denis Mikhalkin <deni...@yahoo.com.INVALID>
Subject Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)
Date Sun, 25 Jan 2015 09:19:06 GMT
Hi Nicholas,
thanks for your reply. I checked spark-redshift - it's just for the UNLOAD data files stored on Hadoop, not for live result sets from the database.
Do you know of any example of a custom RDD which fetches the data on the fly (not reading
from HDFS)?
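To make it concrete, the kind of thing I have in mind is sketched below: the driver only distributes key ranges, and each partition opens its own connection and streams its slice of the result set. (Untested; psycopg2, the "events" table and the id bounds are just placeholders - since Redshift speaks the Postgres wire protocol, psycopg2 should work.)

from pyspark import SparkContext

sc = SparkContext(appName="redshift-fetch")

# Key ranges to scan; only these small tuples ever live on the driver.
ranges = [(lo, lo + 100000) for lo in range(0, 10000000, 100000)]

def fetch(part):
    # Runs on the executors; psycopg2 must be installed there.
    import psycopg2
    conn = psycopg2.connect(host="...", port=5439, dbname="...",
                            user="...", password="...")
    try:
        for lo, hi in part:
            # A named (server-side) cursor streams rows in batches
            # instead of buffering the whole range client-side.
            cur = conn.cursor(name="fetch_%d" % lo)
            cur.execute("SELECT id, payload FROM events "
                        "WHERE id >= %s AND id < %s", (lo, hi))
            for row in cur:
                yield row
            cur.close()
    finally:
        conn.close()

rows = sc.parallelize(ranges, len(ranges)).mapPartitions(fetch)
print(rows.count())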
Thanks.
Denis
From: Nicholas Chammas <nicholas.chammas@gmail.com>
To: Denis Mikhalkin <denismo@yahoo.com>; "user@spark.apache.org" <user@spark.apache.org>
Sent: Sunday, 25 January 2015, 3:06
Subject: Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)
I believe Databricks provides an RDD interface to Redshift. Did you check spark-packages.org?
On Saturday, 24 January 2015 at 6:45 AM Denis Mikhalkin <denismo@yahoo.com.invalid>
wrote:

Hello,

we've got some analytics data in AWS Redshift, and the data is constantly being updated.
I'd like to be able to write a query against Redshift which returns a subset of the data,
and then run a Spark job (PySpark) to do some analysis on it.
I could not find an RDD which would let me do this out of the box in Python, so I tried
writing my own. For example, I tried combining a generator (via yield) with parallelize.
It appears, though, that parallelize reads all the data into memory first: I get either an
OOM or Python starts swapping as soon as I increase the number of rows beyond trivial limits.
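You can reproduce the behaviour without a database at all (simplified sketch; the generator is just a stand-in for the real DB cursor):

from pyspark import SparkContext

sc = SparkContext(appName="parallelize-demo")

def rows():
    # stand-in for the DB cursor: a lazy stream of synthetic rows
    for i in range(10000000):
        yield (i, "x" * 100)

# parallelize() needs the whole collection up front, so the generator has
# to be materialized on the driver -- every row is buffered (and serialized)
# in driver memory before anything is distributed.
rdd = sc.parallelize(list(rows()))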
I've also looked at Java RDDs (there is an example of a MySQL-backed RDD), but it seems
that it also reads all the data into memory.
So my question is: how do I correctly feed Spark huge datasets which don't initially reside
in HDFS/S3 (ideally from PySpark, but I would appreciate any tips)?
Thanks.
Denis