spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)
Date Sat, 24 Jan 2015 16:06:19 GMT
I believe Databricks provides an RDD interface to Redshift. Did you check
spark-packages.org?
On Sat, Jan 24, 2015 at 6:45 AM Denis Mikhalkin <denismo@yahoo.com.invalid>
wrote:

> Hello,
>
> We've got some analytics data in AWS Redshift. The data is being
> constantly updated.
>
> I'd like to be able to write a query against Redshift which would return a
> subset of the data, and then run a Spark job (PySpark) to do some analysis.
>
> I could not find an RDD that would let me do this out of the box (Python),
> so I tried writing my own. For example, I tried combining a generator (via
> yield) with parallelize. It appears, though, that "parallelize" reads all
> the data into memory first: I get either an OOM error or Python starts
> swapping as soon as I increase the number of rows beyond trivial limits.
>
> I've also looked at Java RDDs (there is an example of a MySQL RDD), but
> they seem to read all the data into memory as well.
>
> So my question is: how do I correctly feed Spark with huge datasets that
> don't initially reside in HDFS/S3 (ideally for PySpark, but I would
> appreciate any tips)?
>
> Thanks.
>
> Denis
>
>
>
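One common way around the "parallelize reads everything on the driver" problem described above is to parallelize only a small list of key ranges and let each task fetch its own slice. The following is a minimal sketch under stated assumptions, not a tested Redshift integration: the table `events`, its columns `id` and `value`, and the helpers `split_ranges`, `fetch_range` and `build_rdd` are hypothetical names, and `connect` is assumed to be any zero-argument callable returning a DB-API connection. The query uses sqlite3's `?` placeholders so the sketch can be tried locally; other drivers use a different placeholder style.

```python
from typing import Callable, Iterator, List, Tuple


def split_ranges(lo: int, hi: int, parts: int) -> List[Tuple[int, int]]:
    """Split the half-open key range [lo, hi) into at most `parts` chunks."""
    step = max(1, -(-(hi - lo) // parts))  # ceiling division, at least 1
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def fetch_range(connect: Callable[[], object],
                bounds: Tuple[int, int]) -> Iterator[tuple]:
    """Open a fresh connection and yield the rows of one key range."""
    start, end = bounds
    conn = connect()  # hypothetical factory; runs on the worker, not the driver
    try:
        cur = conn.cursor()
        # Hypothetical table and columns; '?' is sqlite3's placeholder style.
        cur.execute(
            "SELECT id, value FROM events WHERE id >= ? AND id < ? ORDER BY id",
            (start, end),
        )
        for row in cur:
            yield row
    finally:
        conn.close()


def build_rdd(sc, connect, lo: int, hi: int, parts: int):
    """Build an RDD whose tasks each pull one key range from the database."""
    ranges = split_ranges(lo, hi, parts)
    # Only the small list of (start, end) tuples is shipped from the driver;
    # the rows themselves are read inside the tasks by flatMap.
    return sc.parallelize(ranges, max(1, len(ranges))).flatMap(
        lambda bounds: fetch_range(connect, bounds)
    )
```

With an existing SparkContext `sc`, this would be used as `build_rdd(sc, connect, 0, max_id, 64)`. Whether a single range is streamed or buffered in full depends on the database driver, so the ranges should be small enough that one slice fits comfortably in a worker's memory.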
