spark-user mailing list archives

From: Vinay <vinay...@gmail.com>
Subject: Re: Spark - Loading in data from CSVs and Postgres
Date: Fri, 18 Oct 2013 11:18:16 GMT
One option would be to load the CSVs from HDFS, and use JDBC support to pull the tables out of Postgres.



Regards,
Vinay

> On Oct 18, 2013, at 1:24 AM, Victor Hooi <victorhooi@yahoo.com> wrote:
> 
> Hi,
> 
> NB: I originally posted this to the Google Group, before I saw the message about how we're moving to the Apache Incubator mailing list.
> 
> I'm new to Spark, and I wanted to get some advice on the best way to load our data into it:
> - A CSV file generated each day, which contains user click data
> - A Django app, running on top of PostgreSQL, containing user and transaction data
> We do want the data load to be fairly quick, but we'd also want interactive queries to be fast, so if anybody can explain any tradeoffs we'd need to make in Spark on either front, that would be good as well. For our use cases, I'd lean towards sacrificing load speed to speed up queries.
> 
> I'm guessing we'd be looking at loading this data in once a day (or perhaps a few times throughout the day), unless there's a good way to stream in the above types of sources?
> 
> My question is - what are the current recommended practices for loading in the above?
> 
> With the CSV file, could we split it up to parallelise the load? How would we do this in Spark?
> 
> And with the Django app - I'm guessing I can either use Django's built-in ORM, or we could query the PostgreSQL database directly? Any pros/cons of either approach? Or should I be investigating something like Sqoop (or whatever the Spark equivalent tool is)?
> 
> Cheers,
> Victor
