spark-user mailing list archives

From Victor Hooi
Subject Spark - Loading in data from CSVs and Postgres
Date Fri, 18 Oct 2013 05:24:08 GMT

*NB: I originally posted this to the Google Group, before I saw the message
about how we're moving to the Apache Incubator mailing list.*

I'm new to Spark, and I wanted to get some advice on the best way to load
our data into it:

   1. A CSV file generated each day, which contains user click data
   2. A Django app, which is running on top of PostgreSQL, containing user
   and transaction data

We want the data load to be fairly quick, but we also want interactive
queries to be fast, so if anybody can explain the tradeoffs we'd need to
make in Spark between the two, that would be good as well. For our use
cases, I'd lean towards sacrificing load speed to speed up queries.

I'm guessing we'd be looking at loading this data in once a day (or perhaps
a few times throughout the day). Unless there's a good way to stream in the
above types of sources?

My question is - what are the current recommended practices for loading in
the above?

With the CSV file, could we split it up, to parallelise the load? How would
we do this in Spark?
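
For illustration, here is a minimal PySpark sketch of what a parallel CSV
load might look like. It assumes an existing SparkContext passed in as `sc`.
The column names in `CLICK_FIELDS` and the helper names `parse_click_line`
and `load_clicks` are hypothetical, not taken from our actual data. The
sketch relies on `textFile` splitting the input into partitions on its own,
so the file would not have to be split by hand beforehand.

```python
import csv

# Hypothetical column layout for the daily click CSV; the real file may differ.
CLICK_FIELDS = ["user_id", "timestamp", "url"]


def parse_click_line(line):
    """Parse one CSV line into a dict keyed by CLICK_FIELDS."""
    # csv.reader copes with quoted fields that contain commas.
    values = next(csv.reader([line]))
    return dict(zip(CLICK_FIELDS, values))


def load_clicks(sc, path, min_partitions=8):
    """Load the click CSV as a cached RDD of dicts.

    `sc` is an existing SparkContext (assumption). textFile reads the
    input as partitions of lines, and each partition is parsed in parallel.
    """
    lines = sc.textFile(path, min_partitions)
    # cache() keeps the parsed records in memory once computed, trading a
    # slower first pass for faster repeated interactive queries.
    return lines.map(parse_click_line).cache()
```

Two caveats with this sketch: parsing line by line breaks on quoted fields
that contain embedded newlines, and a header row (if any) would need to be
filtered out. Also, a single gzip-compressed file cannot be split, so it
would be read as one partition.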

And with the Django app - I'm guessing I can either use Django's built-in
ORM, or we could query the PostgreSQL database directly? Any pros/cons of
either approach? Or should I be investigating something like Sqoop (or
whatever the equivalent tool for Spark is)?
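
To make the "query PostgreSQL directly" option concrete, here is a rough
sketch under several assumptions. It uses the DataFrame JDBC reader found in
later Spark releases (it postdates this message), and it needs an existing
SparkSession passed in as `spark` and the PostgreSQL JDBC driver on the
classpath. The helper names `id_range_predicates` and `load_table`, and the
integer `id` column, are hypothetical. The idea is to split the table into
contiguous id ranges so that each range is fetched as its own partition.

```python
def id_range_predicates(min_id, max_id, num_partitions, column="id"):
    """Split [min_id, max_id] into contiguous WHERE clauses, one per partition."""
    if max_id < min_id or num_partitions < 1:
        return []
    total = max_id - min_id + 1
    step = -(-total // num_partitions)  # ceiling division
    predicates = []
    for start in range(min_id, max_id + 1, step):
        end = min(start + step - 1, max_id)
        predicates.append(f"{column} >= {start} AND {column} <= {end}")
    return predicates


def load_table(spark, jdbc_url, table, predicates, user, password):
    """Read `table` over JDBC, one Spark partition per predicate.

    `spark` is an existing SparkSession (assumption); `jdbc_url` looks like
    "jdbc:postgresql://host:5432/dbname".
    """
    props = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return spark.read.jdbc(jdbc_url, table, predicates=predicates, properties=props)
```

For example, `id_range_predicates(1, 10, 3)` yields three clauses covering
ids 1-4, 5-8 and 9-10. Going through SQL like this bypasses the Django ORM
entirely, so any logic that lives in the model layer would not be applied.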

