NB: I originally posted this to the Google Group, before I saw the message about how we're moving to the Apache Incubator mailing list.
I'm new to Spark, and I wanted to get some advice on the best way to load our data into it:
- A CSV file generated each day, which contains user click data
- A Django app, which is running on top of PostgreSQL, containing user and transaction data
We want the data load to be fairly quick, but we also want interactive queries to be fast. If anybody can explain the tradeoffs Spark would force us to make between the two, that would be good as well. For our use cases, I'd lean towards sacrificing load speed to speed up queries.
I'm guessing we'd load this data once a day (or perhaps a few times throughout the day), unless there's a good way to stream in the above types of sources?
My question is - what are the current recommended practices for loading in the above?
With the CSV file, could we split it up, to parallelise the load? How would we do this in Spark?
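To make the splitting idea concrete, here is a minimal sketch in plain Python (no Spark involved) of what I mean by breaking the daily file into parts. The function name split_csv, the part-file naming and the rows_per_part default are all made up for illustration, and it assumes the file has a single header row. I don't know whether pre-splitting like this is actually needed, or whether Spark would partition a single file by itself.

```python
import csv
import os


def split_csv(src_path, out_dir, rows_per_part=100000):
    """Split src_path into part files of at most rows_per_part data rows.

    Each part repeats the header row so it can be parsed on its own.
    Returns the list of part-file paths, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return part_paths  # empty input: nothing to split
        out, writer, rows_in_part = None, None, 0
        for row in reader:
            # Start a new part file on the first row or when the current one is full.
            if out is None or rows_in_part >= rows_per_part:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part-%05d.csv" % len(part_paths))
                part_paths.append(path)
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                rows_in_part = 0
            writer.writerow(row)
            rows_in_part += 1
        if out is not None:
            out.close()
    return part_paths
```

Calling split_csv("clicks.csv", "parts", rows_per_part=50000) would leave files like parts/part-00000.csv, parts/part-00001.csv and so on, which could then be loaded independently.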
And with the Django app, I'm guessing I can either use Django's built-in ORM or query the PostgreSQL database directly? Are there any pros/cons to either approach? Or should I be investigating something like Sqoop (or whatever the Spark equivalent is)?