There is an HCatalog project that provides an abstraction layer over different file formats, including CSV, as well as SQL. I don't know whether it works well with Spark. I posted a question about HCatalog a few days ago, but did not get any response.



On Oct 18, 2013, at 4:18 AM, Vinay <> wrote:

One option would be to use HDFS for loading the CSV, and JDBC support to load tables from Postgres.
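For the database side, the usual trick for a parallel load is to split the table into primary-key ranges and fetch each range with its own bounded query. Below is a rough sketch of that idea only, not Spark code. It uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere, and the `transactions` table, its columns, and the helper names `split_ranges` and `load_partition` are all made up for illustration.

```python
import sqlite3


def split_ranges(lower, upper, num_partitions):
    """Split the inclusive id range [lower, upper] into contiguous,
    non-overlapping inclusive sub-ranges, one per partition."""
    length = upper - lower + 1
    ranges = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        if start <= end:  # skip empty ranges when partitions > rows
            ranges.append((start, end))
    return ranges


def load_partition(conn, start, end):
    """Fetch one id range. In a real cluster each worker would run this
    against its own database connection."""
    cur = conn.execute(
        "SELECT id, amount FROM transactions "
        "WHERE id BETWEEN ? AND ? ORDER BY id",
        (start, end),
    )
    return cur.fetchall()


# Stand-in database: hypothetical table with ids 1..10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 11)],
)

# Three partitions over ids 1..10, loaded one after another here;
# the point is that each range is independent of the others.
ranges = split_ranges(1, 10, 3)
rows = [row for (start, end) in ranges for row in load_partition(conn, start, end)]
```

With Postgres you would swap in a Postgres driver (or JDBC on the JVM side) and adjust the placeholder syntax, but the range-splitting logic stays the same. It does assume a reasonably dense numeric key; with large gaps in the ids the partitions will be uneven.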


On Oct 18, 2013, at 1:24 AM, Victor Hooi <> wrote:


NB: I originally posted this to the Google Group, before I saw the message about how we're moving to the Apache Incubator mailing list.

I'm new to Spark, and I wanted to get some advice on the best way to load our data into it:
  1. A CSV file generated each day, which contains user click data
  2. A Django app, which is running on top of PostgreSQL, containing user and transaction data
We do want the data load to be fairly quick, but we'd also want interactive queries to be fast, so if anybody can explain any tradeoffs in Spark we'd need to make on either, that would be good as well. I'd be leaning towards sacrificing load speed to speed up queries, for our use cases.

I'm guessing we'd be looking at loading this data in once a day (or perhaps a few times throughout the day). Unless there's a good way to stream in the above types of sources?

My question is - what are the current recommended practices for loading in the above?

With the CSV file, could we split it up, to parallelise the load? How would we do this in Spark?

And with the Django app - I'm guessing I can either use Django's built-in ORM, or we could query the PostgreSQL database directly? Any pros/cons of either approach? Or should I be investigating something like Sqoop (or whatever the Spark equivalent tool is)?