spark-user mailing list archives

From Chester <>
Subject Re: Spark - Loading in data from CSVs and Postgres
Date Fri, 18 Oct 2013 13:14:06 GMT
There is the HCatalog project, which provides an abstraction layer over different types of
file formats, including CSV, as well as SQL. I don't know whether it works well with Spark.
I posted a question about HCatalog a few days ago, but did not get any response.


Sent from my iPad

On Oct 18, 2013, at 4:18 AM, Vinay <> wrote:

> An option would be to use HDFS for loading the CSV, and JDBC support to load tables from
> Regards,
> Vinay
> On Oct 18, 2013, at 1:24 AM, Victor Hooi <> wrote:
>> Hi,
>> NB: I originally posted this to the Google Group, before I saw the message about
>> how we're moving to the Apache Incubator mailing list.
>> I'm new to Spark, and I wanted to get some advice on the best way to load our data
>> into it:
>> A CSV file generated each day, which contains user click data
>> A Django app, which is running on top of PostgreSQL, containing user and transaction
>> We do want the data load to be fairly quick, but we'd also want interactive queries
>> to be fast, so if anybody can explain any tradeoffs in Spark we'd need to make on either,
>> that would be good as well. I'd be leaning towards sacrificing load speed to speed up queries,
>> for our use cases.
>> I'm guessing we'd be looking at loading this data in once a day (or perhaps a few
>> times throughout the day). Unless there's a good way to stream in the above types of sources?
>> My question is - what are the current recommended practices for loading in the above?
>> With the CSV file, could we split it up, to parallelise the load? How would we do
>> this in Spark?
>> And with the Django app - I'm guessing I can either use Django's in-built ORM, or
>> we could query the PostgreSQL database directly? Any pros/cons of either approach? Or should
>> I be investigating something like Sqoop (or whatever the Spark equivalent tool is?).
>> Cheers,
>> Victor
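
On the question of splitting the CSV to parallelise the load, here is a minimal sketch, assuming PySpark and a file reachable from the cluster (for example on HDFS, as suggested above). `sc.textFile` splits a text file into partitions, so each line can be parsed independently. The helper names, the example path, and the three-column layout (user id, URL, timestamp) are assumptions made for illustration; none of them come from the thread.

```python
import csv


def parse_click(line):
    """Parse one CSV line into a (user_id, url, timestamp) tuple.

    The column layout is an assumption made for this sketch.
    """
    # csv.reader copes with quoted fields that contain commas.
    fields = next(csv.reader([line]))
    user_id, url, timestamp = fields
    return int(user_id), url, timestamp


def load_clicks(sc, path, min_partitions=8):
    """Load a CSV file as a cached RDD of parsed tuples.

    `sc` is an existing SparkContext; `path` is a hypothetical location
    such as "hdfs:///data/clicks/2013-10-18.csv".
    """
    # The second argument is a suggested minimum number of partitions;
    # each partition is parsed by a separate task.
    lines = sc.textFile(path, min_partitions)
    # cache() keeps the parsed rows in memory after the first action,
    # trading load time for faster repeated interactive queries.
    return lines.map(parse_click).cache()
```

This line-by-line approach assumes the file has no header row and no quoted fields spanning several lines; either would need extra handling before the `map` step.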

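For the PostgreSQL side, one common pattern for reading a table in parallel is to split a numeric key range into contiguous sub-ranges and issue one bounded query per sub-range. The sketch below shows only that splitting step. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere; the `transactions` table, its `id` and `amount` columns, and the helper names are all hypothetical.

```python
import sqlite3


def split_range(lower, upper, num_partitions):
    """Split the inclusive range [lower, upper] into contiguous chunks."""
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        if start <= end:  # skip empty chunks when there are more partitions than keys
            bounds.append((start, end))
    return bounds


def fetch_partition(conn, start, end):
    """Fetch the rows whose id falls inside one chunk."""
    cur = conn.execute(
        "SELECT id, amount FROM transactions "
        "WHERE id BETWEEN ? AND ? ORDER BY id",
        (start, end),
    )
    return cur.fetchall()


# In-memory stand-in database with ten example rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 11)],
)

# Here the chunks are read one after another; in a real job each chunk
# would be fetched by a separate worker over its own connection.
chunks = split_range(1, 10, 3)
rows = [row for start, end in chunks for row in fetch_partition(conn, start, end)]
```

Querying the database directly in this way keeps each worker's read bounded and independent. Going through the Django ORM instead would be a different trade-off that this sketch does not cover.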