spark-user mailing list archives

From Ryan Weald <r...@weald.com>
Subject Re: Spark - Loading in data from CSVs and Postgres
Date Tue, 22 Oct 2013 15:20:50 GMT
If you are using HDFS, you also have the option of using Apache Sqoop to
load data from your SQL database into HDFS in TSV or CSV format. Once it is
on HDFS, including it in a Spark job would be trivial.
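
As a rough sketch of that last step, assuming PySpark and an already
created SparkContext passed in as `sc`: the helper names
`parse_delimited_line` and `load_delimited` are made up for illustration,
and the parser assumes no field contains an embedded newline.

```python
import csv


def parse_delimited_line(line, delimiter="\t"):
    """Parse one line of delimited text into a list of string fields."""
    # csv.reader accepts any iterable of strings, so wrap the single line
    # in a list and take the one row it yields.
    return next(csv.reader([line], delimiter=delimiter))


def load_delimited(sc, path, delimiter="\t"):
    """Return a cached RDD of parsed rows read from a delimited text file."""
    # textFile reads the file as an RDD of lines split into partitions,
    # so the map below runs in parallel across those partitions.
    lines = sc.textFile(path)
    rows = lines.map(lambda line: parse_delimited_line(line, delimiter))
    # cache() asks Spark to keep the parsed rows in memory once computed,
    # which helps when the same data is queried repeatedly.
    return rows.cache()
```

With a hypothetical path, this would be called as
`rows = load_delimited(sc, "hdfs:///path/to/clicks.tsv")`. Caching trades
some extra memory for faster repeated queries.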

-Ryan


On Fri, Oct 18, 2013 at 6:14 AM, Chester <chesterxgchen@yahoo.com> wrote:

> There is an HCatalog project which provides an abstraction layer for
> different types of file formats, including CSV, as well as SQL. I don't know if
> this works well with Spark or not. I posted a question about HCatalog a few
> days ago, but did not get any response.
>
> Chester
>
> On Oct 18, 2013, at 4:18 AM, Vinay <vinay.cn@gmail.com> wrote:
>
> An option would be to use HDFS for loading the CSV, and JDBC support to load
> tables from Postgres.
>
>
> Regards,
> Vinay
>
> On Oct 18, 2013, at 1:24 AM, Victor Hooi <victorhooi@yahoo.com> wrote:
>
> Hi,
>
>  *NB: I originally posted this to the Google Group, before I saw the
> message about how we're moving to the Apache Incubator mailing list.*
>
> I'm new to Spark, and I wanted to get some advice on the best way to load
> our data into it:
>
>    1. A CSV file generated each day, which contains user click data
>    2. A Django app, which is running on top of PostgreSQL, containing
>    user and transaction data
>
> We do want the data load to be fairly quick, but we'd also want
> interactive queries to be fast, so if anybody can explain any tradeoffs in
> Spark we'd need to make on either, that would be good as well. I'd be
> leaning towards sacrificing load speed to speed up queries, for our use
> cases.
>
> I'm guessing we'd be looking at loading this data in once a day (or
> perhaps a few times throughout the day). Unless there's a good way to
> stream in the above types of sources?
>
> My question is - what are the current recommended practices for loading in
> the above?
>
> With the CSV file, could we split it up, to parallelise the load? How
> would we do this in Spark?
>
> And with the Django app - I'm guessing I can either use Django's in-built
> ORM, or we could query the PostgreSQL database directly? Any pros/cons of
> either approach? Or should I be investigating something like Sqoop (or
> whatever the Spark equivalent tool is?).
>
> Cheers,
> Victor
>
>
