spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From El-Hassan Wanas <elhassan.wa...@gmail.com>
Subject Re: Spark JDBC reads
Date Tue, 07 Mar 2017 18:41:59 GMT
I was kind of hoping that I would use Spark in this instance to generate
that intermediate SQL as part of its workflow strategy. Sort of as a
database independent way of doing my preprocessing.
Is there any way that allows me to capture the generated SQL from catalyst?
If so I would just use JDBCRdd with that.

The other option being to generate that SQL in text format which isn't the
nicest thing to do.

On Mar 7, 2017 5:02 PM, "Subhash Sriram" <subhash.sriram@gmail.com> wrote:

> Could you create a view of the table on your JDBC data source and just
> query that from Spark?
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas <elhassan.wanas@gmail.com>
> wrote:
> >
> > As an example, this is basically what I'm doing:
> >
> >      val myDF = originalDataFrame.select(col(columnName).when(col(columnName)
> === "foobar", 0).when(col(columnName) === "foobarbaz", 1))
> >
> > Except there's much more columns and much more conditionals. The
> generated Spark workflow starts with an SQL that basically does:
> >
> >    SELECT columnName, columnName2, etc. from table;
> >
> > Then the conditionals/transformations are evaluated on the cluster.
> >
> > Is there a way from the DataSet API to force the computation to happen
> on the SQL data source in this case? Or should I work with JDBCRDD and use
> createDataFrame on that?
> >
> >
> >> On 03/07/2017 02:19 PM, Jörn Franke wrote:
> >> Can you provide some source code? I am not sure I understood the
> problem .
> >> If you want to do a preprocessing at the JDBC datasource then you can
> write your own data source. Additionally you may want to modify the sql
> statement to extract the data in the right format and push some
> preprocessing to the database.
> >>
> >>> On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wanas@gmail.com>
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> There is, as usual, a big table lying on some JDBC data source. I am
> doing some data processing on that data from Spark, however, in order to
> speed up my analysis, I use reduced encodings and minimize the general size
> of the data before processing.
> >>>
> >>> Spark has been doing a great job at generating the proper workflows
> that do that preprocessing for me, but it seems to generate those workflows
> for execution on the Spark Cluster. The issue with that is the large
> transfer cost is still incurred.
> >>>
> >>> Is there any way to force Spark to run the preprocessing on the JDBC
> data source and get the prepared output DataFrame instead?
> >>>
> >>> Thanks,
> >>>
> >>> Wanas
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >>>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>

Mime
View raw message