spark-user mailing list archives

From: Subhash Sriram <subhash.sri...@gmail.com>
Subject: Re: Spark JDBC reads
Date: Tue, 07 Mar 2017 14:02:55 GMT
Could you create a view of the table on your JDBC data source and just query that from Spark?
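
For example (a minimal sketch, not tested; the view name, jdbcUrl, and column names are placeholders): the encoding logic lives in the view definition, so only the reduced data crosses the wire.

    // On the database side, something like:
    //   CREATE VIEW encoded_table AS
    //     SELECT CASE WHEN columnName = 'foobar' THEN 0
    //                 WHEN columnName = 'foobarbaz' THEN 1
    //            END AS columnName
    //     FROM source_table;
    // Spark then reads only the already-reduced view:
    val viewDF = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)              // your JDBC connection string
      .option("dbtable", "encoded_table")  // the view, not the raw table
      .load()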

Thanks,
Subhash 

Sent from my iPhone

> On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas <elhassan.wanas@gmail.com> wrote:
> 
> As an example, this is basically what I'm doing:
> 
>      // needs: import org.apache.spark.sql.functions.{col, when}
>      val myDF = originalDataFrame.select(when(col(columnName) === "foobar", 0)
>        .when(col(columnName) === "foobarbaz", 1).as(columnName))
> 
> Except there are many more columns and many more conditionals. The generated Spark workflow starts with SQL that basically does:
> 
>    SELECT columnName, columnName2, etc. from table;
> 
> Then the conditionals/transformations are evaluated on the cluster.
> 
> Is there a way from the Dataset API to force the computation to happen on the SQL data source in this case? Or should I work with JdbcRDD and use createDataFrame on that?
> 
> 
>> On 03/07/2017 02:19 PM, Jörn Franke wrote:
>> Can you provide some source code? I am not sure I understood the problem.
>> If you want to do preprocessing at the JDBC data source, you can write your own data source. Alternatively, you can modify the SQL statement to extract the data in the right format and push some of the preprocessing to the database, as in the sketch below.
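>> 
>> For example (rough sketch, not tested; jdbcUrl and the table/column names are placeholders): the JDBC source accepts a parenthesized query wherever a table name is expected, so the database evaluates the CASE expressions and only the reduced result comes over the wire:
>> 
>>     val pushed =
>>       """(SELECT CASE WHEN columnName = 'foobar' THEN 0
>>         |            WHEN columnName = 'foobarbaz' THEN 1
>>         |       END AS columnName
>>         |  FROM source_table) t""".stripMargin
>>     val df = spark.read
>>       .format("jdbc")
>>       .option("url", jdbcUrl)     // your JDBC connection string
>>       .option("dbtable", pushed)  // the database runs the query, Spark reads the result
>>       .load()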
>> 
>>> On 7 Mar 2017, at 12:04, El-Hassan Wanas <elhassan.wanas@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> There is, as usual, a big table lying on some JDBC data source. I am doing some data processing on that data from Spark; however, to speed up my analysis, I use reduced encodings and minimize the overall size of the data before processing.
>>> 
>>> Spark has been doing a great job of generating the proper workflows that do that preprocessing for me, but it seems to generate them for execution on the Spark cluster. The issue is that the large transfer cost is still incurred.
>>> 
>>> Is there any way to force Spark to run the preprocessing on the JDBC data source and get the prepared output DataFrame instead?
>>> 
>>> Thanks,
>>> 
>>> Wanas
>>> 
>>> 
> 
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

