spark-user mailing list archives

From "Jain, Nishit" <nja...@underarmour.com>
Subject Re: How do I convert a data frame to broadcast variable?
Date Fri, 04 Nov 2016 14:00:40 GMT
Awesome, thanks Silvio!

From: Silvio Fiorito <silvio.fiorito@granturing.com>
Date: Thursday, November 3, 2016 at 12:26 PM
To: "Jain, Nishit" <njain1@underarmour.com>, Denny Lee <denny.g.lee@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: How do I convert a data frame to broadcast variable?


Hi Nishit,


Yes, the JDBC connector supports predicate pushdown and column pruning, so any selection or filter you apply to the data frame will be reflected in the query sent via JDBC.


You should be able to verify this by looking at the physical query plan:


val df = sqlContext.jdbc(....).select($"col1", $"col2")

df.explain(true)
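A fuller sketch of that verification, assuming a hypothetical HANA JDBC URL, credentials, and table/column names (substitute your own):

```scala
import sqlContext.implicits._

// Hypothetical connection details for illustration only.
val props = new java.util.Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

val df = sqlContext.read
  .jdbc("jdbc:sap://hana-host:30015", "LOOKUP_TABLE", props)
  .select($"col1", $"col2")
  .filter($"col1" > 10)

// If pruning and pushdown kick in, the physical plan's JDBC scan should
// list only col1 and col2, with the filter shown under PushedFilters.
df.explain(true)
```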


Or, if you can easily log queries submitted to your database, you can inspect the exact query there.


Thanks,

Silvio

________________________________
From: Jain, Nishit <njain1@underarmour.com>
Sent: Thursday, November 3, 2016 12:32:48 PM
To: Denny Lee; user@spark.apache.org
Subject: Re: How do I convert a data frame to broadcast variable?

Thanks Denny! That does help. I will give that a shot.

Question: If I go this route, how can I read only a few columns of a table (not the whole table) from JDBC as a data frame?
This function from the data frame reader does not give an option to read only certain columns:
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
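One common workaround (not spelled out in this thread, but supported by the JDBC data source) is to pass a derived table, i.e. a parenthesized subquery with an alias, as the `table` argument, so the column pruning happens on the database side. URL, credentials, and names below are hypothetical:

```scala
// Hypothetical connection details; the subquery runs in the database,
// so only col1 and col2 ever cross the wire.
val props = new java.util.Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

val query = "(SELECT col1, col2 FROM LOOKUP_TABLE) AS t"
val df = sqlContext.read.jdbc("jdbc:sap://hana-host:30015", query, props)
```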

On the other hand if I want to create a JDBCRdd I can specify a select query (instead of full
table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, lowerBound: Long,
upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T = JdbcRDD.resultSetToObjectArray)(implicit
arg0: ClassTag[T])
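For comparison, a minimal sketch of that JdbcRDD constructor in use, with hypothetical driver URL, credentials, and table. Note the SQL string must contain exactly two `?` placeholders, which Spark binds to each partition's lower and upper bounds:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Hypothetical connection details for illustration only.
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:sap://hana-host:30015", "spark", "secret"),
  "SELECT col1, col2 FROM LOOKUP_TABLE WHERE id >= ? AND id <= ?",
  lowerBound = 1L,
  upperBound = 1000L,
  numPartitions = 4,
  // Map each row to a pair of the two selected columns.
  (rs: ResultSet) => (rs.getString(1), rs.getString(2)))
```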

Maybe if I do df.select(col1, col2) on a data frame created from a table, Spark will be smart enough to fetch only those two columns rather than the entire table?
Any way to test this?


From: Denny Lee <denny.g.lee@gmail.com>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <njain1@underarmour.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin, so you can join against that table, presuming it's small enough to distribute. Here's a handy guide on BroadcastHashJoin: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html
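A minimal sketch of that suggestion, assuming hypothetical data frame names `lookupDf` (the small JDBC-backed lookup table) and `factDf` (the large table), joined on a hypothetical `lookup_key` column:

```scala
import org.apache.spark.sql.functions.broadcast

// broadcast() hints the planner to ship lookupDf to every executor,
// producing a BroadcastHashJoin instead of a shuffle join.
val joined = factDf.join(broadcast(lookupDf), Seq("lookup_key"))

// The physical plan should show a BroadcastHashJoin operator.
joined.explain()
```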

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit <njain1@underarmour.com> wrote:
I have a lookup table in a HANA database, and I want to create a Spark broadcast variable from it.
What would be the suggested approach? Should I read it as a data frame and then convert the data frame
into a broadcast variable?
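The approach being asked about can be sketched as follows: collect the (small) lookup data frame to the driver as a map, then broadcast it. `lookupDf` and the `key`/`value` column names are hypothetical:

```scala
// Collect the small lookup table to the driver as a Map so every
// task can do cheap local lookups against the broadcast value.
val lookupMap: Map[String, String] = lookupDf
  .select("key", "value")
  .collect()
  .map(r => r.getString(0) -> r.getString(1))
  .toMap

val lookupBc = sc.broadcast(lookupMap)

// Inside a task:
// val v = lookupBc.value.get(someKey)
```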

Thanks,
Nishit