spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jain, Nishit" <nja...@underarmour.com>
Subject Re: How do I convert a data frame to broadcast variable?
Date Thu, 03 Nov 2016 16:32:48 GMT
Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how can I only read few columns of a table
(not whole table) from JDBC as data frame.
This function from data frame reader does not give an option to read only certain columns:
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties):
DataFrame

On the other hand if I want to create a JDBCRdd I can specify a select query (instead of full
table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, lowerBound: Long,
upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T = JdbcRDD.resultSetToObjectArray)(implicit
arg0: ClassTag[T])

May be if I do df.select(col1, col2)  on data frame created via a table, will spark be smart
enough to fetch only two columns not entire table?
Any way to test this?


From: Denny Lee <denny.g.lee@gmail.com<mailto:denny.g.lee@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <njain1@underarmour.com<mailto:njain1@underarmour.com>>, "user@spark.apache.org<mailto:user@spark.apache.org>"
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin
so that way you can join to that table presuming its small enough to distributed?  Here's
a handy guide on a BroadcastHashJoin: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit <njain1@underarmour.com<mailto:njain1@underarmour.com>>
wrote:
I have a lookup table in HANA database. I want to create a spark broadcast variable for it.
What would be the suggested approach? Should I read it as an data frame and convert data frame
into broadcast variable?

Thanks,
Nishit
Mime
View raw message