spark-user mailing list archives

From Chetan Khatri <chetan.opensou...@gmail.com>
Subject Re: Performance Improvement: Collect in spark taking huge time
Date Thu, 06 May 2021 02:52:02 GMT
Hi All,

Do you think replacing the collect() (used to build a Scala collection for
the loop) with the code block below would have any benefit?

cachedColumnsAddTableDF.select("reporting_table").distinct().foreach(r => {
  r.getAs[String]("reporting_table")
})
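One thing worth noting: the foreach closure above runs on the executors and its result is discarded, so it never materializes a collection on the driver. A minimal self-contained sketch (assuming a local SparkSession and a toy DataFrame standing in for the real one) of the pattern that does return the values to the driver, while keeping the transferred data small by taking distinct values of a single column first:

```scala
import org.apache.spark.sql.SparkSession

// Toy stand-in for cachedColumnsAddTableDF from the thread.
val spark = SparkSession.builder()
  .appName("collect-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val cachedColumnsAddTableDF =
  Seq("t1", "t2", "t1").toDF("reporting_table")

// Distinct values of one column are usually few, so collect() here is
// cheap; foreach would compute the same rows but return nothing.
val tables: Set[String] = cachedColumnsAddTableDF
  .select("reporting_table")
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSet
```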


On Wed, May 5, 2021 at 10:15 PM Chetan Khatri <chetan.opensource@gmail.com>
wrote:

> Hi All, Collect in Spark is taking a huge amount of time. I want to get the
> values of one column into a Scala collection. How can I do this?
>
>   val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
>     .select(col("reporting_table"))
>     .except(clientSchemaDF)
>   logger.info(s"####### except with client-schema done " + LocalDateTime.now())
>   // newDynamicFieldTablesDF.cache()
>
>   val detailsForCreateTableDF = cachedPhoenixAppMetaDataForCreateTableDF
>     .join(broadcast(newDynamicFieldTablesDF), Seq("reporting_table"), "inner")
>   logger.info(s"####### join with newDF done " + LocalDateTime.now())
>
>   // detailsForCreateTableDF.cache()
>
>   val newDynamicFieldTablesList = newDynamicFieldTablesDF
>     .map(r => r.getString(0)).collect().toSet
>
> Later, I am iterating over this set for one of the use cases, to create a
> custom definition table:
>
>   newDynamicFieldTablesList.foreach(table => {
>     // run Create table DDL/SQL query here
>   })
>
> Thank you so much
>
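Since the DDL loop has to run on the driver anyway, collecting the distinct table names is unavoidable; the slow collect() is more likely paying for recomputing the upstream except lineage, because newDynamicFieldTablesDF feeds both the broadcast join and the collect(). A minimal sketch (with toy data standing in for the thread's Phoenix metadata) of caching that DataFrame before both uses:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder()
  .appName("cache-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy stand-ins for the DataFrames named in the thread.
val cachedPhoenixAppMetaDataForCreateTableDF =
  Seq(("t1", "a"), ("t2", "b"), ("t3", "c")).toDF("reporting_table", "detail")
val clientSchemaDF = Seq("t1").toDF("reporting_table")

// cache() materializes the except result once; the join and the collect()
// below both reuse it instead of re-running the except.
val newDynamicFieldTablesDF = cachedPhoenixAppMetaDataForCreateTableDF
  .select(col("reporting_table"))
  .except(clientSchemaDF)
  .cache()

val detailsForCreateTableDF = cachedPhoenixAppMetaDataForCreateTableDF
  .join(broadcast(newDynamicFieldTablesDF), Seq("reporting_table"), "inner")

// The distinct table names are few, so bringing them to the driver is cheap.
val newDynamicFieldTablesList =
  newDynamicFieldTablesDF.map(r => r.getString(0)).collect().toSet
```

This mirrors the commented-out `.cache()` calls already in the original message; whether it helps in practice depends on how expensive the except and its inputs are to recompute.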
