spark-user mailing list archives

From Anil Dasari <adas...@guidewire.com>
Subject Spark Pair RDD write to Hive
Date Sun, 05 Sep 2021 17:42:28 GMT
Hello,

I have a use case where users are grouped by group id and each group is persisted to a Hive table.

// pseudo code looks like below
usersRDD = sc.parallelize(..)
usersPairRDD = usersRDD.map(u => (u.groupId, u))
groupedUsers = usersPairRDD.groupByKey()
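
For context, a more complete, self-contained version of the pseudocode above (the User case class, its fields, and the sample data are just placeholders, not my real schema):

import org.apache.spark.sql.SparkSession

case class User(id: Long, name: String, groupId: String)

val spark = SparkSession.builder()
  .appName("GroupedUsersExample")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext

// sample input data
val usersRDD = sc.parallelize(Seq(
  User(1L, "alice", "g1"),
  User(2L, "bob", "g1"),
  User(3L, "carol", "g2")
))

// pair RDD keyed by groupId, then grouped: RDD[(String, Iterable[User])]
val usersPairRDD = usersRDD.map(u => (u.groupId, u))
val groupedUsers = usersPairRDD.groupByKey()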

Can I save the groupedUsers RDD into Hive tables, where each entry's key becomes the table name?

I want to avoid the approach below, as it is not a scalable solution: parallelism is limited by
the number of driver cores –

val groupIds = usersRDD.map(u => u.groupId).distinct().collect().toList

groupIds.par.map(id => {
  val rdd = usersRDD.filter(u => u.groupId == id).cache()
  // create dataframe from rdd
  // persist df to hive table using df.write.saveAsTable
})
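
For completeness, a fleshed-out sketch of that approach (the spark session, and using the group id directly as the Hive table name, are assumptions; foreach instead of map since the result is discarded):

groupIds.par.foreach(id => {
  // one Spark job per group; concurrency is bounded by the driver-side .par thread pool
  val groupRdd = usersRDD.filter(u => u.groupId == id).cache()
  val df = spark.createDataFrame(groupRdd)   // RDD[User] -> DataFrame
  df.write.saveAsTable(id)                   // assumes the group id is a valid table name
})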

Could you suggest a better approach? Thanks in advance.

-
Anil