spark-user mailing list archives

From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"
Date Fri, 13 Sep 2019 17:20:23 GMT
If you only care about deduplicating on one field, you could tag each row
with the dataframe it came from and count the tags per id, like so:

from pyspark.sql.functions import col, collect_set, lit, size

# tag each row with the dataframe it came from
df3 = (df1.withColumn('idx', lit(1))
       .union(df2.withColumn('idx', lit(2))))

# ids that appear in both dataframes, coded as df2 rows
remove_df = (df3
    .groupBy('id')
    .agg(collect_set('idx').alias('set_size'))
    .filter(size(col('set_size')) > 1)
    .select('id', lit(2).alias('idx')))

# the duplicated ids in the above are now coded for df2, so only those will
# be dropped

df3.join(remove_df, on=['id', 'idx'], how='leftanti')
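
To make the intended result concrete, here is a rough plain-Python model of the same tag, group, and anti-join steps. It is not Spark code; the sample rows and the names tags, remove and result are made up for illustration only.

```python
# Plain-Python model of the tag / group / anti-join steps (not Spark code).
# The sample rows below are invented for illustration.
df1 = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
df2 = [{"id": 2, "val": "B"}, {"id": 3, "val": "c"}]

# tag each row with its source: 1 for df1, 2 for df2
df3 = [dict(r, idx=1) for r in df1] + [dict(r, idx=2) for r in df2]

# collect the set of source tags seen for each id (like collect_set('idx'))
tags = {}
for r in df3:
    tags.setdefault(r["id"], set()).add(r["idx"])

# ids present in both sources, coded as df2 rows
remove = {(i, 2) for i, s in tags.items() if len(s) > 1}

# left anti join: keep rows whose (id, idx) pair is not in remove
result = [r for r in df3 if (r["id"], r["idx"]) not in remove]
```

With these sample rows, id 2 keeps the df1 value "b" and the df2 row for id 2 is dropped. Note that this approach only removes df2 rows whose id also appears in df1; duplicates inside a single dataframe are left alone, and the idx column stays in the result unless you drop it.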

On Fri, Sep 13, 2019 at 11:44 AM Abhinesh Hada <abhineshada@gmail.com>
wrote:

> Hi,
>
> I am trying to take the union of two dataframes and then drop duplicates
> based on the value of a specific column. However, I want to make sure that,
> when dropping duplicates, the rows from the first dataframe are kept.
>
> Example:
> df1 = df1.union(df2).dropDuplicates(['id'])

-- 


Patrick McCarthy

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016
