spark-user mailing list archives

From Everett Anderson <ever...@nuna.com.INVALID>
Subject Best way to assign unique IDs to row groups
Date Wed, 01 Mar 2017 21:50:10 GMT
Hi,

I've used functions.monotonically_increasing_id() for assigning a unique ID
to all rows, but I'd like to assign a unique ID to each group of rows with
the same key.

The two ways I can think of to do this are

Option 1: Create separate group ID table and join back

   - Create a new data frame with the distinct values of the keys.
   - Add an ID column to it via monotonically_increasing_id.
   - Join this table back with the original to add the group ID. In the
   best case, the ID table will be small enough for a broadcast join.
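
A minimal sketch of Option 1, meant for spark-shell or a script. The helper
name addGroupId, the column names "key", "value" and "group_id", and the
sample rows are all made up for illustration. The broadcast() call is only a
hint and assumes the distinct-key table really is small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, monotonically_increasing_id}

// Hypothetical helper: tags every row of `df` with an ID shared by all rows
// that have the same values in `keyCols`.
def addGroupId(df: DataFrame, keyCols: Seq[String], idCol: String = "group_id"): DataFrame = {
  // Distinct key combinations, one generated ID per combination.
  val groupIds = df.select(keyCols.head, keyCols.tail: _*)
    .distinct()
    .withColumn(idCol, monotonically_increasing_id())
  // Join the IDs back; broadcast() is a hint that assumes groupIds is small.
  df.join(broadcast(groupIds), keyCols)
}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("group-id-join")
  .getOrCreate()
import spark.implicits._

// Made-up sample data: "key" is the grouping column.
val df = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)).toDF("key", "value")
val withGroupId = addGroupId(df, Seq("key"))
withGroupId.show()
```

Because this is an inner equi-join, rows whose key is null would not match
and would be dropped. If the ID table is reused across several actions,
persisting it is probably a sensible hedge, since the generated IDs come
from a non-deterministic expression.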

Option 2: Add ID column / groupByKey / flatMapGroups

   - Add an ID column with monotonically_increasing_id
   - groupByKey
   - flatMapGroups and apply the first seen ID from the iterator to the
   other rows

Option 2 is a little annoying if you're dealing with Dataset[Row], as you
have to do a lot of work to get the fields out of the old Row objects and
create new ones.
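
For comparison, a rough sketch of Option 2 on a typed Dataset rather than
Dataset[Row], which avoids most of the Row handling. The case classes
Record, RecordWithRowId and RecordWithGroupId, the field names, and the
sample data are hypothetical. It is written for spark-shell, where case
classes defined at the prompt get encoders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

// Hypothetical record types for illustration only.
case class Record(key: String, value: Int)
case class RecordWithRowId(key: String, value: Int, row_id: Long)
case class RecordWithGroupId(key: String, value: Int, group_id: Long)

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("group-id-flatmapgroups")
  .getOrCreate()
import spark.implicits._

// Made-up sample data.
val ds = Seq(Record("a", 1), Record("b", 2), Record("a", 3), Record("c", 4), Record("b", 5)).toDS()

// Step 1: give every row its own ID.
val withRowId = ds
  .withColumn("row_id", monotonically_increasing_id())
  .as[RecordWithRowId]

// Steps 2 and 3: group by key and stamp the first seen row ID on every row of the group.
val withGroupId = withRowId
  .groupByKey(_.key)
  .flatMapGroups { (_, rows) =>
    val buffered = rows.buffered          // lets us peek at the first row without consuming it
    val groupId = buffered.head.row_id    // first seen ID becomes the group ID
    buffered.map(r => RecordWithGroupId(r.key, r.value, groupId))
  }

withGroupId.show()
```

Which row comes first in a group is not guaranteed, so the group ID is
unique per group but not stable from run to run.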

Is there a better way?

Also, generally, while assigning a unique ID to all rows seems like a
commonly needed operation, there are comments in RDD.zipWithUniqueId as
well as monotonically_increasing_id that suggest these may not be
especially reliable in various cases. Do people hit those much?
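
As background for the question above: the Spark documentation describes the
current monotonically_increasing_id implementation as putting the partition
ID in the upper 31 bits and the record number within the partition in the
lower 33 bits, so IDs are unique but not consecutive. A small plain-Scala
sketch of that layout follows. The helper name decodeId is made up, and the
layout is an implementation detail that could change.

```scala
// Hypothetical helper: splits an ID into (partition ID, record number within
// the partition), assuming the documented 31-bit / 33-bit layout.
def decodeId(id: Long): (Int, Long) = {
  val partitionId = (id >>> 33).toInt          // upper 31 bits
  val recordNumber = id & ((1L << 33) - 1)     // lower 33 bits
  (partitionId, recordNumber)
}

println(decodeId(0L))           // first record of partition 0
println(decodeId(8589934592L))  // 1L << 33: first record of partition 1
println(decodeId(8589934594L))  // third record of partition 1
```

Since the value depends on which partition a row lands in and on its
position there, the same row can get a different ID if the data is
repartitioned or recomputed in a different order.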

Thanks!

- Everett
