spark-user mailing list archives

From Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com>
Subject Re: adding a column to a groupBy (dataframe)
Date Thu, 06 Jun 2019 09:48:42 GMT
Hi Marcelo,

If you are using Spark 2.3+ with the Dataset API or Spark SQL, you can use
the built-in function `monotonically_increasing_id`.
With a little tweaking, the built-in Spark SQL functions let you achieve
this without writing custom code or defining RDDs with map/reduce
functions.

Akshay Bhardwaj
+91-97111-33849


On Thu, May 30, 2019 at 4:05 AM Marcelo Valle <marcelo.valle@ktech.com>
wrote:

> Hi all,
>
> I am new to Spark and I am trying to write an application using dataframes
> that normalizes data.
>
> So I have a dataframe `denormalized_cities` with 3 columns:  COUNTRY,
> CITY, CITY_NICKNAME
>
> Here is what I want to do:
>
>
>    1. Map by country, then for each country generate a new ID and write it
>    to a new dataframe `countries`, which would have COUNTRY_ID, COUNTRY. The
>    country ID would be generated, probably using `monotonically_increasing_id`.
>    2. For each country, write several rows to a new dataframe `cities`,
>    which would have COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would be
>    the same one generated for the countries table, and ID would be another
>    ID I generate.
>
> What's the best way to do this, ideally using only dataframes (no
> low-level RDDs), unless that is not possible?
>
> I clearly see a map/reduce process where, for each key mapped, I generate a
> row in the countries table with COUNTRY_ID, and for every value I write a
> row in the cities table. But how can I implement this in an easy and
> efficient way?
>
> I thought about using a `GroupBy Country` and then using `collect` to
> collect all values for that country, but then I don't know how to generate
> the country ID, and I am not sure about the memory efficiency of `collect`
> for a country with too many cities (bear in mind that country/city is just
> an example; my real entities are different).
>
> Could anyone point me in the direction of a good solution?
>
> Thanks,
> Marcelo.
