spark-user mailing list archives

From Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com>
Subject Re: adding a column to a groupBy (dataframe)
Date Thu, 06 Jun 2019 09:51:01 GMT
Additionally, there is a "uuid" function available as well, if that helps
your use case.


Akshay Bhardwaj
+91-97111-33849


On Thu, Jun 6, 2019 at 3:18 PM Akshay Bhardwaj <
akshay.bhardwaj1988@gmail.com> wrote:

> Hi Marcelo,
>
> If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use the
> built-in function "monotonically_increasing_id".
> With a little tweaking, Spark SQL's built-in functions let you achieve
> this without having to write custom code or define RDDs with map/reduce
> functions.
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Thu, May 30, 2019 at 4:05 AM Marcelo Valle <marcelo.valle@ktech.com>
> wrote:
>
>> Hi all,
>>
>> I am new to Spark and I am trying to write an application using
>> dataframes that normalizes data.
>>
>> So I have a dataframe `denormalized_cities` with 3 columns:  COUNTRY,
>> CITY, CITY_NICKNAME
>>
>> Here is what I want to do:
>>
>>
>>    1. Map by country, then for each country generate a new ID and write
>>    to a new dataframe `countries`, which would have COUNTRY_ID, COUNTRY.
>>    The country ID would be generated, probably using `monotonically_increasing_id`.
>>    2. For each country, write several rows to a new dataframe `cities`,
>>    which would have COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would be
>>    the same one generated for the `countries` table, and ID would be another
>>    ID I generate.
>>
>> What's the best way to do this, hopefully using only dataframes (no low
>> level RDDs) unless it's not possible?
>>
>> I clearly see a MapReduce process where, for each KEY mapped, I generate a
>> row in the countries table with COUNTRY_ID, and for every value I write a
>> row in the cities table. But how do I implement this in an easy and
>> efficient way?
>>
>> I thought about using a `GroupBy Country` and then using `collect` to
>> collect all values for that country, but then I don't know how to generate
>> the country ID, and I am not sure about the memory efficiency of `collect`
>> for a country with too many cities (bear in mind country/city is just an
>> example, my real entities are different).
>>
>> Could anyone point me in the direction of a good solution?
>>
>> Thanks,
>> Marcelo.
>>
>
