spark-user mailing list archives

From Bruno Nassivet <bruno.nassi...@gmail.com>
Subject Re: adding a column to a groupBy (dataframe)
Date Thu, 06 Jun 2019 19:57:15 GMT
Hi Marcelo,

Maybe `spark.sql.functions.explode` gives you what you need?

// Bruno


> On 6 June 2019, at 16:02, Marcelo Valle <marcelo.valle@ktech.com> wrote:
> 
> Generating the city ID (child) is easy; `monotonically_increasing_id` worked for me.
> 
> The problem is the country (parent), which has to appear in both the countries and cities data frames.
> 
> 
> 
> On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson <magnn@kth.se> wrote:
> Well, you could repartition on the city name (with the number of cities as the partition count) and use the `spark_partition_id` function or the RDD `mapPartitionsWithIndex` method to add a city ID column. Then just split the dataframe into two subsets. Be careful of hash collisions on the repartition key, though, or more than one city might end up in the same partition (you can use a custom partitioner).
> 
> It all depends on what kind of ID you want or need for the city value, i.e. whether you will later need to append new city IDs, and whether you always process the entire dataset when you make this change.
> 
> On the other hand, getting a distinct list of city names is a fast operation that shuffles only the partially aggregated names, not the full rows: add a row_number column, do a broadcast join with the original dataset, and then split into two subsets. That is probably a bit faster than reshuffling the entire dataframe. As always, the proof is in the pudding.
> 
> //Magnus
> 
> On Thu, Jun 6, 2019 at 2:53 PM Marcelo Valle <marcelo.valle@ktech.com> wrote:
> Akshay, 
> 
> First of all, thanks for the answer. I *am* using `monotonically_increasing_id`, but that's not my problem.
> My problem is that I want to output 2 tables from 1 data frame: 1 parent table with an ID for the group-by key, and 1 child table with the parent ID but without the group-by.
> 
> I was able to solve this problem by grouping, generating a parent data frame with an ID, and then joining the parent data frame with the original one to get a child data frame with the parent ID.
> 
> I would like to find a solution without this second join, though.
> 
> Thanks,
> Marcelo.
> 
> 
> On Thu, 6 Jun 2019 at 10:49, Akshay Bhardwaj <akshay.bhardwaj1988@gmail.com> wrote:
> Hi Marcelo,
> 
> If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use the built-in function `monotonically_increasing_id`.
> A little tweaking with Spark SQL's built-in functions can get you there without having to write custom code or define RDDs with map/reduce functions.
> 
> Akshay Bhardwaj
> +91-97111-33849
> 
> 
> On Thu, May 30, 2019 at 4:05 AM Marcelo Valle <marcelo.valle@ktech.com> wrote:
> Hi all, 
> 
> I am new to Spark and I am trying to write an application using dataframes that normalizes data.
> 
> So I have a dataframe `denormalized_cities` with 3 columns:  COUNTRY, CITY, CITY_NICKNAME
> 
> Here is what I want to do: 
> 
> Map by country, then for each country generate a new ID and write to a new dataframe `countries`, which would have COUNTRY_ID, COUNTRY. The country ID would be generated, probably using `monotonically_increasing_id`.
> For each country, write several rows to a new dataframe `cities`, which would have COUNTRY_ID, ID, CITY, CITY_NICKNAME. COUNTRY_ID would be the same one generated for the countries table, and ID would be another ID I generate.
> What's the best way to do this, hopefully using only dataframes (no low-level RDDs) unless that's not possible?
> 
> I clearly see a map/reduce process where, for each key mapped, I generate a row in the countries table with COUNTRY_ID, and for every value I write a row in the cities table. But how can I implement this in an easy and efficient way?
> 
> I thought about using a `GroupBy Country` and then using `collect` to collect all values for that country, but then I don't know how to generate the country ID, and I am not sure about the memory efficiency of `collect` for a country with too many cities (bear in mind country/city is just an example; my real entities are different).
> 
> Could anyone point me to the direction of a good solution?
> 
> Thanks,
> Marcelo.
> 
> This email is confidential [and may be protected by legal privilege]. If you are not the intended recipient, please do not copy or disclose its content but contact the sender immediately upon receipt.
> 
> KTech Services Ltd is registered in England as company number 10704940.
> 
> Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE, United Kingdom

