spark-user mailing list archives

From Deenar Toraskar <deenar.toras...@gmail.com>
Subject Re: Adding an indexed column
Date Thu, 04 Jun 2015 13:20:11 GMT
Or you could:

1) convert the DataFrame to an RDD
2) use mapPartitions and zipWithIndex within each partition
3) convert the RDD back to a DataFrame (you will need to make sure you
   preserve partitioning)
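
As a rough illustration of step 2, here is a minimal plain-Python sketch of a
function that could be passed to mapPartitions. It assumes rows are
(flag, product, price) tuples, that all rows for a given product already sit in
the same partition, and that the index should follow descending price. The name
index_within_product is hypothetical, and enumerate stands in for Scala's
zipWithIndex on an iterator.

```python
from itertools import groupby
from operator import itemgetter


def index_within_product(rows):
    """Yield (flag, product, price, index) for the rows of one partition.

    Assumption: every row for a given product is in this partition,
    otherwise the per-product index would restart in each partition.
    """
    # Order by product, then by price from highest to lowest.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    # Restart the counter for each product.
    for _, group in groupby(ordered, key=itemgetter(1)):
        for idx, (flag, product, price) in enumerate(group):
            yield (flag, product, price, idx)


# A plain list standing in for the contents of a single partition.
partition = [
    (1, "a", 47.808764653746),
    (1, "b", 47.808764653746),
    (1, "a", 31.9869279512204),
    (1, "b", 47.7907893713564),
    (1, "a", 16.7599200038239),
    (1, "b", 16.7599200038239),
    (1, "b", 20.3916014172137),
]

indexed = list(index_within_product(iter(partition)))
for row in indexed:
    print(row)
```

In PySpark this would presumably be wired up along the lines of
rdd.mapPartitions(index_within_product) once the RDD has been partitioned by
product; that wiring is an assumption and is not shown here.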

Deenar

On 1 June 2015 at 02:23, ayan guha <guha.ayan@gmail.com> wrote:

> If you are on Spark 1.3, use repartitionAndSortWithinPartitions followed by
> mapPartitions. In 1.4, it seems window functions will be supported.
> On 1 Jun 2015 04:10, "Ricardo Almeida" <ricardo.almeida@actnowib.com>
> wrote:
>
>> That's great. And how would you create an ordered index by partition (by
>> product in this example)?
>>
>> Assuming now a dataframe like:
>>
>> flag | product | price
>> ----------------------
>> 1    |       a |47.808764653746
>> 1    |       b |47.808764653746
>> 1    |       a |31.9869279512204
>> 1    |       b |47.7907893713564
>> 1    |       a |16.7599200038239
>> 1    |       b |16.7599200038239
>> 1    |       b |20.3916014172137
>>
>>
>> get a new dataframe such as:
>>
>> flag | product | price            | index
>> ------------------------------------------
>> 1    |       a |47.808764653746  | 0
>> 1    |       a |31.9869279512204 | 1
>> 1    |       a |16.7599200038239 | 2
>> 1    |       b |47.808764653746  | 0
>> 1    |       b |47.7907893713564 | 1
>> 1    |       b |20.3916014172137 | 2
>> 1    |       b |16.7599200038239 | 3
>>
>> On 29 May 2015 at 12:25, Wesley Miao <wesley.miao1@gmail.com> wrote:
>>
>>> One way I can see is to:
>>>
>>> 1. get rdd from your df
>>> 2. call rdd.zipWithIndex to get a new rdd
>>> 3. turn your new rdd to a new df
>>>
>>> On Fri, May 29, 2015 at 5:43 AM, Cesar Flores <cesar7@gmail.com> wrote:
>>>
>>>>
>>>> Assuming that I have the following data frame:
>>>>
>>>> flag | price
>>>> ----------------------
>>>> 1    |47.808764653746
>>>> 1    |47.808764653746
>>>> 1    |31.9869279512204
>>>> 1    |47.7907893713564
>>>> 1    |16.7599200038239
>>>> 1    |16.7599200038239
>>>> 1    |20.3916014172137
>>>>
>>>> How can I create a data frame with an extra indexed column, like the
>>>> following one:
>>>>
>>>> flag | price          | index
>>>> ----------------------|-------
>>>> 1    |47.808764653746 | 0
>>>> 1    |47.808764653746 | 1
>>>> 1    |31.9869279512204| 2
>>>> 1    |47.7907893713564| 3
>>>> 1    |16.7599200038239| 4
>>>> 1    |16.7599200038239| 5
>>>> 1    |20.3916014172137| 6
>>>>
>>>> --
>>>> Cesar Flores
>>>>
>>>
>>>
>>
