spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rex X <dnsr...@gmail.com>
Subject Re: How to do this pairing in Spark?
Date Fri, 26 Aug 2016 11:28:46 GMT
Hi Ayan,

Yes, ID=3 can be paired with ID=1, and the same for ID=9 with ID=8. BUT we
want to keep only ONE pair for the ID with Flag=0.

Since ID=1 with Flag=0 already paired with ID=2, and ID=8 paired with ID=7,
we simply delete ID=3 and ID=9.

Thanks!

Regards,
Rex


On Fri, Aug 26, 2016 at 12:46 AM, ayan guha <guha.ayan@gmail.com> wrote:

> Why 3 and 9 should be deleted? 3 can be paired with 1and 9 can be paired
> with 8.
> On 26 Aug 2016 11:00, "Rex X" <dnsring@gmail.com> wrote:
>
>> 1. Given following CSV file
>>
>> >     $cat data.csv
>> >
>> >     ID,City,Zip,Flag
>> >     1,A,95126,0
>> >     2,A,95126,1
>> >     3,A,95126,1
>> >     4,B,95124,0
>> >     5,B,95124,1
>> >     6,C,95124,0
>> >     7,C,95127,1
>> >     8,C,95127,0
>> >     9,C,95127,1
>>
>>
>> (a) where "ID" above is a primary key (unique),
>>
>> (b) for each "City" and "Zip" combination, there is one ID in max with
>> Flag=0; while it can contain multiple IDs with Flag=1 for each "City" and
>> "Zip" combination.
>>
>> (c) Flag can be 0 or 1
>>
>>
>> 2. For each ID with Flag=0, we want to pair it with another ID with
>> Flag=1 but with the same City - Zip. If one cannot find another paired ID
>> with Flag=1 and matched City - Zip, we just delete that record.
>>
>> Here is the expected result:
>>
>> >     ID,City,Zip,Flag
>> >     1,A,95126,0
>> >     2,A,95126,1
>> >     4,B,95124,0
>> >     5,B,95124,1
>> >     7,C,95127,1
>> >     8,C,95127,0
>>
>>
>> Any valuable tips how to do this pairing in Python or Scala?
>>
>> Great thanks!
>>
>> Rex
>>
>

Mime
View raw message