spark-user mailing list archives

From Eugene Morozov <>
Subject Re: using Spark or pig group by efficient in my use case?
Date Thu, 13 Aug 2015 13:24:05 GMT
I’d say Spark will be faster in this case, because it avoids writing intermediate data to
disk between the map and reduce tasks.
It’ll be faster even if you use a Combiner (I’d assume Pig is able to figure that out).
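
To make the Combiner remark concrete, here is a minimal plain-Python sketch of the idea (no Spark or Pig involved): each partition first collapses its own records per customer, and only those partial results are merged afterwards. The function names, the two-partition split, and the choice to deduplicate cities with a set are assumptions for illustration, not how Pig or Spark implement it.

```python
from collections import defaultdict

def combine_partition(records):
    """Map-side step: collect the distinct cities per customer within one partition."""
    partial = defaultdict(set)
    for customer_id, city_id in records:
        partial[customer_id].add(city_id)
    return partial

def merge_partials(partials):
    """Reduce-side step: union the per-partition city sets for each customer."""
    merged = defaultdict(set)
    for partial in partials:
        for customer_id, cities in partial.items():
            merged[customer_id] |= cities
    # Sort the cities only to make the printed output deterministic.
    return {customer_id: sorted(cities) for customer_id, cities in merged.items()}

# Sample records from the question, split into two hypothetical partitions.
partition_a = [("CustomerID1", "City1"), ("CustomerID2", "City2"), ("CustomerID3", "City1")]
partition_b = [("CustomerID1", "City3"), ("CustomerID2", "City4")]

grouped = merge_partials([combine_partition(partition_a), combine_partition(partition_b)])
for customer_id in sorted(grouped):
    print(customer_id, " ".join(grouped[customer_id]))
```

Note that with nearly as many unique customer IDs as records, each partition has little to collapse, so the combining step is likely to save less here than it would on data with few distinct keys.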

Hard to say how much faster, as it’ll depend on the disks available (SSD vs SSHD vs HDD), the size
of the data, etc.

Although only an experiment can reveal the truth =)

On 08 Aug 2015, at 05:55, linlma <> wrote:

> I have tens of millions of records, each a customer ID and city ID pair.
> There are tens of millions of unique customer IDs and only a few hundred
> unique city IDs. I want to do a merge to get all city IDs aggregated for
> each customer ID, and pull back all records. I want to do this using a
> group by on customer ID in Pig on Hadoop, and am wondering if it is the most
> efficient way.
> Also wondering if there is overhead for sorting in Hadoop (I do not care
> whether customer1 comes before customer2, as long as all cities are aggregated
> correctly for customer1 and customer2)? Do you think Spark is better?
> Here is an example of inputs,
> CustomerID1 City1
> CustomerID2 City2
> CustomerID3 City1
> CustomerID1 City3
> CustomerID2 City4
> I want output like this,
> CustomerID1 City1 City3
> CustomerID2 City2 City4
> CustomerID3 City1
> thanks in advance,
> Lin

Eugene Morozov
