spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: using Spark or pig group by efficient in my use case?
Date Sun, 09 Aug 2015 09:50:55 GMT
Why not give it a shot? In most cases Spark outruns older MapReduce jobs.
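
For reference, here is a minimal sketch of the aggregation being asked about, written in plain Python rather than Spark or Pig so that it runs anywhere. It only illustrates that grouping by key can be done with a hash map and needs no sort of the customer IDs. The function name `group_cities` and the in-memory `records` list are assumptions made for illustration. In PySpark the same shape could probably be expressed as `rdd.groupByKey().mapValues(list)`, but I have not benchmarked that against Pig for this workload.

```python
from collections import defaultdict


def group_cities(pairs):
    """Collect all city IDs for each customer ID using a hash map.

    No sorting of customer IDs is involved; cities are kept in the
    order they are encountered in the input.
    """
    grouped = defaultdict(list)
    for customer_id, city_id in pairs:
        grouped[customer_id].append(city_id)
    return dict(grouped)


# Sample input taken from the example in the original question.
records = [
    ("CustomerID1", "City1"),
    ("CustomerID2", "City2"),
    ("CustomerID3", "City1"),
    ("CustomerID1", "City3"),
    ("CustomerID2", "City4"),
]

result = group_cities(records)

# Print one line per customer: the ID followed by its cities.
for customer_id, cities in result.items():
    print(customer_id, " ".join(cities))
```

On a cluster the grouping is spread across machines by hashing the customer ID, so with tens of millions of distinct keys and only a few hundred cities, each group stays small.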

Thanks
Best Regards

On Sat, Aug 8, 2015 at 8:25 AM, linlma <linlma@gmail.com> wrote:

> I have tens of millions of records, each a customer ID and city ID pair.
> There are tens of millions of unique customer IDs, and only a few hundred
> unique city IDs. I want to merge them so that all city IDs are aggregated
> for each customer ID, and then pull back all the records. I plan to do this
> with a group by on customer ID using Pig on Hadoop, and am wondering whether
> it is the most efficient way.
>
> I am also wondering whether there is overhead for sorting in Hadoop (I do
> not care whether customer1 comes before customer2, as long as all cities
> are aggregated correctly for customer1 and customer2). Do you think Spark
> is better?
>
> Here is an example of the input:
>
> CustomerID1 City1
> CustomerID2 City2
> CustomerID3 City1
> CustomerID1 City3
> CustomerID2 City4
>
> I want output like this:
>
> CustomerID1 City1 City3
> CustomerID2 City2 City4
> CustomerID3 City1
>
> thanks in advance,
> Lin
