spark-user mailing list archives

From Kumar sp <kra18...@gmail.com>
Subject Design recommendation
Date Wed, 13 Feb 2019 16:07:42 GMT
Hello, I need a design recommendation.

I need to compute a couple of validations with minimal shuffling and better
performance. I have a nested structure in which, say, a class has n students,
and the structure is similar to this:

{ classId: String,
  StudentId: String,
  Score: Int,
  AreaCode: String }

Now I have to validate two things: first, that all students in a class
belong to the same area code, and second, whether any student is taking
more than one class. I can group by classId and count AreaCode to get
(classId, count), and similarly group by StudentId and count classId, then
join the two aggregated results on, say, StudentId. But this takes multiple
shuffles, and the data is huge, so I can't really use a broadcast join.
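
To make the two checks concrete, here is a minimal sketch in plain Python
(not Spark) over a few made-up rows. The Enrollment type, the validate
function, and the sample data are hypothetical and only for illustration;
in Spark the two groupings would roughly correspond to one groupBy on
classId and one groupBy on StudentId.

```python
from collections import defaultdict
from typing import NamedTuple


class Enrollment(NamedTuple):
    """One row of the structure above (hypothetical field names)."""
    class_id: str
    student_id: str
    score: int
    area_code: str


def validate(rows):
    """Collect both groupings in one pass over the rows.

    Returns (classes with more than one area code,
             students enrolled in more than one class).
    """
    areas_by_class = defaultdict(set)      # class_id   -> distinct area codes
    classes_by_student = defaultdict(set)  # student_id -> distinct class ids
    for r in rows:
        areas_by_class[r.class_id].add(r.area_code)
        classes_by_student[r.student_id].add(r.class_id)

    mixed_area_classes = sorted(
        c for c, areas in areas_by_class.items() if len(areas) > 1
    )
    multi_class_students = sorted(
        s for s, classes in classes_by_student.items() if len(classes) > 1
    )
    return mixed_area_classes, multi_class_students


# Made-up sample data, only to show the shape of the result.
rows = [
    Enrollment("c1", "s1", 80, "A"),
    Enrollment("c1", "s2", 75, "A"),
    Enrollment("c2", "s1", 90, "A"),
    Enrollment("c2", "s3", 60, "B"),
]
mixed_area_classes, multi_class_students = validate(rows)
```

The sketch counts distinct values per group, which is an assumption on my
side about what the counts should mean. The two results are keyed
differently (one by class, one by student), which is why combining them
needs the extra join.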

Can you please suggest a better approach?

Regards,
sk
