spark-user mailing list archives

From Kumar sp <>
Subject Design recommendation
Date Wed, 13 Feb 2019 16:07:42 GMT
Hello, I need a design recommendation.

I need to run a couple of calculations with minimal shuffling and better
performance. I have a nested structure in which, say, a class has n students,
and the structure will be similar to this:

{ classId: String,

Now I have to validate two things: that all the students in a class belong to
the same area code, and whether any student is taking more than one class.
I can group by classId and count(AreaCode) to get (classId, count), and
similarly group by StudentID and count(Class_Id) to get a second aggregated
structure, then join the two on, say, studentId. But this takes multiple
shuffles, and the data is huge, so I can't really use a broadcast join.
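
As a rough illustration of the two checks described above, here is a minimal
sketch in plain Python (not Spark) over a few made-up rows. The flattened
(classId, studentId, areaCode) layout, the sample values, and the function
name validate are all assumptions, since the structure above is truncated.
It only pins down what each check should return; it says nothing about how
to distribute the work.

```python
from collections import defaultdict

# Hypothetical flattened rows: one (classId, studentId, areaCode) per enrollment.
rows = [
    ("c1", "s1", "100"),
    ("c1", "s2", "100"),
    ("c2", "s1", "100"),
    ("c2", "s3", "200"),
]

def validate(rows):
    """Return (classes with mixed area codes, students in more than one class)."""
    area_codes_by_class = defaultdict(set)
    classes_by_student = defaultdict(set)
    # Both groupings are filled in a single pass over the rows.
    for class_id, student_id, area_code in rows:
        area_codes_by_class[class_id].add(area_code)
        classes_by_student[student_id].add(class_id)
    # Check 1: a class is invalid if its students span several area codes.
    mixed_classes = sorted(
        c for c, codes in area_codes_by_class.items() if len(codes) > 1
    )
    # Check 2: flag students enrolled in more than one class.
    multi_class_students = sorted(
        s for s, classes in classes_by_student.items() if len(classes) > 1
    )
    return mixed_classes, multi_class_students

mixed_classes, multi_class_students = validate(rows)
print(mixed_classes)         # classes whose students span several area codes
print(multi_class_students)  # students taking more than one class
```

In Spark terms the two dictionaries correspond to the two group-by
aggregations mentioned above, one keyed by classId and one keyed by studentId.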

Can you please suggest a better approach?

