spark-user mailing list archives

From Arun Luthra
Subject GC problem doing fuzzy join
Date Tue, 18 Jun 2019 19:18:16 GMT
I'm trying to do a brute-force fuzzy join where I compare N records against
N other records, for N^2 total comparisons.

The table is medium-sized and fits in memory, so I collect it and put it
into a broadcast variable.

The other copy of the table is in an RDD. I call the RDD's map operation,
and for each record in the RDD the function takes the broadcast table and
filters it. I am seeing heavy GC activity, and I suspect the cause is the
repeated creation and disposal of large filtered copies of the broadcast
table.
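
To make the pattern concrete, here is a minimal sketch in plain Python
rather than Spark, so it runs standalone. It is not the actual job: the
sample data, the names is_fuzzy_match and match_record, the 0.8 threshold,
and the use of difflib as the similarity test are all assumptions made for
illustration. The point is only that every record triggers its own filter
pass over the whole table and allocates a fresh result list.

```python
from difflib import SequenceMatcher

# Hypothetical sample data standing in for the collected table.
TABLE = ["john smith", "jon smith", "jane doe", "janet doe", "alice wong"]


def is_fuzzy_match(a, b, threshold=0.8):
    """Hypothetical similarity test; the post does not say which one is used."""
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold


def match_record(record, table):
    # One full pass over the table per record. The list built here is a new
    # object on every call and becomes garbage once the caller is done with it.
    return [other for other in table if is_fuzzy_match(record, other)]


# In Spark, the loop below would be an RDD map over the records, with the
# table read from a broadcast variable (sc.broadcast(TABLE), then .value).
results = {record: match_record(record, TABLE) for record in TABLE}

# N records, each filtered against N table entries.
comparisons = len(TABLE) ** 2
```

With N records the filter runs N times, so any per-call allocation that
scales with the table size is repeated N times as well.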

Is there a way to fix this pattern?

