spark-dev mailing list archives

From Сергей Лихоман <>
Subject Compact RDD representation
Date Sun, 19 Jul 2015 17:40:23 GMT

I am looking for a suitable topic for a Master's degree project (something along the
lines of scalability problems and improvements for Spark Streaming), and it seems that
introducing a grouped RDD (for example: instead of storing
"Spark", "Spark", "Spark", store ("Spark", 3)) could:

1. Reduce the memory needed for an RDD (roughly, memory usage becomes proportional to the number of unique elements)
2. Improve performance (there is no need to apply a function several times to the same value)
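To make the idea concrete, here is a minimal sketch in plain Scala collections standing in for an RDD (no Spark dependency); in Spark itself the grouped form would correspond to something like `rdd.map(w => (w, 1L)).reduceByKey(_ + _)`. The object and value names are illustrative, not a proposed API:

```scala
// A sketch of the proposed "grouped" representation, using plain Scala
// collections as a stand-in for an RDD.
object GroupedSketch {
  // Plain representation: one entry per element, duplicates stored repeatedly.
  val words: Seq[String] = Seq("Spark", "Spark", "Spark", "Flink", "Spark")

  // Grouped representation: one (value, count) pair per unique element.
  // With Spark this would be: rdd.map(w => (w, 1L)).reduceByKey(_ + _)
  val grouped: Map[String, Long] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }

  // A map over the grouped form applies the function once per unique value,
  // not once per element, while the counts are carried along unchanged.
  val mapped: Map[String, Long] =
    grouped.map { case (w, n) => (w.toUpperCase, n) }

  def main(args: Array[String]): Unit = {
    println(grouped) // Map(Spark -> 4, Flink -> 1)
    println(mapped)  // Map(SPARK -> 4, FLINK -> 1)
  }
}
```

Storage drops from one entry per element to one pair per unique value, and an expensive per-element function runs once per unique value instead.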

May I create a ticket and introduce an API for grouped RDDs? Does this make sense?
I would also greatly appreciate any criticism and ideas.
