spark-user mailing list archives

From nitin <>
Subject SchemaRDD partition on specific column values?
Date Thu, 04 Dec 2014 10:00:39 GMT
Hi All,

I want to hash partition (and then cache) a SchemaRDD in such a way that the
partitions are based on the hash of the values of a column (the "ID" column, in my
case).

e.g. if my table has an "ID" column with values 1,2,3,4,5,6,7,8,9 and
spark.sql.shuffle.partitions is configured as 3, then there should be 3
partitions, and all the tuples for a given ID (say ID=1) should be present in
one particular partition.
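To make the expected placement concrete, here is a minimal, Spark-free sketch of the assignment rule Spark's HashPartitioner uses (nonNegativeMod of the key's hashCode); Python's built-in hash() is used as a stand-in for the JVM hashCode, which gives the same result for small integer IDs:

```python
def partition_for(key, num_partitions):
    """Mimic HashPartitioner.getPartition: a non-negative modulus of the
    key's hash. Python's hash() stands in for the JVM hashCode here
    (identical for small non-negative ints)."""
    return hash(key) % num_partitions  # Python's % is non-negative for n > 0

# The example from the question: IDs 1..9 spread over 3 partitions.
by_partition = {}
for i in range(1, 10):
    by_partition.setdefault(partition_for(i, 3), []).append(i)

for p in sorted(by_partition):
    print(f"partition {p} -> {by_partition[p]}")
# partition 0 -> [3, 6, 9]
# partition 1 -> [1, 4, 7]
# partition 2 -> [2, 5, 8]
```

Every tuple with ID=1 lands in partition 1, every tuple with ID=3 in partition 0, and so on: the partition is a pure function of the ID.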

My actual use case is that I repeatedly get queries that join 2 cached tables
on the ID column. Each such query first partitions both tables on ID and then
applies the JOIN. I want to avoid that per-query partitioning on ID by
pre-partitioning the tables up front (and then caching them).
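A small Spark-free sketch of why the pre-partitioning would pay off: once both tables have been bucketed with the same partitioner on ID, matching rows are guaranteed to sit in the same bucket, so the join can proceed bucket-by-bucket with no cross-partition shuffle. The function names here are hypothetical illustrations, not Spark APIs:

```python
def partition_for(key, n):
    return hash(key) % n  # stand-in for HashPartitioner (see above rule)

def pre_partition(rows, n):
    """Roughly what partitioning an RDD of (ID, value) pairs achieves:
    rows bucketed by hash of the ID."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[partition_for(key, n)].append((key, value))
    return parts

def co_partitioned_join(left_parts, right_parts):
    """Join partition-by-partition: both sides used the same partitioner,
    so equal IDs are co-located and no shuffle is needed."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        right_by_key = {}
        for k, v in rp:
            right_by_key.setdefault(k, []).append(v)
        for k, v in lp:
            for rv in right_by_key.get(k, []):
                out.append((k, (v, rv)))
    return out

left = pre_partition([(1, "a"), (2, "b"), (4, "d")], 3)
right = pre_partition([(1, "x"), (2, "y"), (3, "z")], 3)
print(co_partitioned_join(left, right))
# [(1, ('a', 'x')), (2, ('b', 'y'))]
```

The partitioning cost is paid once at caching time instead of on every join query.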

Thanks in advance
