spark-user mailing list archives

From nitin <nitin2go...@gmail.com>
Subject SchemaRDD partition on specific column values?
Date Thu, 04 Dec 2014 10:00:39 GMT
Hi All,

I want to hash-partition (and then cache) a SchemaRDD in such a way that the
partitions are based on the hash of the values of one column (the "ID" column
in my case).

For example, if my table has an "ID" column with values 1,2,3,4,5,6,7,8,9 and
spark.sql.shuffle.partitions is set to 3, then there should be 3 partitions,
and all the tuples with, say, ID=1 should be present in one particular
partition.
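
A minimal sketch of the layout described above, written in plain Python rather than against the Spark API, so it only illustrates the idea. The names `partition_for`, `hash_partition`, and `NUM_PARTITIONS` are hypothetical, and the exact partition each ID lands in depends on the hash function (here CPython's built-in `hash`, which maps small non-negative integers to themselves); Spark's own hashing may place the IDs differently.

```python
NUM_PARTITIONS = 3  # assumption: stands in for spark.sql.shuffle.partitions


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition index in [0, num_partitions)."""
    # Python's % always returns a non-negative result for a positive modulus.
    return hash(key) % num_partitions


def hash_partition(rows, key_index=0, num_partitions=NUM_PARTITIONS):
    """Group rows into num_partitions lists by the hash of one column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[key_index], num_partitions)].append(row)
    return partitions


# Two tuples per ID, for IDs 1..9 (made-up illustration data).
rows = [(i, "row-%d-%s" % (i, tag)) for i in range(1, 10) for tag in ("a", "b")]
parts = hash_partition(rows)

for index, part in enumerate(parts):
    print(index, sorted({row[0] for row in part}))
```

Every tuple with the same ID ends up in exactly one of the 3 partitions, which is the property asked for here.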

My actual use case is that I always get a query in which I have to join 2
cached tables on the ID column. Spark first partitions both tables on ID and
then applies the JOIN. I want to avoid that partitioning step at query time by
partitioning the tables on ID in advance (and then caching them).
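
The reasoning behind this can be sketched in plain Python (again, not the Spark API; `hash_partition` and `join_copartitioned` are hypothetical helper names, and the data is made up). If both tables are partitioned once with the same hash function and the same number of partitions, matching IDs are guaranteed to sit in partitions with the same index, so the join can be done partition by partition without redistributing any rows.

```python
def hash_partition(rows, num_partitions):
    """Partition (id, value) rows by the hash of the id."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[0]) % num_partitions].append(row)
    return parts


def join_copartitioned(left_parts, right_parts):
    """Inner-join two tables that were partitioned the same way.

    Only partitions with the same index are compared, so no row
    has to move between partitions at join time.
    """
    joined = []
    for left, right in zip(left_parts, right_parts):
        # Build a lookup on the right-hand partition only.
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


left = [(i, "L%d" % i) for i in range(1, 10)]       # IDs 1..9
right = [(i, "R%d" % i) for i in range(1, 10, 2)]   # IDs 1, 3, 5, 7, 9

# Partition once up front (the "preprocess and cache" step).
left_parts = hash_partition(left, 3)
right_parts = hash_partition(right, 3)

result = join_copartitioned(left_parts, right_parts)
print(sorted(result))
```

Whether Spark SQL actually skips the shuffle for cached SchemaRDDs prepared this way is exactly the open question of this message; the sketch only shows why co-partitioned inputs make that possible in principle.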

Thanks in Advance



