spark-user mailing list archives

From dsiegel <>
Subject Re: Does filter on an RDD scan every data item ?
Date Thu, 11 Dec 2014 20:38:13 GMT
Also, you may want to use .lookup() instead of .filter():

lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done
efficiently if the RDD has a known partitioner by only searching the
partition that the key maps to.

You might want to partition your first batch of data with .partitionBy()
using your CustomTuple hash implementation, persist it, and avoid running any
operations on it that can remove its partitioner object (for example, map()
drops the partitioner, while mapValues() and filter() keep it).
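
A minimal sketch of that workflow, in case it helps. The CustomTuple case
class, the sample data, the partition count of 4, and the local[2] master
are all placeholders I made up for illustration; substitute your own key
type (it needs consistent equals/hashCode) and settings. On Spark versions
before 1.3 you would also need `import org.apache.spark.SparkContext._` to
get the pair-RDD methods.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LookupSketch {
  // Hypothetical stand-in for the CustomTuple key from this thread.
  // A case class provides consistent equals/hashCode for HashPartitioner.
  case class CustomTuple(a: Int, b: String)

  // Returns (values found by lookup, partitioner kept by mapValues?, kept by map?)
  def demo(): (Seq[String], Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(
        CustomTuple(1, "x") -> "first",
        CustomTuple(2, "y") -> "second",
        CustomTuple(1, "x") -> "third"
      ))

      // Hash-partition by key and persist, so the shuffle happens only once.
      val partitioned = pairs
        .partitionBy(new HashPartitioner(4))
        .persist(StorageLevel.MEMORY_ONLY)

      // With a known partitioner, lookup() searches only the partition
      // that the key maps to instead of scanning every partition.
      val hits: Seq[String] = partitioned.lookup(CustomTuple(1, "x"))

      // mapValues() keeps the partitioner; map() discards it.
      val kept = partitioned.mapValues(_.toUpperCase).partitioner.isDefined
      val lost = partitioned.map { case (k, v) => (k, v.toUpperCase) }.partitioner.isDefined

      (hits, kept, lost)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (hits, kept, lost) = demo()
    println(hits.sorted.mkString(","))
    println(s"mapValues keeps partitioner: $kept, map keeps partitioner: $lost")
  }
}
```

Checking rdd.partitioner.isDefined, as above, is a quick way to confirm that
a given transformation has not dropped the partitioner before you rely on
.lookup() being cheap.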
