I am working toward running some of our Spark Streaming jobs on a cluster. However, I have not found any documentation on best practices for this. I have picked up some lore here and there, though:
1. Keeping task latency low is paramount. The standalone "spark" master is said to have lower task latency than Mesos, but "local" is the best.
2. It is possible to configure range partitioning so that ranges of keys for incoming events are sent to the same node for processing. This allows Spark Streaming to perform parallel computation using multiple nodes.
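To make clear what I mean by range partitioning in #2, here is a minimal sketch in plain Python. It does not use Spark at all, and the boundary values and the names `range_partition` and `assign` are made up for illustration. It only shows the idea that every key falling within the same range maps to the same partition (and so, presumably, to the same node).

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical split points for the key space: 3 boundaries give 4 partitions.
BOUNDARIES = ["g", "n", "t"]


def range_partition(key, boundaries=BOUNDARIES):
    """Return the index of the partition whose key range contains `key`."""
    # Keys below the first boundary go to partition 0, keys at or above
    # the last boundary go to the last partition.
    return bisect_right(boundaries, key)


def assign(events, boundaries=BOUNDARIES):
    """Group (key, value) events by the partition their key falls into."""
    partitions = defaultdict(list)
    for key, value in events:
        partitions[range_partition(key, boundaries)].append((key, value))
    return dict(partitions)


# Made-up sample events, keyed by a string.
events = [("apple", 1), ("kiwi", 2), ("zebra", 3), ("avocado", 4)]
grouped = assign(events)
```

Here "apple" and "avocado" land in the same partition because both sort before "g", so whatever processes that partition sees all events for those keys together.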
Here's what I need: What is the best way to configure a Spark Streaming job to use range partitioning, a la #2 above? I need the details: what has to change in the job's source code, whether to use the "spark" master, and so on.
Thanks in advance,