spark-dev mailing list archives

From Nicholas Chammas
Subject Re: Replacing Spark's native scheduler with Sparrow
Date Sat, 08 Nov 2014 05:23:52 GMT
Hmm, relevant quote from section 3.3:

> newer frameworks like Spark [35] reduce the overhead to 5ms. To support
> tasks that complete in hundreds of milliseconds, we argue for reducing
> task launch overhead even further to 1ms so that launch overhead
> constitutes at most 1% of task runtime. By maintaining an active thread
> pool for task execution on each worker node and caching binaries, task
> launch overhead can be reduced to the time to make a remote procedure call
> to the slave machine to launch the task. Today’s datacenter networks easily
> allow a RPC to complete within 1ms. In fact, recent work showed that 10μs
> RPCs are possible in the short term [26]; thus, with careful engineering,
> we believe task launch overheads of 50μs are attainable. 50μs task
> launch overheads would enable even smaller tasks that could read data from
> in-memory or from flash storage in order to complete in milliseconds.

So it looks like I misunderstood the current cost of task initialization.
It's already as low as 5ms (and not 100ms)?
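
As a back-of-the-envelope check on the quoted figures, here is a minimal
sketch that computes launch overhead as a fraction of task runtime. The
function name and the specific task runtimes (100ms and 5ms) are my own
illustrative assumptions; only the 5ms, 1ms, and 50μs overheads come from
the quote.

```python
def launch_overhead_fraction(launch_overhead_s, task_runtime_s):
    """Return launch overhead as a fraction of task runtime."""
    if task_runtime_s <= 0:
        raise ValueError("task runtime must be positive")
    return launch_overhead_s / task_runtime_s


MS = 1e-3  # one millisecond, in seconds
US = 1e-6  # one microsecond, in seconds

# (label, launch overhead, task runtime).
# The task runtimes are assumed values chosen for illustration only.
scenarios = [
    ("5ms launch, 100ms task", 5 * MS, 100 * MS),
    ("1ms launch, 100ms task", 1 * MS, 100 * MS),
    ("50us launch, 5ms task", 50 * US, 5 * MS),
]

for label, overhead, runtime in scenarios:
    fraction = launch_overhead_fraction(overhead, runtime)
    print(f"{label}: {fraction:.1%} of task runtime")
```

Under these assumptions, a 1ms launch on a 100ms task sits right at the 1%
target from the quote, while a 5ms launch on the same task would be about
five times that.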


On Fri, Nov 7, 2014 at 11:15 PM, Shivaram Venkataraman wrote:

> On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas wrote:
>> Sounds good. I'm looking forward to tracking improvements in this area.
>> Also, just to connect some more dots here, I just remembered that there is
>> currently an initiative to add an IndexedRDD interface. Some
>> interesting use cases mentioned there include (emphasis added):
>>
>> > To address these problems, we propose IndexedRDD, an efficient key-value
>> > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
>> > key uniqueness and pre-indexing the entries for efficient joins and
>> > *point lookups, updates, and deletions*.
>> >
>> > GraphX would be the first user of IndexedRDD, since it currently
>> > implements a limited form of this functionality in VertexRDD. We envision
>> > a variety of other uses for IndexedRDD, including *streaming updates* to
>> > RDDs, *direct serving* from RDDs, and as an execution strategy for Spark SQL.
>>
>> Maybe some day we'll have Spark clusters directly serving up point lookups
>> or updates. I imagine the tasks running on clusters like that would be tiny
>> and would benefit from very low task startup times and scheduling latency.
>> Am I painting that picture correctly?
> Yeah - we painted a similar picture in a short paper last year titled
> "The Case for Tiny Tasks in Compute Clusters"
>> Anyway, thanks for explaining the current status of Sparrow.
>> Nick
