spark-dev mailing list archives

From Charles Reiss <char...@eecs.berkeley.edu>
Subject Re: RDDs with no partitions
Date Fri, 23 Aug 2013 18:29:41 GMT
On 8/23/13 11:07 , Reynold Xin wrote:
> But is there any reason to do the handling of those beyond runJob? 

I think the zero-partition RDD might be used in a shuffle with a non-zero
output partition count (e.g. emptyRdd.reduceByKey(f, 1)).
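
To make the case above concrete, here is a minimal toy model in Python. It is not
Spark code: ToyRDD, reduce_by_key and stage_task_counts are hypothetical names
invented for illustration, and the only assumption carried over from Spark is that
a shuffle's output partition count is chosen by the caller rather than inherited
from the parent RDD.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: a partition count plus an optional shuffle parent."""
    num_partitions: int
    shuffle_parent: Optional["ToyRDD"] = None


def reduce_by_key(parent: ToyRDD, num_output_partitions: int) -> ToyRDD:
    # The output partition count comes from the caller, not from the parent,
    # so a zero-partition parent can still feed a non-empty reduce side.
    return ToyRDD(num_output_partitions, shuffle_parent=parent)


def stage_task_counts(final_rdd: ToyRDD) -> List[int]:
    """Task count per stage, ordered from the earliest shuffle-map stage to the result stage."""
    counts = []
    rdd = final_rdd
    while rdd is not None:
        counts.append(rdd.num_partitions)
        rdd = rdd.shuffle_parent
    return list(reversed(counts))


empty = ToyRDD(num_partitions=0)
reduced = reduce_by_key(empty, 1)
print(stage_task_counts(reduced))  # [0, 1]
```

In this model the result stage has one task, so a check on the final RDD's
partition count alone would not short-circuit the job, while the shuffle-map
stage behind it has zero tasks.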

- Charles

> 
> On Fri, Aug 23, 2013 at 11:04 AM, Charles Reiss <charles@eecs.berkeley.edu> wrote:
> 
>     On 8/22/13 22:57 , Reynold Xin wrote:
>     > I actually don't think there is any reason to have 0 partition stages, be it
>     > either result stage or shufflemap.
>     >
>     > It looks like Charles added those. Charles, any comments?
> 
>     One can get 0-partition RDDs (and thus 0-partition stages of either type)
>     pretty easily with PartitionPruningRDD. Given that, e.g., Shark uses this with
>     partition statistics, I imagine that real programs can hit the 0-partition
>     stage case this way.
> 
>     One can also get a 0-partition RDD from sc.textFile() on an empty directory,
>     and presumably some uses of hadoopFile/etc., though I won't claim that these
>     are important to support.
> 
>     - Charles
> 
>     >
>     >
>     >
>     > On Thu, Aug 22, 2013 at 10:51 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>     >
>     >     We already do a quick, no-op return from DAGScheduler.runJob when there are
>     >     no partitions submitted with the job, so running a job with no partitions
>     >     in the usual way isn't a problem.  That still leaves at least the "zero
>     >     split job" in the DAGSchedulerSuite and the possibility of shuffleMap
>     >     stages with no partitions.  Is "zero split job" testing anything
>     >     meaningful, or is its only purpose to cause me headaches?  Can shuffleMap
>     >     stages actually have no partitions, or is this (also) a distraction posing
>     >     as a legitimate problem?
>     >
>     >     In short, when are RDDs with no partitions real things that we actually
>     >     have to deal with?
>     >
>     >
>     >
>     >     On Thu, Aug 22, 2013 at 9:20 PM, Reynold Xin <reynoldx@gmail.com> wrote:
>     >
>     >     > Being the guy that added the empty partition rdd, I second your idea that
>     >     > we should just short-circuit those in DAGScheduler.runJob.
>     >     >
>     >     >
>     >     >
>     >     >
>     >     > On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>     >     >
>     >     > > So how do these get created, and are we really handling them correctly?
>     >     > > What is prompting my questions is that I'm looking at making sure that the
>     >     > > various data structures in the DAGScheduler shrink when appropriate instead
>     >     > > of growing without bounds.  Jobs with no partitions and the "zero split
>     >     > > job" test in the DAGSchedulerSuite really throw a wrench into the works.
>     >     > > That's because in the DAGScheduler we go part way along in handling this
>     >     > > weird case as though it were a normal job submission: we start initializing
>     >     > > or adding to various data structures, etc.; then we pretty much bail out in
>     >     > > submitMissingTasks when we find out that there actually are no tasks to be
>     >     > > done.  We remove the stage from the set of running stages, but we don't
>     >     > > ever clean up pendingTasks, activeJobs, stageIdToStage, stageToInfos, and
>     >     > > others because no tasks are ever submitted for the stage, so there are
>     >     > > never any completion events, nor is the stage aborted -- i.e. the normal
>     >     > > paths to cleanup are never taken.  The end result is that shuffleMap stages
>     >     > > with no partitions (can these even occur?) never complete, and jobs with
>     >     > > no partitions would seem also to persist forever.
>     >     > >
>     >     > > In short, RDDs with no partitions do really weird things to the
>     >     > > DAGScheduler.
>     >     > >
>     >     > > So, if there is no way to effectively prevent the creation of RDDs with no
>     >     > > partitions, is there any reason why we can't short-circuit their handling
>     >     > > within the DAGScheduler so that data structures are never built or
>     >     > > populated for these weird things, or must we add a bunch of special-case
>     >     > > cleanup code to submitMissingStages?
>     >     > >
>     >     >
>     >
>     >
> 
> 

