spark-dev mailing list archives

From Reynold Xin <reyno...@gmail.com>
Subject Re: RDDs with no partitions
Date Fri, 23 Aug 2013 18:07:41 GMT
But is there any reason to do the handling of those beyond runJob?


On Fri, Aug 23, 2013 at 11:04 AM, Charles Reiss <charles@eecs.berkeley.edu> wrote:

> On 8/22/13 22:57 , Reynold Xin wrote:
> > I actually don't think there is any reason to have 0 partition stages,
> > be it either result stage or shufflemap.
> >
> > It looks like Charles added those. Charles, any comments?
>
> One can get 0-partition RDDs (and thus 0-partition stages of either type)
> pretty easily with PartitionPruningRDD. Given that, e.g., Shark uses this
> with partition statistics, I imagine that real programs can hit the
> 0-partition stage case this way.
>
> One can also get a 0-partition RDD from sc.textFile() on an empty
> directory, and presumably some uses of hadoopFile/etc., though I won't
> claim that these are important to support.
>
> - Charles
>
> >
> >
> >
> > On Thu, Aug 22, 2013 at 10:51 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
> >
> >     We already do a quick, no-op return from DAGScheduler.runJob when there
> >     are no partitions submitted with the job, so running a job with no
> >     partitions in the usual way isn't a problem.  That still leaves at least
> >     the "zero split job" in the DAGSchedulerSuite and the possibility of
> >     shuffleMap stages with no partitions.  Is "zero split job" testing
> >     anything meaningful, or is its only purpose to cause me headaches?  Can
> >     shuffleMap stages actually have no partitions, or is this (also) a
> >     distraction posing as a legitimate problem?
> >
> >     In short, when are RDDs with no partitions real things that we actually
> >     have to deal with?
> >
> >
> >
> >     On Thu, Aug 22, 2013 at 9:20 PM, Reynold Xin <reynoldx@gmail.com> wrote:
> >
> >     > Being the guy that added the empty partition rdd, I second your idea
> >     > that we should just short-circuit those in DAGScheduler.runJob.
> >     >
> >     >
> >     >
> >     >
> >     > On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
> >     >
> >     > > So how do these get created, and are we really handling them correctly?
> >     > > What is prompting my questions is that I'm looking at making sure that
> >     > > the various data structures in the DAGScheduler shrink when appropriate
> >     > > instead of growing without bounds.  Jobs with no partitions and the
> >     > > "zero split job" test in the DAGSchedulerSuite really throw a wrench
> >     > > into the works.  That's because in the DAGScheduler we go part way
> >     > > along in handling this weird case as though it were a normal job
> >     > > submission, we start initializing or adding to various data structures,
> >     > > etc.; then we pretty much bail out in submitMissingTasks when we find
> >     > > out that there actually are no tasks to be done.  We remove the stage
> >     > > from the set of running stages, but we don't ever clean up
> >     > > pendingTasks, activeJobs, stageIdToStage, stageToInfos, and others
> >     > > because no tasks are ever submitted for the stage, so there are never
> >     > > any completion events, nor is the stage aborted -- i.e. the normal
> >     > > paths to cleanup are never taken.  The end result is that shuffleMap
> >     > > stages with no partitions (can these even occur?) never complete, and
> >     > > jobs with no partitions would seem also to persist forever.
> >     > >
> >     > > In short, RDDs with no partitions do really weird things to the
> >     > > DAGScheduler.
> >     > >
> >     > > So, if there is no way to effectively prevent the creation of RDDs with
> >     > > no partitions, is there any reason why we can't short-circuit their
> >     > > handling within the DAGScheduler so that data structures are never
> >     > > built or populated for these weird things, or must we add a bunch of
> >     > > special-case cleanup code to submitMissingStages?
> >     > >
> >     >
> >
> >
>
>
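
For readers following the thread, here is a minimal toy model of the behavior under discussion. It is a sketch in Python, not Spark's DAGScheduler code: the class ToyScheduler, its methods, and the guard_stages flag are all hypothetical, and only the field names echo the structures mentioned above (stageIdToStage, pendingTasks, the set of running stages). It assumes a job made of one shuffle map stage followed by one result stage, and it assumes that a zero-partition stage can be treated as trivially complete, which is one possible reading of the short-circuit proposal rather than a settled design.

```python
class ToyScheduler:
    """Toy stand-in for the bookkeeping described in the thread (hypothetical)."""

    def __init__(self, guard_stages=False):
        # guard_stages is an assumed switch: skip zero-partition stages entirely.
        self.guard_stages = guard_stages
        self.stage_id_to_stage = {}
        self.pending_tasks = {}
        self.running_stages = set()
        self._next_stage_id = 0

    def run_job(self, map_partitions, result_partitions):
        # Quick no-op return when the job itself has no partitions.
        if result_partitions == 0:
            return "no-op"
        for num_partitions in (map_partitions, result_partitions):
            if not self._submit_stage(num_partitions):
                return "stuck"
        return "done"

    def _submit_stage(self, num_partitions):
        if self.guard_stages and num_partitions == 0:
            # Assumed short-circuit: register nothing, treat as complete.
            return True
        stage_id = self._next_stage_id
        self._next_stage_id += 1
        self.stage_id_to_stage[stage_id] = num_partitions
        self.pending_tasks[stage_id] = set(range(num_partitions))
        self.running_stages.add(stage_id)
        if num_partitions == 0:
            # Bail-out path: no tasks, so no completion events and no cleanup
            # beyond removing the stage from the running set.
            self.running_stages.discard(stage_id)
            return False
        # Pretend every task completes, which triggers the normal cleanup.
        del self.pending_tasks[stage_id]
        del self.stage_id_to_stage[stage_id]
        self.running_stages.discard(stage_id)
        return True

    def leftover(self):
        # Number of bookkeeping entries still held after the job returns.
        return (len(self.stage_id_to_stage) + len(self.pending_tasks)
                + len(self.running_stages))
```

In this model the check in run_job covers a job whose final stage has no partitions, but a zero-partition shuffle map stage (the PartitionPruningRDD case) still leaves entries behind unless the stage-level guard is enabled. That is the gap the question about handling beyond runJob is probing.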
