spark-dev mailing list archives

From Reynold Xin <reyno...@gmail.com>
Subject Re: RDDs with no partitions
Date Fri, 23 Aug 2013 05:57:01 GMT
I actually don't think there is any reason to have 0-partition stages,
whether they are result stages or shuffleMap stages.

It looks like Charles added those. Charles, any comments?



On Thu, Aug 22, 2013 at 10:51 PM, Mark Hamstra <mark@clearstorydata.com> wrote:

> We already do a quick, no-op return from DAGScheduler.runJob when there are
> no partitions submitted with the job, so running a job with no partitions
> in the usual way isn't a problem.  That still leaves at least the "zero
> split job" in the DAGSchedulerSuite and the possibility of shuffleMap
> stages with no partitions.  Is "zero split job" testing anything
> meaningful, or is its only purpose to cause me headaches?  Can shuffleMap
> stages actually have no partitions, or is this (also) a distraction posing
> as a legitimate problem?
>
> In short, when are RDDs with no partitions real things that we actually
> have to deal with?
>
>
>
> On Thu, Aug 22, 2013 at 9:20 PM, Reynold Xin <reynoldx@gmail.com> wrote:
>
> > Being the guy that added the empty partition rdd, I second your idea that
> > we should just short-circuit those in DAGScheduler.runJob.
> >
> >
> >
> >
> > On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
> >
> > > So how do these get created, and are we really handling them correctly?
> > > What is prompting my questions is that I'm looking at making sure that
> > > the various data structures in the DAGScheduler shrink when appropriate
> > > instead of growing without bounds.  Jobs with no partitions and the
> > > "zero split job" test in the DAGSchedulerSuite really throw a wrench
> > > into the works.  That's because in the DAGScheduler we go part way along
> > > in handling this weird case as though it were a normal job submission:
> > > we start initializing or adding to various data structures, etc.; then
> > > we pretty much bail out in submitMissingTasks when we find out that
> > > there actually are no tasks to be done.  We remove the stage from the
> > > set of running stages, but we don't ever clean up pendingTasks,
> > > activeJobs, stageIdToStage, stageToInfos, and others because no tasks
> > > are ever submitted for the stage, so there are never any completion
> > > events, nor is the stage aborted -- i.e. the normal paths to cleanup are
> > > never taken.  The end result is that shuffleMap stages with no
> > > partitions (can these even occur?) never complete, and jobs with no
> > > partitions would seem also to persist forever.
> > >
> > > In short, RDDs with no partitions do really weird things to the
> > > DAGScheduler.
> > >
> > > So, if there is no way to effectively prevent the creation of RDDs with
> > > no partitions, is there any reason why we can't short-circuit their
> > > handling within the DAGScheduler so that data structures are never built
> > > or populated for these weird things, or must we add a bunch of
> > > special-case cleanup code to submitMissingTasks?
> > >
> >
>
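
To make the leak and the proposed short-circuit in the quoted thread
concrete, here is a minimal toy model in Scala. It is not the actual
DAGScheduler code: ToyScheduler, Stage, trackedEntries, the
shortCircuitEmptyJobs flag and all method bodies are hypothetical. Only the
names runJob, submitMissingTasks, pendingTasks, activeJobs and
stageIdToStage are borrowed from the discussion, and the real scheduler is
event-driven rather than synchronous as assumed here.

```scala
import scala.collection.mutable

// Toy model only; everything here is an illustrative assumption.
final case class Stage(id: Int, numPartitions: Int)

class ToyScheduler(shortCircuitEmptyJobs: Boolean) {
  // Bookkeeping loosely named after the structures mentioned in the thread.
  val activeJobs = mutable.Set.empty[Int]
  val stageIdToStage = mutable.Map.empty[Int, Stage]
  val pendingTasks = mutable.Map.empty[Int, mutable.Set[Int]]
  val runningStages = mutable.Set.empty[Int]
  private var nextId = 0

  def runJob(numPartitions: Int): Unit = {
    if (shortCircuitEmptyJobs && numPartitions == 0) {
      // Short-circuit: return before any bookkeeping is created.
      ()
    } else {
      val id = nextId
      nextId += 1
      val stage = Stage(id, numPartitions)
      activeJobs += id
      stageIdToStage(id) = stage
      pendingTasks(id) = mutable.Set.empty[Int]
      runningStages += id
      submitMissingTasks(stage)
    }
  }

  private def submitMissingTasks(stage: Stage): Unit = {
    if (stage.numPartitions == 0) {
      // Nothing to run: the stage stops "running", but no completion event
      // will ever arrive, so the other structures keep their entries.
      runningStages -= stage.id
    } else {
      val partitions = 0 until stage.numPartitions
      partitions.foreach(p => pendingTasks(stage.id) += p)
      // Simulate every task finishing immediately.
      partitions.foreach(p => taskCompleted(stage, p))
    }
  }

  private def taskCompleted(stage: Stage, partition: Int): Unit = {
    pendingTasks(stage.id) -= partition
    if (pendingTasks(stage.id).isEmpty) {
      // The normal cleanup path, reached only through a completion event.
      runningStages -= stage.id
      pendingTasks -= stage.id
      stageIdToStage -= stage.id
      activeJobs -= stage.id
    }
  }

  // Total number of entries still held across all structures.
  def trackedEntries: Int =
    activeJobs.size + stageIdToStage.size + pendingTasks.size + runningStages.size
}
```

In this model, calling runJob(0) on new ToyScheduler(shortCircuitEmptyJobs =
false) leaves three entries behind (one each in activeJobs, stageIdToStage
and pendingTasks), while the same call with the flag set to true leaves none,
and a job with one or more partitions cleans up after itself either way.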
