spark-dev mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: RDDs with no partitions
Date Fri, 23 Aug 2013 05:51:34 GMT
We already do a quick, no-op return from DAGScheduler.runJob when there are
no partitions submitted with the job, so running a job with no partitions
in the usual way isn't a problem.  That still leaves at least the "zero
split job" in the DAGSchedulerSuite and the possibility of shuffleMap
stages with no partitions.  Is "zero split job" testing anything
meaningful, or is its only purpose to cause me headaches?  Can shuffleMap
stages actually have no partitions, or is this (also) a distraction posing
as a legitimate problem?

In short, when are RDDs with no partitions real things that we actually
have to deal with?
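
To make the two behaviors in this thread concrete, here is a toy Python model. It is not Spark code, just a sketch under stated assumptions: `ToyScheduler`, `short_circuit`, `leaked_entries`, and the snake_case fields are hypothetical names loosely modeled on the structures mentioned in the thread, and one stage per job is assumed for simplicity. It shows an early return for empty jobs versus the partial bookkeeping that is never cleaned up when the bail-out happens later.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for a job scheduler. Hypothetical; not Spark's DAGScheduler."""

    short_circuit: bool = True
    next_id: int = 0
    # Bookkeeping loosely named after structures mentioned in the thread.
    active_jobs: set = field(default_factory=set)
    stage_id_to_stage: dict = field(default_factory=dict)
    pending_tasks: dict = field(default_factory=dict)
    running_stages: set = field(default_factory=set)

    def run_job(self, partitions):
        # Quick no-op return: nothing is registered for an empty job.
        if self.short_circuit and len(partitions) == 0:
            return []
        job_id = self.next_id
        self.next_id += 1
        # Assumption: one stage per job, so the job id doubles as the stage id.
        self.active_jobs.add(job_id)
        self.stage_id_to_stage[job_id] = list(partitions)
        self.pending_tasks[job_id] = set()
        self.running_stages.add(job_id)
        return self._submit_missing_tasks(job_id)

    def _submit_missing_tasks(self, stage_id):
        tasks = self.stage_id_to_stage[stage_id]
        if not tasks:
            # Late bail-out: only running_stages is tidied up. No task ever
            # runs, so the completion path below is never reached.
            self.running_stages.discard(stage_id)
            return []
        results = [p * 2 for p in tasks]  # pretend each task doubles its input
        self._on_stage_completed(stage_id)
        return results

    def _on_stage_completed(self, stage_id):
        # The normal cleanup path, reached only when tasks actually ran.
        self.active_jobs.discard(stage_id)
        self.stage_id_to_stage.pop(stage_id, None)
        self.pending_tasks.pop(stage_id, None)
        self.running_stages.discard(stage_id)

    def leaked_entries(self):
        # Total number of bookkeeping entries still held by the scheduler.
        return (len(self.active_jobs) + len(self.stage_id_to_stage)
                + len(self.pending_tasks) + len(self.running_stages))
```

In this model, `ToyScheduler(short_circuit=False).run_job([])` leaves entries behind in `active_jobs`, `stage_id_to_stage`, and `pending_tasks`, while the default short-circuiting scheduler leaves none. Whether the same early return is enough for shuffleMap stages is exactly the open question above; the sketch does not model them.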



On Thu, Aug 22, 2013 at 9:20 PM, Reynold Xin <reynoldx@gmail.com> wrote:

> Being the guy that added the empty partition rdd, I second your idea that
> we should just short-circuit those in DAGScheduler.runJob.
>
> On Thu, Aug 22, 2013 at 8:26 PM, Mark Hamstra <mark@clearstorydata.com> wrote:
>
> > So how do these get created, and are we really handling them correctly?
> > What is prompting my questions is that I'm looking at making sure that the
> > various data structures in the DAGScheduler shrink when appropriate instead
> > of growing without bounds.  Jobs with no partitions and the "zero split
> > job" test in the DAGSchedulerSuite really throw a wrench into the works.
> > That's because in the DAGScheduler we go part way along in handling this
> > weird case as though it were a normal job submission: we start initializing
> > or adding to various data structures, etc.; then we pretty much bail out in
> > submitMissingTasks when we find out that there actually are no tasks to be
> > done.  We remove the stage from the set of running stages, but we don't
> > ever clean up pendingTasks, activeJobs, stageIdToStage, stageToInfos, and
> > others because no tasks are ever submitted for the stage, so there are
> > never any completion events, nor is the stage aborted -- i.e. the normal
> > paths to cleanup are never taken.  The end result is that shuffleMap stages
> > with no partitions (can these even occur?) never complete, and jobs with
> > no partitions would seem also to persist forever.
> >
> > In short, RDDs with no partitions do really weird things to the
> > DAGScheduler.
> >
> > So, if there is no way to effectively prevent the creation of RDDs with no
> > partitions, is there any reason why we can't short-circuit their handling
> > within the DAGScheduler so that data structures are never built or
> > populated for these weird things, or must we add a bunch of special-case
> > cleanup code to submitMissingTasks?
> >
>
