spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Imran Rashid <im...@quantifind.com>
Subject Re: setting partitioners with hadoop rdds
Date Tue, 28 Jan 2014 15:15:47 GMT
Thanks for the info.  Do you think this would be useful in spark itself?
add a function to RDD like "assumePartitioner(partitioner: Partitioner,
verify: Boolean)".  Where verify would run a mapPartitionsWithIndex, to
check that every record was actually in the partition it belonged in?

I'm surprised this hasn't come up before -- maybe there is a better way to
do something similar?


On Tue, Jan 28, 2014 at 12:25 AM, Matei Zaharia <matei.zaharia@gmail.com>wrote:

> Hey Imran,
>
> You probably have to create a subclass of HadoopRDD to do this, or some
> RDD that wraps around the HadoopRDD. It would be a cool feature but HDFS
> itself has no information about partitioning, so your application needs to
> track it.
>
> Matei
>
> On Jan 27, 2014, at 11:57 PM, Imran Rashid <imran@quantifind.com> wrote:
>
> > Hi,
> >
> >
> > I'm trying to figure out how to get partitioners to work correctly with
> hadoop rdds, so that I can get narrow dependencies & avoid shuffling.  I
> feel like I must be missing something obvious.
> >
> > I can create an RDD with a parititioner of my choosing, shuffle it and
> then save it out to hdfs.  But I can't figure out how to get it to still
> have that partitioner after I read it back in from hdfs.  HadoopRDD always
> has the partitioner set to None, and there isn't any way for me to change
> it.
> >
> > the reason I care is b/c if I can set the partitioner, then there would
> be a narrow dependency, so I can avoid a shuffle.  I have a big data set
> I'm saving on hdfs.  Then some time later, in a totally independent spark
> context, I read a little more data in, shuffle it w/ the same partitioner,
> and then want to join it to the previous data that was sitting on hdfs.
> >
> > I guess this can't be done in general, since you don't have any
> guarantees on the how the file was saved in hdfs.  But, it still seems like
> there ought to be a way to do this, even if I need to enforce safety at the
> application level.
> >
> > sorry if I'm missing something obvious ...
> >
> > thanks,
> > Imran
>
>

Mime
View raw message