spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: setting partitioners with hadoop rdds
Date Tue, 28 Jan 2014 08:25:22 GMT
Hey Imran,

You'd probably have to create a subclass of HadoopRDD to do this, or an RDD that wraps the
HadoopRDD. It would be a cool feature, but HDFS itself has no information about partitioning,
so your application needs to track it.
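
Something along these lines might work (an untested sketch; the class name is made up, and
your application has to guarantee the data really is laid out the way the partitioner claims):

import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: declares a partitioner for an existing RDD without
// moving any data. The application has to guarantee that every record is
// already in the partition that part.getPartition(key) would assign, e.g.
// because the file was written with this same partitioner and comes back as
// exactly one input split per original partition.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  require(part.numPartitions == prev.partitions.length,
    "partitioner must match the wrapped RDD's partition count")

  // The whole point: downstream ops (join, cogroup, ...) now see a known
  // partitioner and can use narrow dependencies instead of shuffling.
  override val partitioner: Option[Partitioner] = Some(part)

  // Reuse the parent's partitions and data one-to-one.
  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}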

Matei

On Jan 27, 2014, at 11:57 PM, Imran Rashid <imran@quantifind.com> wrote:

> Hi,
> 
> 
> I'm trying to figure out how to get partitioners to work correctly with Hadoop RDDs, so
> that I can get narrow dependencies and avoid shuffling. I feel like I must be missing
> something obvious.
> 
> I can create an RDD with a partitioner of my choosing, shuffle it, and then save it out
> to HDFS. But I can't figure out how to get it to still have that partitioner after I read
> it back in from HDFS. HadoopRDD always has its partitioner set to None, and there isn't
> any way for me to change it.
> 
> The reason I care is that if I can set the partitioner, there would be a narrow
> dependency, so I could avoid a shuffle. I have a big data set I'm saving on HDFS. Then
> some time later, in a totally independent Spark context, I read a little more data in,
> shuffle it with the same partitioner, and then want to join it to the previous data that
> was sitting on HDFS.
> 
> I guess this can't be done in general, since you don't have any guarantees about how the
> file was saved in HDFS. But it still seems like there ought to be a way to do this, even
> if I need to enforce safety at the application level.
> 
> sorry if I'm missing something obvious ...
> 
> thanks,
> Imran
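
For what it's worth, here's a hypothetical end-to-end sketch of the workflow described above,
using the wrapper from the reply, a HashPartitioner with 128 partitions, and made-up HDFS
paths. It only helps if the saved files come back as exactly one input split per original
partition, with every record where the partitioner says it should be; nothing in Spark or
HDFS verifies that for you:

import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions and Writable converters

object NarrowJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "narrow-join-sketch")

    // Same partitioner (and partition count) that was used when the big data
    // set was originally shuffled and saved.
    val part = new HashPartitioner(128)

    // The previously saved data, read back from HDFS (made-up path), then
    // wrapped to re-declare its partitioner without a shuffle.
    val saved = sc.sequenceFile[String, Int]("hdfs:///data/big")
    val savedPartitioned = new AssumePartitionedRDD(saved, part)

    // The newly arrived data, shuffled with the same partitioner.
    val fresh = sc.sequenceFile[String, Int]("hdfs:///data/incoming").partitionBy(part)

    // Both sides now report the same partitioner, so the join can use narrow
    // dependencies; only the small side was shuffled.
    val joined = savedPartitioned.join(fresh)
    println(joined.count())

    sc.stop()
  }
}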

