spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hemant Bhanawat <>
Subject Re: taking an n number of rows from and RDD starting from an index
Date Wed, 02 Sep 2015 06:52:24 GMT
I think rdd.toLocalIterator is what you want. But it will keep one
partition's data in-memory.

On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera <>

> Hi all,
> I have a large set of data which would not fit into the memory. So, I wan
> to take n number of data from the RDD given a particular index. for an
> example, take 1000 rows starting from the index 1001.
> I see that there is a  take(num: Int): Array[T] method in the RDD, but it
> only returns the 'first n number of rows'.
> the simplest use case of this, requirement is, say, I write a custom
> relation provider with a custom relation extending the InsertableRelation.
> say I submit this query,
> "insert into table abc select * from xyz sort by x asc"
> in my custom relation, I have implemented the def insert(data: DataFrame,
> overwrite: Boolean): Unit
> method. here, since the data is large, I can not call methods such as
> DataFrame.collect(). Instead, I could do, DataFrame.foreachpartition(...).
> As you could see, the resultant DF from the "select * from xyz sort by x
> asc" is sorted, and if I sun, foreachpartition on that DF and implement the
> insert method, this sorted order would be affected, since the inserting
> operation would be done in parallel in each partition.
> in order to handle this, my initial idea was to take rows from the RDD in
> batches and do the insert operation, and for that I was looking for a
> method to take n number of rows starting from a given index.
> is there any better way to handle this, in RDDs?
> your assistance in this regard is highly appreciated.
> cheers
> --
> Niranda
> @n1r44 <>

View raw message