spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Niranda Perera <niranda.per...@gmail.com>
Subject taking an n number of rows from and RDD starting from an index
Date Wed, 02 Sep 2015 04:35:13 GMT
Hi all,

I have a large set of data which would not fit into the memory. So, I wan
to take n number of data from the RDD given a particular index. for an
example, take 1000 rows starting from the index 1001.

I see that there is a  take(num: Int): Array[T] method in the RDD, but it
only returns the 'first n number of rows'.

the simplest use case of this, requirement is, say, I write a custom
relation provider with a custom relation extending the InsertableRelation.

say I submit this query,
"insert into table abc select * from xyz sort by x asc"

in my custom relation, I have implemented the def insert(data: DataFrame,
overwrite: Boolean): Unit
method. here, since the data is large, I can not call methods such as
DataFrame.collect(). Instead, I could do, DataFrame.foreachpartition(...).
As you could see, the resultant DF from the "select * from xyz sort by x
asc" is sorted, and if I sun, foreachpartition on that DF and implement the
insert method, this sorted order would be affected, since the inserting
operation would be done in parallel in each partition.

in order to handle this, my initial idea was to take rows from the RDD in
batches and do the insert operation, and for that I was looking for a
method to take n number of rows starting from a given index.

is there any better way to handle this, in RDDs?

your assistance in this regard is highly appreciated.

cheers

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/

Mime
View raw message