spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Subject Re: taking an n number of rows from and RDD starting from an index
Date Wed, 02 Sep 2015 07:03:01 GMT
Hi,

Maybe you could use zipWithIndex and filter to skip the first elements. For
example starting from

scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect
res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3),
(104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11),
(112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18),
(119,19), (120,20))

we can get the 3 first elements starting from the 4th (counting from 0) as

scala> sc.parallelize(100 to 120, 4).zipWithIndex.filter(_._2 >=4).take(3)
res14: Array[(Int, Long)] = Array((104,4), (105,5), (106,6))

Hope that helps


2015-09-02 8:52 GMT+02:00 Hemant Bhanawat <hemant9379@gmail.com>:

> I think rdd.toLocalIterator is what you want. But it will keep one
> partition's data in-memory.
>
> On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera <niranda.perera@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a large set of data which would not fit into the memory. So, I wan
>> to take n number of data from the RDD given a particular index. for an
>> example, take 1000 rows starting from the index 1001.
>>
>> I see that there is a  take(num: Int): Array[T] method in the RDD, but it
>> only returns the 'first n number of rows'.
>>
>> the simplest use case of this, requirement is, say, I write a custom
>> relation provider with a custom relation extending the InsertableRelation.
>>
>> say I submit this query,
>> "insert into table abc select * from xyz sort by x asc"
>>
>> in my custom relation, I have implemented the def insert(data: DataFrame,
>> overwrite: Boolean): Unit
>> method. here, since the data is large, I can not call methods such as
>> DataFrame.collect(). Instead, I could do, DataFrame.foreachpartition(...).
>> As you could see, the resultant DF from the "select * from xyz sort by x
>> asc" is sorted, and if I sun, foreachpartition on that DF and implement the
>> insert method, this sorted order would be affected, since the inserting
>> operation would be done in parallel in each partition.
>>
>> in order to handle this, my initial idea was to take rows from the RDD in
>> batches and do the insert operation, and for that I was looking for a
>> method to take n number of rows starting from a given index.
>>
>> is there any better way to handle this, in RDDs?
>>
>> your assistance in this regard is highly appreciated.
>>
>> cheers
>>
>> --
>> Niranda
>> @n1r44 <https://twitter.com/N1R44>
>> https://pythagoreanscript.wordpress.com/
>>
>
>

Mime
View raw message