spark-user mailing list archives

From Keith Chapman <keithgchap...@gmail.com>
Subject Re: Alternatives for dataframe collectAsList()
Date Tue, 04 Apr 2017 19:12:16 GMT
As Paul said, it really depends on what you want to do with your data.
Writing it to a file might be a better option than collecting it on the
driver, but again, that depends on what you plan to do with the results.
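
If the goal is only to walk over the rows on the driver, the
`toLocalIterator()` route quoted below may be enough. In the Java API,
`Dataset.toLocalIterator()` returns a plain `java.util.Iterator`, and Spark
documents it as needing about as much driver memory as the largest partition,
not the whole result. The consuming side can therefore be sketched without
Spark at all. This is a minimal sketch, not a definitive implementation: the
class name `LocalIteratorSketch`, the helper `forEachRow`, and the stand-in
list are hypothetical and only illustrate handling rows one at a time instead
of building a `List<Row>`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper class; none of these names come from Spark.
class LocalIteratorSketch {

    /**
     * Hands each row to the handler as it arrives and returns the row count.
     * Nothing is accumulated here, so this method's memory use does not
     * grow with the number of rows.
     */
    static <T> long forEachRow(Iterator<T> rows, Consumer<? super T> handler) {
        long count = 0;
        while (rows.hasNext()) {
            handler.accept(rows.next());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in data so the sketch runs without a cluster. With Spark on
        // the classpath, the iterator would instead come from something like:
        //   Iterator<Row> rows = df.toLocalIterator();
        List<String> standIn = Arrays.asList("row-1", "row-2", "row-3");

        long seen = forEachRow(standIn.iterator(), row -> System.out.println(row));
        System.out.println("rows processed: " + seen);
    }
}
```

Whether this is faster than collectAsList() depends on the job: the rows
still travel to the driver, just one partition at a time, so the main gain is
lower driver memory pressure.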

Regards,
Keith.

http://keith-chapman.com

On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern <eike.seggern@sevenval.com>
wrote:

> Hi,
>
> Depending on what you're trying to achieve, `RDD.toLocalIterator()` might
> help you.
>
> Best
>
> Eike
>
>
> 2017-03-29 21:00 GMT+02:00 szep.laszlo.it <szep.laszlo.it@gmail.com>:
>
>> Hi,
>>
>> after I created a dataset
>>
>> Dataset<Row> df = sqlContext.sql("query");
>>
>> I need the result values, so I call collectAsList():
>>
>> List<Row> list = df.collectAsList();
>>
>> But it's very slow when I work with large datasets (20-30 million
>> records). I know that the result isn't held in the driver app, and that
>> it takes a long time because collectAsList() collects all the data from
>> the worker nodes.
>>
>> So what is the right way to get the result values? Is there another
>> way to iterate over the rows of a result dataset, or to get the values?
>> Can anyone post a small, working example?
>>
>> Thanks & Regards,
>> Laszlo Szep
>>
>
