spark-user mailing list archives

From "lucas.gary@gmail.com" <lucas.g...@gmail.com>
Subject Re: Alternatives for dataframe collectAsList()
Date Wed, 05 Apr 2017 04:55:19 GMT
As Keith said, it depends on what you want to do with your data.

From a pipelining perspective, the general flow (YMMV) is:

Load dataset(s) -> Transform and/or Join -> Aggregate -> Write dataset

Each step in the pipeline does something distinct with the data.

The end step is usually loading the final data into something that can
display / query it (e.g. a DBMS of some sort).

That's where you'd start doing your queries etc.

IMO there's generally no good reason to collect your data on the driver,
except for testing / validation / exploratory work.
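
The load -> transform -> aggregate -> write flow can be sketched as follows. This is only a minimal illustration, and it uses plain java.util.stream as a stand-in for Spark so that it runs without a cluster; it is not Spark code. The Order type, the sample rows, and the CSV output file are all assumptions invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java stand-in for the pipeline shape; NOT Spark code.
final class PipelineSketch {

    // Hypothetical row type, invented for this illustration.
    static final class Order {
        private final String country;
        private final double amount;

        Order(String country, double amount) {
            this.country = country;
            this.amount = amount;
        }

        String country() { return country; }
        double amount() { return amount; }
    }

    // Load: a fixed in-memory list stands in for reading a real dataset.
    static List<Order> load() {
        return Arrays.asList(
            new Order("DE", 10.0),
            new Order("HU", 5.0),
            new Order("DE", 2.5),
            new Order("HU", -1.0)); // invalid row, dropped in the transform step
    }

    // Transform + aggregate: drop non-positive amounts, then sum per country.
    // TreeMap keeps the keys sorted so the written output is deterministic.
    static Map<String, Double> totalsByCountry(List<Order> orders) {
        return orders.stream()
            .filter(o -> o.amount() > 0)
            .collect(Collectors.groupingBy(
                Order::country,
                TreeMap::new,
                Collectors.summingDouble(Order::amount)));
    }

    // Write: persist the (small) aggregated result, one "country,total" line each.
    static Path write(Map<String, Double> totals, Path out) throws IOException {
        List<String> lines = totals.entrySet().stream()
            .map(e -> e.getKey() + "," + e.getValue())
            .collect(Collectors.toList());
        return Files.write(out, lines);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("totals", ".csv");
        write(totalsByCountry(load()), out);
        System.out.println(Files.readAllLines(out));
    }
}
```

The point of the shape is that only the aggregated result is written out at the end; the full set of input rows is never gathered into one list for later querying.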

I hope that helps.

Gary Lucas

On 4 April 2017 at 12:12, Keith Chapman <keithgchapman@gmail.com> wrote:

> As Paul said, it really depends on what you want to do with your data.
> Perhaps writing it to a file would be a better option, but again, it depends
> on what you want to do with the data you collect.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern <
> eike.seggern@sevenval.com> wrote:
>
>> Hi,
>>
>> Depending on what you're trying to achieve, `RDD.toLocalIterator()` might
>> help you.
>>
>> Best
>>
>> Eike
>>
>>
>> 2017-03-29 21:00 GMT+02:00 szep.laszlo.it <szep.laszlo.it@gmail.com>:
>>
>>> Hi,
>>>
>>> after I created a dataset
>>>
>>> Dataset<Row> df = sqlContext.sql("query");
>>>
>>> I need to get the result values, so I call collectAsList():
>>>
>>> List<Row> list = df.collectAsList();
>>>
>>> But it's very slow when I work with large datasets (20-30 million
>>> records). I know that the result isn't present in the driver app, and
>>> that's why it takes a long time: collectAsList() collects all the data
>>> from the worker nodes.
>>>
>>> But then what is the right way to get the result values? Is there another
>>> way to iterate over the rows of a result dataset, or to get values? Can
>>> anyone post a small, working example?
>>>
>>> Thanks & Regards,
>>> Laszlo Szep
>>>
>>
>
