spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jakub Dubovsky <spark.dubovsky.ja...@gmail.com>
Subject Re: Why there is no top method in dataset api
Date Tue, 13 Sep 2016 14:08:36 GMT
Thanks Sean,

the important part of your answer for me is that orderBy + limit is doing
only "partial sort" because of optimizer. That's what I was missing. I will
give it a try...

J.D.

On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen <sowen@cloudera.com> wrote:

> ​No, ​
> I'm not advising you to use .rdd, just saying it is possible.
> ​Although I'd only use RDDs if you had a good reason to, given Datasets
> now, they are not gone or even deprecated.​
>
> You do not need to order the whole data set to get the top eleme
> ​nt. That isn't what top does though. You might be interested to look at
> the source code. Nor is it what orderBy does if the optimizer is any good.
>
> ​Computing .rdd doesn't materialize an RDD. It involves some non-zero
> overhead in creating a plan, which should be minor compared to execution.
> So would any computation of "top N" on a Dataset, so I don't think this is
> relevant.
>
>
> ​orderBy + take is already the way to accomplish "Dataset.top". It works
> on Datasets, and therefore DataFrames too, for the reason you give. I'm not
> sure what you're asking there.
>
>
> On Mon, Sep 5, 2016, 13:01 Jakub Dubovsky <spark.dubovsky.jakub@gmail.com>
> wrote:
>
>> Thanks Sean,
>>
>> I was under impression that spark creators are trying to persuade user
>> community not to use RDD api directly. Spark summit I attended was full of
>> this. So I am a bit surprised that I hear use-rdd-api as an advice from
>> you. But if this is a way then I have a second question. For conversion
>> from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy
>> val it suggests there is some computation going on to create rdd as a copy.
>> The question is how much computationally expansive is this conversion? If
>> there is a significant overhead then it is clear why one would want to have
>> top method directly on Dataset class.
>>
>> Ordering whole dataset only to take first 10 or so top records is not
>> really an acceptable option for us. Comparison function can be expansive
>> and the size of dataset is (unsurprisingly) big.
>>
>> To be honest I do not really understand what do you mean by b). Since
>> DataFrame is now only an alias for Dataset[Row] what do you mean by
>> "DataFrame-like counterpart"?
>>
>> Thanks
>>
>> On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> You can always call .rdd.top(n) of course. Although it's slightly
>>> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
>>> easier way.
>>>
>>> I don't think if there's a strong reason other than it wasn't worth it
>>> to write this and many other utility wrappers that a) already exist on
>>> the underlying RDD API if you want them, and b) have a DataFrame-like
>>> counterpart already that doesn't really need wrapping in a different
>>> API.
>>>
>>> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky
>>> <spark.dubovsky.jakub@gmail.com> wrote:
>>> > Hey all,
>>> >
>>> > in RDD api there is very usefull method called top. It finds top n
>>> records
>>> > in according to certain ordering without sorting all records. Very
>>> usefull!
>>> >
>>> > There is no top method nor similar functionality in Dataset api. Has
>>> anybody
>>> > any clue why? Is there any specific reason for this?
>>> >
>>> > Any thoughts?
>>> >
>>> > thanks
>>> >
>>> > Jakub D.
>>>
>>
>>

Mime
View raw message