spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jerry Lam <chiling...@gmail.com>
Subject Re: spark sql partitioned by date... read last date
Date Sun, 01 Nov 2015 21:36:14 GMT
Hi Koert,

You should be able to see if it requires scanning the whole data by
"explain" the query. The physical plan should say something about it. I
wonder if you are trying the distinct-sort-by-limit approach or the
max-date approach?

Best Regards,

Jerry


On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers <koert@tresata.com> wrote:

> it seems pretty fast, but if i have 2 partitions and 10mm records i do
> have to dedupe (distinct) 10mm records
>
> a direct way to just find out what the 2 partitions are would be much
> faster. spark knows it, but its not exposed.
>
> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> it seems to work but i am not sure if its not scanning the whole dataset.
>> let me dig into tasks a a bit
>>
>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chilinglam@gmail.com> wrote:
>>
>>> Hi Koert,
>>>
>>> If the partitioned table is implemented properly, I would think "select
>>> distinct(date) as dt from table order by dt DESC limit 1" would return the
>>> latest dates without scanning the whole dataset. I haven't try it that
>>> myself. It would be great if you can report back if this actually works or
>>> not. :)
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <koert@tresata.com> wrote:
>>>
>>>> hello all,
>>>> i am trying to get familiar with spark sql partitioning support.
>>>>
>>>> my data is partitioned by date, so like this:
>>>> data/date=2015-01-01
>>>> data/date=2015-01-02
>>>> data/date=2015-01-03
>>>> ...
>>>>
>>>> lets say i would like a batch process to read data for the latest date
>>>> only. how do i proceed?
>>>> generally the latest date will be yesterday, but it could be a day
>>>> older or maybe 2.
>>>>
>>>> i understand that i will have to do something like:
>>>> df.filter(df("date") === some_date_string_here)
>>>>
>>>> however i do now know what some_date_string_here should be. i would
>>>> like to inspect the available dates and pick the latest. is there an
>>>> efficient way to  find out what the available partitions are?
>>>>
>>>> thanks! koert
>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message