spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <ko...@tresata.com>
Subject Re: spark sql partitioned by date... read last date
Date Sun, 01 Nov 2015 21:39:08 GMT
i was going for the distinct approach, since i want it to be general enough
to also solve other related problems later. the max-date is likely to be
faster though.

On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chilinglam@gmail.com> wrote:

> Hi Koert,
>
> You should be able to see if it requires scanning the whole data by
> "explain" the query. The physical plan should say something about it. I
> wonder if you are trying the distinct-sort-by-limit approach or the
> max-date approach?
>
> Best Regards,
>
> Jerry
>
>
> On Sun, Nov 1, 2015 at 4:25 PM, Koert Kuipers <koert@tresata.com> wrote:
>
>> it seems pretty fast, but if i have 2 partitions and 10mm records i do
>> have to dedupe (distinct) 10mm records
>>
>> a direct way to just find out what the 2 partitions are would be much
>> faster. spark knows it, but its not exposed.
>>
>> On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <koert@tresata.com> wrote:
>>
>>> it seems to work but i am not sure if its not scanning the whole
>>> dataset. let me dig into tasks a a bit
>>>
>>> On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chilinglam@gmail.com> wrote:
>>>
>>>> Hi Koert,
>>>>
>>>> If the partitioned table is implemented properly, I would think "select
>>>> distinct(date) as dt from table order by dt DESC limit 1" would return the
>>>> latest dates without scanning the whole dataset. I haven't try it that
>>>> myself. It would be great if you can report back if this actually works or
>>>> not. :)
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>>
>>>> On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <koert@tresata.com>
>>>> wrote:
>>>>
>>>>> hello all,
>>>>> i am trying to get familiar with spark sql partitioning support.
>>>>>
>>>>> my data is partitioned by date, so like this:
>>>>> data/date=2015-01-01
>>>>> data/date=2015-01-02
>>>>> data/date=2015-01-03
>>>>> ...
>>>>>
>>>>> lets say i would like a batch process to read data for the latest date
>>>>> only. how do i proceed?
>>>>> generally the latest date will be yesterday, but it could be a day
>>>>> older or maybe 2.
>>>>>
>>>>> i understand that i will have to do something like:
>>>>> df.filter(df("date") === some_date_string_here)
>>>>>
>>>>> however i do now know what some_date_string_here should be. i would
>>>>> like to inspect the available dates and pick the latest. is there an
>>>>> efficient way to  find out what the available partitions are?
>>>>>
>>>>> thanks! koert
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message