spark-dev mailing list archives

From Patrick Woody <patrick.woo...@gmail.com>
Subject Re: Lazy casting with Catalyst
Date Sat, 28 Mar 2015 18:37:41 GMT
So it looks like this was actually a combination of using out-of-date
artifacts and my needing to debug further. Ripping the logic out
and testing it in spark-shell works fine, so it is likely something upstream
in my application that causes it to take the whole Row.

Thanks!
-Pat





On Sat, Mar 28, 2015 at 12:34 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

>
> On 3/29/15 12:26 AM, Patrick Woody wrote:
>
>  Hey Cheng,
>
>  I didn't mean that Catalyst casting was eager, just that my approaches
> thus far seem to have been. Maybe I should give a concrete example?
>
> I have columns A, B, C where B is saved as a String but I'd like all
> references to B to go through a Cast to decimal regardless of the code used
> on the SchemaRDD. So if someone does a min(B) it uses Decimal ordering
> instead of String.
>
>  One approach that I had taken was to do a select of everything with the
> casts on certain columns, but then when I did a count(literal(1)) on top of
> that RDD it seemed to bring in the whole row.
>
> What version of Spark SQL are you using? Would you mind providing a brief
> snippet that reproduces this issue? This might be a bug, depending on
> your concrete usage. Thanks in advance!
>
>
>  Thanks!
> -Pat
>
> On Sat, Mar 28, 2015 at 11:35 AM, Cheng Lian <lian.cs.zju@gmail.com>
> wrote:
>
>> Hi Pat,
>>
>> I don't understand what "lazy casting" means here. Why do you think
>> current Catalyst casting is "eager"? Casting happens at runtime, and
>> doesn't disable column pruning.
>>
>> Cheng
>>
>>
>> On 3/28/15 11:26 PM, Patrick Woody wrote:
>>
>>> Hi all,
>>>
>>> In my application, we take input from Parquet files where BigDecimals are
>>> written as Strings to maintain arbitrary precision.
>>>
>>> I was hoping to convert these back over to Decimal with Unlimited
>>> precision, but I'd still like to maintain the Parquet column pruning (all
>>> my attempts thus far seem to bring in the whole Row). Is it possible to do
>>> this lazily through catalyst?
>>>
>>> Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
>>> referenced. Any tips on how to approach this would be appreciated.
>>>
>>> Thanks!
>>> -Pat
>>>
>>>
>>
>
>
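
The ordering concern raised in the thread (a min(B) over a column stored as
String versus the same column cast to decimal) can be illustrated with a
small sketch. This is plain Python using only the standard library, not
Spark or Catalyst code, and the sample values for B are hypothetical; it
only shows why the two orderings can disagree.

```python
from decimal import Decimal

# Hypothetical values for column B, stored as strings as described in the thread.
b_as_strings = ["9.5", "10.25", "100"]

# String ordering compares character by character, so "10.25" sorts before "9.5".
min_as_string = min(b_as_strings)

# Converting each value to Decimal first gives numeric ordering instead.
min_as_decimal = min(Decimal(s) for s in b_as_strings)

print(min_as_string)
print(min_as_decimal)
```

The two results differ, which is why the thread asks for every reference to B
to go through a cast rather than relying on the stored String type.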
