spark-user mailing list archives

From Michael Artz <michaelea...@gmail.com>
Subject Re: Multiple filters vs multiple conditions
Date Tue, 03 Oct 2017 13:17:19 GMT
Hi Ahmed,

Depending on which version you have, it could matter.  We received an email
about multiple conditions in a filter not being picked up. I copied the
email below; it was sent out to the spark user list.  The user never tried
multiple single-condition filters, which might have worked.

Hi Spark users,

I've got an issue where I wrote a filter on a Hive table using dataframes
and despite setting:
spark.sql.hive.metastorePartitionPruning=true no partitions are being
pruned.

In short:

Doing this: table.filter("partition=x or partition=y") will result in Spark
fetching all partition metadata from the Hive metastore and doing the
filtering after fetching the partitions.

On the other hand if my filter is "simple":
table.filter("partition=x")
Spark does a call to the metastore that passes along the filter and fetches
just the ones it needs.

In our case the table has a lot of partitions, and the calls that fetch all
of them take minutes and cause us memory issues. Is this a bug, or is there
a better way of doing the filter call?

Thanks,
 Patrick
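
One possible way to try the "multiple single-condition filters" idea on
Patrick's case is sketched below. It is untested against the behaviour he
describes, and whether it actually restores pruning will depend on the Spark
version. The helper name union_of_simple_filters is hypothetical; the table
argument is assumed to behave like a Spark DataFrame (filter(str) and
union(other)), and the partition column is assumed to hold strings.

```python
from functools import reduce


def union_of_simple_filters(table, column, values):
    """Apply one single-condition filter per value, then union the results.

    Hypothetical helper. `table` is assumed to be DataFrame-like: it only
    needs filter(str) and union(other), as a Spark DataFrame has.
    """
    if not values:
        raise ValueError("need at least one partition value")
    # One simple "column = 'value'" predicate per partition value
    # (assumes a string-typed partition column).
    parts = [table.filter("{} = '{}'".format(column, v)) for v in values]
    # Combine the per-value results. Pass distinct values: a Spark
    # DataFrame union keeps duplicate rows.
    return reduce(lambda left, right: left.union(right), parts)
```

Calling union_of_simple_filters(table, "partition", ["x", "y"]) issues two
simple filters instead of one "or" filter. Checking the plan with explain()
would show whether each branch prunes partitions on your version.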

On Oct 3, 2017 9:01 AM, "ayan guha" <guha.ayan@gmail.com> wrote:

> Remember transformations are lazy, so nothing happens until you call an
> action. At that point both are the same.
>
> On Tue, Oct 3, 2017 at 11:19 PM, Femi Anthony <femibyte@gmail.com> wrote:
>
>> I would assume that the optimizer would end up transforming both to the
>> same expression.
>>
>> Femi
>>
>> Sent from my iPhone
>>
>> > On Oct 3, 2017, at 8:14 AM, Ahmed Mahmoud <don1559@gmail.com> wrote:
>> >
>> > Hi All,
>> >
>> > Just a quick question from an optimisation point of view:
>> >
>> > Approach 1:
>> > .filter(t -> t.x == 1 && t.y == 2)
>> >
>> > Approach 2:
>> > .filter(t -> t.x == 1)
>> > .filter(t -> t.y == 2)
>> >
>> > Is there a difference, is one better than the other, or are both the same?
>> >
>> > Thanks!
>> > Ahmed Mahmoud
>> >
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
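
For reference, the two approaches in Ahmed's question select the same rows.
The sketch below uses plain Python lists with made-up sample rows rather than
Spark, so it only illustrates the semantics and says nothing about how a given
Spark version plans or pushes down either form.

```python
# Made-up sample rows standing in for the dataset in the question.
rows = [
    {"x": 1, "y": 2},
    {"x": 1, "y": 3},
    {"x": 0, "y": 2},
]

# Approach 1: one filter with both conditions.
combined = [t for t in rows if t["x"] == 1 and t["y"] == 2]

# Approach 2: two filters applied one after the other.
step_one = [t for t in rows if t["x"] == 1]
chained = [t for t in step_one if t["y"] == 2]

print(combined == chained)  # prints True
```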
