spark-user mailing list archives

From Chris Miller <cmiller11...@gmail.com>
Subject Re: Spark schema evolution
Date Tue, 22 Mar 2016 14:32:35 GMT
With Avro you solve this by using a default value for the new field...
maybe Parquet is the same?
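
To make the Avro idea concrete, here is a minimal sketch in plain Python. It does not use the Avro library. It only mimics the schema-resolution rule that a field missing from the written record is filled from the reader schema's default. The record name OperationalEvent, the helper resolve(), and the sample values are made up for illustration; only the column names come from the query below. Whether Parquet with mergeSchema behaves the same way is not something this sketch shows.

```python
import json

# Hypothetical reader schema: resp_party_type is the field added later,
# declared with a default so records written without it still resolve.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "OperationalEvent",
  "fields": [
    {"name": "category", "type": "string"},
    {"name": "resp_party", "type": "string"},
    {"name": "resp_party_type", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve(record, reader_schema):
    """Mimic Avro's rule for a field the writer lacked: use the reader's default."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


# Made-up sample rows: one written before the column existed, one after.
old_record = {"category": "Outage", "resp_party": "Ops"}
new_record = {"category": "Outage", "resp_party": "Ops", "resp_party_type": "Team"}

old_resolved = resolve(old_record, READER_SCHEMA)
new_resolved = resolve(new_record, READER_SCHEMA)
print(old_resolved)
print(new_resolved)
```

With a default in place, a predicate such as resp_party_type = 'Team' has a value to compare against for every record, including those written before the field was added.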


--
Chris Miller

On Tue, Mar 22, 2016 at 9:34 PM, gtinside <gtinside@gmail.com> wrote:

> Hi,
>
> I have a table sourced from *2 parquet files*, one of which has a few extra
> columns. Simple * queries work fine, but queries with a predicate on an
> extra column don't work and I get a "column not found" error.
>
> *Column resp_party_type exists in just one parquet file*
>
> a) Query that works:
> select resp_party_type  from operational_analytics
>
> b) Query that doesn't work (complains about missing column
> *resp_party_type*):
> select category as Events, resp_party as Team, count(*) as Total from
> operational_analytics where application = 'PeopleMover' and resp_party_type
> = 'Team' group by category, resp_party
>
> *Query Plan for (b)*
> == Physical Plan ==
> TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Final,isDistinct=false)], output=[Events#36266,Team#36267,Total#36268L])
>  TungstenExchange hashpartitioning(category#30986,resp_party#31006)
>   TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Partial,isDistinct=false)], output=[category#30986,resp_party#31006,currentCount#36272L])
>    Project [category#30986,resp_party#31006]
>     Filter ((application#30983 = PeopleMover) && (resp_party_type#31007 = Team))
>      Scan ParquetRelation[snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_peoplemover.parquet,snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_mis.parquet][category#30986,resp_party#31006,application#30983,resp_party_type#31007]
>
>
> I have set spark.sql.parquet.mergeSchema = true and
> spark.sql.parquet.filterPushdown = true. When I set
> spark.sql.parquet.filterPushdown = false, query (b) starts working.
> The execution plan for query (b) after setting filterPushdown = false:
>
> == Physical Plan ==
> TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Final,isDistinct=false)], output=[Events#36313,Team#36314,Total#36315L])
>  TungstenExchange hashpartitioning(category#30986,resp_party#31006)
>   TungstenAggregate(key=[category#30986,resp_party#31006], functions=[(count(1),mode=Partial,isDistinct=false)], output=[category#30986,resp_party#31006,currentCount#36319L])
>    Project [category#30986,resp_party#31006]
>     Filter ((application#30983 = PeopleMover) && (resp_party_type#31007 = Team))
>      Scan ParquetRelation[snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_peoplemover.parquet,snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_mis.parquet][category#30986,resp_party#31006,application#30983,resp_party_type#31007]
>