spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yuri Budilov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0
Date Wed, 13 Feb 2019 00:45:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766613#comment-16766613
] 

Yuri Budilov commented on SPARK-26777:
--------------------------------------

I split the above query from 1 into 3 and it worked OK on 2.4.0 EMR Spark, so I was able to
fix my production job on latest EMR Spark 2.4, which is what I am paid to do. This SQL statement
works "as is" on Presto/Athena, so it is valid SQL code on Athena and on EMR Spark 2.3.x. 

Since I posted this, I also tried using *explicit* query table *aliases* everywhere in
SQL (i.e. *_select_* *_from table_name as alias z_*, in all *sub-select* statements), it
still *failed even with explicit aliases.* 

Regarding your request: sorry, I do not and will not use Scala and I will not build a standalone
Spark cluster on-premises either - because both of these tests are totally irrelevant to me,
even if they do work OK in 2.4, it of no value to me, the end user of Spark.

Fact is - it does not work on PySpark Spark 2.4 EMR but it used to work OK on Pyspark EMR
Spark 2.3.x.

This is all I care about. It looks like a regression bug to me. If you folks don't want to
fix it, this is your choice, I do understand this is the world of free-software so I do not
expect anything.

If the problem is not a bug but a feature-by-design and if I am doing something wrong with
my SQL then some documentation would be greatly appreciated so I and future readers can learn
what not to do on Spark SQL. Perhaps delete this Jira Issue so it does not mislead anyone
else.

 

 

 

 

 

 

> SQL worked in 2.3.2 and fails in 2.4.0
> --------------------------------------
>
>                 Key: SPARK-26777
>                 URL: https://issues.apache.org/jira/browse/SPARK-26777
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Yuri Budilov
>            Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
> ------------ PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from datalake_reporting.copy_of_leads_notification)
\
> and partition_month_utc = \
>  (select max(partition_month_utc) from datalake_reporting.copy_of_leads_notification
as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from datalake_reporting.copy_of_leads_notification))
\
>  and partition_day_utc = (select max(d.partition_day_utc) from datalake_reporting.copy_of_leads_notification
as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from datalake_reporting.copy_of_leads_notification
as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from datalake_reporting.copy_of_leads_notification
as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error: (no need for data, this is syntax).
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#4495
[]
>  
> Note: all 3 columns in query are Partitioned columns - see bottom of the schema)
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, `message.id`
int, `source.properties._country` string, `message.created` string, `dcpheader.generatedmessageid`
string, `message.tags` bigint, `source.properties._enqueuedtimeutc` string, `source.properties._leadtype`
string, `message.itemid` string, `message.prospect.postcode` string, `message.prospect.email`
string, `message.referenceid` string, `message.item.year` string, `message.identifier` string,
`dcpheader.dcploadmonthutc` string, `message.processed` string, `source.properties._tenant`
string, `message.item.price` string, `message.subscription.confirmresponse` boolean, `message.itemtype`
string, `message.prospect.lastname` string, `message.subscription.insurancequote` boolean,
`source.exchangename` string, `message.prospect.identificationnumbers` bigint, `message.environment.ipaddress`
string, `dcpheader.dcploaddayutc` string, `source.properties._itemtype` string, `source.properties._requesttype`
string, `message.item.make` string, `message.prospect.firstname` string, `message.subscription.survey`
boolean, `message.prospect.homephone` string, `message.extendedproperties` bigint, `message.subscription.financequote`
boolean, `message.uniqueidentifier` string, `source.properties._id` string, `dcpheader.sourcemessageguid`
string, `message.requesttype` string, `source.routingkey` string, `message.service` string,
`message.item.model` string, `message.environment.pagesource` string, `source.source` string,
`message.sellerid` string, `partition_date_utc` string, `message.selleridentifier` string,
`message.subscription.newsletter` boolean, `dcpheader.dcploadyearutc` string, `message.leadtype`
string, `message.history` bigint, `message.callconnect.calloutcome` string, `message.callconnect.datecreatedutc`
string, `message.callconnect.callrecordingurl` string, `message.callconnect.transferoutcome`
string, `message.callconnect.hiderecording` boolean, `message.callconnect.callstartutc` string,
`message.callconnect.code` string, `message.callconnect.callduration` string, `message.fraudnetinfo`
string, `message.callconnect.answernumber` string, `message.environment.sourcedevice` string,
`message.comments` string, `message.fraudinfo.servervariables` bigint, `message.callconnect.servicenumber`
string, `message.callconnect.callid` string, `message.callconnect.voicemailurl` string, `message.item.stocknumber`
string, `message.callconnect.answerduration` string, `message.callconnect.callendutc` string,
`message.item.series` string, `message.item.detailsurl` string, `message.item.pricetype` string,
`message.item.description` string, `message.item.colour` string, `message.item.badge` string,
`message.item.odometer` string, `message.environment.requestheader` string, `message.item.registrationnumber`
string, `message.item.bodytype` string, `message.item.fueltype` string, `message.item.redbookcode`
string, `message.item.spotid` string, `message.item.id` string, `message.item.transmission`
string, `message.item.vin` string, `message.item.enginedescription` string, `message.prospect.mobilephone`
string, `message.prospect.membertrackingid` string, `message.environment.username` string,
`message.prospect.workphone` string, `message.environment.servername` string, `message.environment.sessionid`
string, `message.tradein.type` string, `message.tradein.model` string, `message.tradein.year`
string, `message.tradein.make` string, `message.tradein.kms` string, `message.fraudinfo.servertimestamp`
string, `message.prospect.suburb` string, `message.callconnect.username` string, `message.callconnect.password`
string, `message.status` string, `message.tradein.colour` string, `message.prospect.address`
string, `message.prospect.state` string, `message.tradein.detailsurl` string, `message.prospect.faxnumber`
string, `message.prospect.companyname` string, `message.prospect.title` string, `message.callconnect.calloutcometext`
string, `message.prospect.preferredcontacttime` string, `message.fraudinfo.devicedata` string,
`message.prospect.preferredcontactmethod` string, `message.assignment.assigned` string, `message.assignment.email`
string, `message.assignment.name` string, `message.country` string, `message.financepackage.id`
string, `message.financepackage.version` string)
> PARTITIONED BY ( `partition_year_utc` string, `partition_month_utc` string, `partition_day_utc`
string, `job_run_guid` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED
AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://datalake/yurib_test/leads_notification'
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message