spark-user mailing list archives

From "Cheng, Hao" <hao.ch...@intel.com>
Subject RE: [SQL] Self join with ArrayType columns problems
Date Tue, 27 Jan 2015 08:55:33 GMT
The root cause is probably that identical “exprId” values exist on the “AttributeReference”
instances when doing a self-join with a “temp table” (temp table = resolved logical plan).
I will fix the bug and create a JIRA.
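
To illustrate the suspected cause, here is a toy Python model. It is not Spark's
Catalyst code; the names Attr, new_attr, resolve_by_id and fresh_copy are hypothetical
and exist only for this sketch. It assumes attributes are looked up by a numeric id,
and shows how reusing one resolved plan on both sides of a self-join makes that lookup
ambiguous, while issuing fresh ids for one side removes the ambiguity.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical global id generator, standing in for an expression-id counter.
_next_id = count(1)


@dataclass(frozen=True)
class Attr:
    """Toy stand-in for a resolved attribute: a column name plus a unique id."""
    name: str
    expr_id: int


def new_attr(name):
    """Create an attribute with a fresh id."""
    return Attr(name, next(_next_id))


def resolve_by_id(expr_id, output):
    """Return every position in the joined output whose attribute has expr_id."""
    return [i for i, attr in enumerate(output) if attr.expr_id == expr_id]


def fresh_copy(plan):
    """Re-issue ids for one side of a join so the two sides can be told apart."""
    return [new_attr(attr.name) for attr in plan]


# A "temp table" that is already resolved: its attributes carry fixed ids.
table = [new_attr("id"), new_attr("values")]

# Self-join that reuses the same resolved plan on both sides:
# the id of "values" now matches two positions in the joined output.
naive_join = table + table
ambiguous = resolve_by_id(table[1].expr_id, naive_join)

# Self-join where the right side gets fresh ids: each id matches one position.
right = fresh_copy(table)
fixed_join = table + right
unique_left = resolve_by_id(table[1].expr_id, fixed_join)
unique_right = resolve_by_id(right[1].expr_id, fixed_join)

print(ambiguous, unique_left, unique_right)
```

In this model the naive join cannot tell the left "values" column from the right one,
which is the kind of mix-up that could produce wrong join results.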

Cheng Hao

From: Michael Armbrust [mailto:michael@databricks.com]
Sent: Tuesday, January 27, 2015 12:05 AM
To: Dean Wampler
Cc: Pierre B; user@spark.apache.org; Cheng Hao
Subject: Re: [SQL] Self join with ArrayType columns problems

It seems likely that there is some sort of bug related to the reuse of array objects that
are returned by UDFs.  Can you open a JIRA?

I'll also note that the sql method on HiveContext does run HiveQL (configured by spark.sql.dialect),
and the hql method has been deprecated since 1.1 (and will probably be removed in 1.3).  The
errors with SQLContext are probably because array and collect_set are Hive UDFs and thus not
available in a SQLContext.

On Mon, Jan 26, 2015 at 5:44 AM, Dean Wampler <deanwampler@gmail.com> wrote:
You are creating a HiveContext, then using the sql method instead of hql. Is that deliberate?

The code doesn't work if you replace HiveContext with SQLContext. Lots of exceptions are thrown,
but I don't have time to investigate now.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly) <http://shop.oreilly.com/product/0636920033073.do>
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, Jan 26, 2015 at 7:17 AM, Pierre B <pierre.borckmans@realimpactanalytics.com> wrote:
Using Spark 1.2.0, we are seeing some weird behaviour when performing a self-join
on a table with an ArrayType field (potential bug?).

I have set up a minimal non-working example here:
https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
In a nutshell, if the ArrayType column used for the pivot is created
manually in the StructType definition, everything works as expected.
However, if the ArrayType pivot column is obtained from a SQL query (be it by
using an "array" wrapper, or a collect_list operator, for instance),
then the results are completely off.

Could anyone have a look? This really is a blocking issue.

Thanks!

Cheers

P.



