spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bernard Jesop <bernard.je...@gmail.com>
Subject Re: Analysis Exception after join
Date Tue, 04 Jul 2017 15:26:07 GMT
Thank Didac,

My bad, actually this code is incomplete, it should have been : - dfAgg =
df.groupBy("S_ID").agg(...).

I want to access the aggregated values (of dfAgg) for each row of 'df',
that is why I do a left outer join.


Also, regarding the second parameter, I am using this signature of join :
(Dataset, Seq[String], String) => Dataset.

It is a sequence of the column names to use as keys for the join, some kind
of syntactic sugar for (df1("key1") === df2("key1") && df1("key2") ===
df2("key2") && ....),
except it will not duplicate the columns used as keys.

2017-07-03 12:39 GMT+02:00 Didac Gil <didacgil9dev@gmail.com>:

> With the left join, you are joining two tables.
>
> In your case, df is the left table, dfAgg is the right table.
> The second parameter should be the joining condition, right?
> For instance
>
>  dfRes = df.join(dfAgg, $”userName”===$”name", "left_outer”)
>
> having a field in df called userName, and another in dfAgg called “name”
>
> However, what’s the kind of query you want to make? dfAgg is already the
> df table that has been grouped by S_ID.
>
> I guess that you are looking for something more like the following example
> dfAgg = df.groupBy("S_ID”)
>    .agg(org.apache.spark.sql.functions.count(*“userName"*).as(
> *“usersCount**”*),
>  .agg(org.apache.spark.sql.functions.collect_set(“city")
> .as("ListofCities”)),
>  .agg(*org.apache.spark.sql.functions.max(“age").as(“oldest”))*
> )
>
> On 3 Jul 2017, at 11:55, Bernard Jesop <bernard.jesop@gmail.com> wrote:
>
> Hello, I don't understand my error message.
>
> Basically, all I am doing is :
> - dfAgg = df.groupBy("S_ID")
> - dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")
>
> However I get this AnalysisException: "
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> resolved attribute(s) S_ID#1903L missing from Dummy_ID#740,sex#37L,PERSONAL_
> STATUS#726L,W_DEP_CODE#736,W_SIZE#739L,
> POSTAL_CODE#735,COUNTRY_CODE#730,
> ID#724L,Dummy_ID_1#741,DEP_CODE#729,HOUSEHOLD_TYPE#733L,
> HOUSEHOLD_SIZE#734L,AGE#727L,W_ID#738L,H_ID#732L,AGE_TYPE#728,
> S_ID#57L,NATIONALITY#731
> in operator !Project [ID#724L, sex#37L, PERSON\
>  AL_STATUS#726L, AGE#727L, AGE_TYPE#728, DEP_CODE#729, COUNTRY_CODE#730,
> NATIONALITY#731 AS Nationality#77, H_ID#732L, HOUSEHOLD_TYPE#733L,
> HOUSEHOLD_SIZE#734L, POSTAL_CODE#735, W_DEP_CODE#736, S_ID#1903L,
> W_ID#738L, W_SIZE#739L, Dummy_ID#740, Dummy_ID_1#741];; "
>
> What I don't understand is it says S_ID#1903L is missing
> but everything seems fine on the Logical Plan.
> +- Join LeftOuter, (S_ID#57L = S_ID#1903L)
>
>    :- Project [W_ID#14L, H_ID#8L, ID#0L, sex#37L, category#97L, AGE#3L,
> AGE_TYPE#4, DEP_CODE#5, COUNTRY_CODE#6, Nationality#77, HOUSEHOLD_TYPE#9L,
> familySize#117L, POSTAL_CODE#11, W_DEP_CODE#12, S_ID#57\
>  L, workplaceSize#137L, Dummy_ID#16, Dummy_ID_1#17, Inc_period#157,
> Time_inf#1064, Time_inc#200, Health#1014, Inf_period#1039,
> infectedFamily#1355L, infectedWorker#1385L]
>
>     +- Aggregate [S_ID#1903L], [S_ID#1903L, count(1) AS
> infectedStreet#1415L]
>
> Does someone have a clue about it?
> Thanks,
>
>
>
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacgil9@gmail.com
> Spain:     +34 696 285 544 <+34%20696%2028%2055%2044>
> Sweden: +46 (0)730229737 <+46%2073%20022%2097%2037>
> Skype: didac.gil.de.la.iglesia
>
>

Mime
View raw message