spark-user mailing list archives

From Anil Langote
Subject Spark Inner Join on pivoted datasets results empty dataset
Date Thu, 19 Oct 2017 21:01:54 GMT
Hi All,

I need to pivot multiple columns on a single column. The pivot API does not
appear to support that, so I pivot each of the two columns separately and
then try to join the results, but the join produces an empty dataset. Below
is the pseudocode.

Main dataset => 33 columns (30 string columns, 2 columns of type double
array, say Vector1 and Vector2, and 1 column Decider which holds 0 or 1)

String grouByColumns =  "col1,col2,col3,col4,col5,col6.......col30";
Vector columns : Vector1 and Vector2

I pivot like below:

List< Object > values = new ArrayList<Object>();

Dataset<Row> pivot1 =
pivot1 = pivot1.withColumnRenamed("0","Vector1_0");
pivot1 = pivot1.withColumnRenamed("1","Vector1_1");

*Count on pivot1* = 12856

Dataset<Row> pivot2 =
pivot2 = pivot2.withColumnRenamed("0","Vector2_0");
pivot2 = pivot2.withColumnRenamed("1","Vector2_1");

*Count on pivot2* = 12856

Dataset<Row> finalDataset = pivot1.join(pivot2,Seq<grouByColumns >);

*Count on finalDataset* = 0. Why? This should be 12856, right?

The same code works locally with fewer columns and 100 records.

Is there anything I am missing here? Is there a better way to pivot
multiple columns? I cannot combine them because my aggregation columns are
arrays of doubles.
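
For illustration only: one way to avoid the join altogether is to fill all
four output columns in a single pass over the parent data. As far as I know,
Spark's agg() after pivot() accepts more than one aggregate expression, which
would be the Spark-side equivalent. The sketch below is plain Java with no
Spark dependency. The class SinglePassPivot, the Row type, the sample data,
and the choice of a "first value wins" aggregation are all assumptions; the
aggregation actually used in the pseudocode above is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SinglePassPivot {
    // Hypothetical row: one group key standing in for col1..col30,
    // the Decider value (0 or 1), and the two double-array columns.
    static final class Row {
        final String groupKey;
        final int decider;
        final double[] vector1;
        final double[] vector2;

        Row(String groupKey, int decider, double[] vector1, double[] vector2) {
            this.groupKey = groupKey;
            this.decider = decider;
            this.vector1 = vector1;
            this.vector2 = vector2;
        }
    }

    // Output slots per group: [Vector1_0, Vector1_1, Vector2_0, Vector2_1].
    // Assumed aggregation: keep the first value seen per (group, Decider).
    public static Map<String, double[][]> pivot(List<Row> rows) {
        Map<String, double[][]> out = new LinkedHashMap<>();
        for (Row r : rows) {
            if (r.decider != 0 && r.decider != 1) {
                throw new IllegalArgumentException("Decider must be 0 or 1");
            }
            double[][] slots = out.computeIfAbsent(r.groupKey, k -> new double[4][]);
            if (slots[r.decider] == null) {
                slots[r.decider] = r.vector1;
            }
            if (slots[2 + r.decider] == null) {
                slots[2 + r.decider] = r.vector2;
            }
        }
        return out;
    }

    // Made-up sample data: group g1 has both Decider values, g2 only has 0.
    public static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("g1", 0, new double[] {1.0}, new double[] {10.0}));
        rows.add(new Row("g1", 1, new double[] {2.0}, new double[] {20.0}));
        rows.add(new Row("g2", 0, new double[] {3.0}, new double[] {30.0}));
        return rows;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[][]> e : pivot(sampleRows()).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.deepToString(e.getValue()));
        }
        // g1 -> [[1.0], [2.0], [10.0], [20.0]]
        // g2 -> [[3.0], null, [30.0], null]
    }
}
```

Because every group is built once from the same rows, there is no second
dataset to line up against and therefore no join key to mismatch.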

The pivot1 and pivot2 datasets are derived from the same parent dataset and
the group-by columns are the same. All I am doing is an inner join on these
two datasets using the same group-by columns. Why doesn't it work?
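
One possible explanation, offered as a guess rather than a diagnosis: under
SQL equality a NULL never equals another NULL, so if every row has a null in
at least one of the 30 group-by columns, an inner join on those columns
matches nothing. The sketch below simulates that in plain Java, with no Spark
dependency. The class NullKeyJoin and the sample keys are made up for
illustration. In Spark itself, the null-safe comparison is Column.eqNullSafe
(the <=> operator in SQL).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.BiPredicate;

final class NullKeyJoin {
    // SQL-style equality: a comparison involving null is never true.
    public static boolean sqlEquals(String a, String b) {
        return a != null && b != null && a.equals(b);
    }

    // Null-safe equality: two nulls compare as equal.
    public static boolean nullSafeEquals(String a, String b) {
        return Objects.equals(a, b);
    }

    // Two rows match only if every key column matches under the given equality.
    static boolean keysMatch(String[] left, String[] right, BiPredicate<String, String> eq) {
        for (int i = 0; i < left.length; i++) {
            if (!eq.test(left[i], right[i])) {
                return false;
            }
        }
        return true;
    }

    // Nested-loop inner join that only counts the matching row pairs.
    public static int innerJoinCount(List<String[]> left, List<String[]> right,
                                     BiPredicate<String, String> eq) {
        int count = 0;
        for (String[] l : left) {
            for (String[] r : right) {
                if (keysMatch(l, r, eq)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Made-up key rows: the third key column is null in every row.
    public static List<String[]> sampleKeys() {
        return Arrays.asList(
            new String[] {"a", "x", null},
            new String[] {"b", "y", null});
    }

    public static void main(String[] args) {
        List<String[]> keys = sampleKeys();
        // Joining the keys with themselves, as pivot1 and pivot2 share group-by columns.
        System.out.println(innerJoinCount(keys, keys, NullKeyJoin::sqlEquals));      // prints 0
        System.out.println(innerJoinCount(keys, keys, NullKeyJoin::nullSafeEquals)); // prints 2
    }
}
```

If this is the cause, it would also be consistent with the small local test
passing, assuming those 100 records happen to have no nulls in the columns
used there.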

Thank you
Anil Langote
