spark-user mailing list archives

From <Paul.Baurie...@telekom.de>
Subject Parallelize Join Problem
Date Mon, 08 Apr 2019 15:41:09 GMT
Hi,
I'm struggling with a join of two large DataFrames. The join is extremely slow because it
is executed on only one worker. At the first checkpoint Spark uses all four workers, but
at the second it uses only one.
At first I thought it might be related to Spark trying to load the netlib libraries
in these stages, but I have no idea whether that has anything to do with this problem at all.
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPAC

Does anyone have a hint on where to look for the bottleneck?

taxidataFiltered
     .withColumn("time_taxi", col("time_utc").cast(DoubleType))
     .select(col("time_taxi"),
       col("x_longitude_wgs84"),
       col("y_latitude_wgs84"),
       col("imsi_hash"))
     .checkpoint()
     .join(df,
       col("time_taxi") === df.col("time")
         && taxidataFiltered.col("hash") === df.col("hash"),
       "OUTER")
     .checkpoint()
    ....

[Inline image: image001.jpg]

Thanks in advance,
Paul
