spark-user mailing list archives

From Dirceu Semighini Filho <>
Subject SparkR Count vs Take performance
Date Tue, 01 Mar 2016 18:03:41 GMT
Hello all.
I have a script that creates a DataFrame with this operation:

mytable <- sql(sqlContext, "SELECT ID_PRODUCT, ... FROM mytable")

rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)

I then filter this dFrame as follows:

filteredDF <- filterRDD(toRDD(dFrame), function(row) { row['COLUMN'] %in%
c("VALUES", ...) })
Now I need to know whether the resulting DataFrame is empty, and to do that
I tried these two snippets:
if(count(filteredDF) > 0)
if(length(take(filteredDF,1)) > 0)
I thought that the second one, using take, should run faster than count,
but that didn't happen.
The take operation launches one job per partition of my RDD (which has 200
partitions), and this makes it slower than count.
Is this the expected behaviour?
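For context, SparkR's RDD-level take at the time collected partitions one at a time, launching a separate job for each partition until enough rows had been gathered; when the filter leaves most partitions empty, the first surviving row may only be found after many small jobs. A rough, hypothetical sketch of that loop in plain Python (lists standing in for partitions; this is an illustration of the behaviour described above, not Spark's actual code):

```python
# Hypothetical sketch of a partition-at-a-time take loop.
# Each "job" fetches one partition; with sparse post-filter data,
# many jobs may run before the first matching row is found.

def take_one_per_partition(partitions, n):
    """Collect up to n rows, fetching one partition per 'job'."""
    collected = []
    jobs = 0
    for part in partitions:
        jobs += 1                      # one job per partition fetched
        collected.extend(part)
        if len(collected) >= n:        # stop as soon as we have n rows
            break
    return collected[:n], jobs

# Worst case for take(filteredDF, 1): 200 partitions, and only the
# last one still holds a row after the filter.
parts = [[] for _ in range(199)] + [["row"]]
rows, jobs = take_one_per_partition(parts, 1)
print(rows, jobs)  # all 200 partitions scanned before the row is found
```

In that worst case the loop runs 200 jobs, matching the one-job-per-partition behaviour described above, whereas count does the same work in a single job over all partitions.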

