spark-user mailing list archives

From rachmaninovquartet <>
Subject Strange behavior including memory leak and NPE
Date Tue, 19 Jul 2016 18:53:31 GMT

I've been fighting with a strange situation today. I'm trying to add two
entries for each of the distinct rows of an account, except for the first
and last (by date). Here is part of the code. I can't get past the
subsetting step:

var acctIdList = X_train.select("m_acct_id").distinct()
acctIdList = acctIdList.filter("m_acct_id is not null")

 for (id <- acctIdList) {
    println("m_acct_id = " + id.getInt(0))
    val subset = X_train.where("m_acct_id in (" + id.getInt(0).toString +

The printlns work if I remove the subsetting logic from the for loop, and
with the subsetting logic in place the loop still gets through a few
iterations before failing. I suspect that creating these DataFrames inside
the loop eats up memory too quickly, so I might need a different
implementation. This is the logic I'm trying to translate from pandas, if
that helps:

X_train = pd.concat([X_train.groupby('m_acct_id').apply(lambda x:
pd.concat([x.iloc[i: i + k] for i in range(len(x.index) - k + 1)]))])
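
To make that pandas expression concrete, here is a minimal pure-Python sketch
of the sliding-window expansion it describes. The function name
sliding_windows, the (m_acct_id, value) tuple layout, the toy rows, and k = 2
are all assumptions for illustration and are not taken from the real data.
Rows are assumed to be already ordered by date within each account:

```python
from collections import defaultdict


def sliding_windows(rows, k):
    """Expand rows into overlapping length-k windows, per account.

    rows: iterable of (m_acct_id, value) tuples (hypothetical layout),
          assumed already ordered by date within each account.
    k:    window length (hypothetical; the original post does not state it).
    """
    # Group rows by account id, preserving the incoming row order.
    by_acct = defaultdict(list)
    for acct_id, value in rows:
        by_acct[acct_id].append((acct_id, value))

    out = []
    # Visit accounts in sorted key order.
    for acct_id in sorted(by_acct):
        group = by_acct[acct_id]
        # Mirrors: [x.iloc[i: i + k] for i in range(len(x.index) - k + 1)]
        # Groups with fewer than k rows yield no windows in this sketch.
        for i in range(len(group) - k + 1):
            out.extend(group[i:i + k])
    return out


toy = [(1, "a"), (1, "b"), (1, "c"), (2, "x"), (2, "y")]
result = sliding_windows(toy, k=2)
```

With k = 2, every row except the first and last of an account appears twice
in the result, which is the duplication the loop above is trying to
reproduce.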

Here is the top of the stack trace. I tried both Spark 1.5.2 and 1.6.2:

16/07/19 14:39:37 ERROR Executor: Managed memory leak detected; size =
33816576 bytes, TID = 1908
16/07/19 14:39:37 ERROR Executor: Exception in task 1.0 in stage 96.0 (TID
	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
	at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:755)
	at org.apache.spark.sql.DataFrame.where(DataFrame.scala:792)

Any advice on how to keep moving would be much appreciated!


