spark-dev mailing list archives

From Andrew Vykhodtsev <yoz...@gmail.com>
Subject Dataframes filter by count fails with python API
Date Mon, 29 Jun 2015 06:57:20 GMT
Dear developers,

I found the following behaviour that I think is a minor bug.

If I apply groupBy and count in the Python API, the resulting DataFrame has
the grouping columns plus a field named "count". Filtering on that field with
a string expression fails because the parser treats "count" as a keyword:

x = sc.parallelize(zip(xrange(1000),xrange(1000)))
df = sqlContext.createDataFrame(x)

df.groupBy("_1").count().printSchema()

root
 |-- _1: long (nullable = true)
 |-- count: long (nullable = false)


df.groupBy("_1").count().filter("count > 1")

gives

: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

count > 1
      ^
	at scala.sys.package$.error(package.scala:27)



The following syntax works:

f = df.groupBy("_1").count()
n = f.filter(f["count"] > 1)

In Scala, referring to the $"count" column works as well.

Please let me know if I should file a JIRA for this.
