spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jaeboo Jung (JIRA)" <>
Subject [jira] [Created] (SPARK-19623) Take rows from DataFrame with empty first partition
Date Thu, 16 Feb 2017 07:45:41 GMT
Jaeboo Jung created SPARK-19623:

             Summary: Take rows from DataFrame with empty first partition
                 Key: SPARK-19623
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.2
            Reporter: Jaeboo Jung
            Priority: Minor

I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions having a empty
first partition, DataFrame and its RDD have different behaviors during taking rows from it.
If we take only 1000 rows from DataFrame, it causes OOME but RDD is OK.
In detail,
DataFrame without a empty first partition => OK
DataFrame with a empty first partition => OOME
RDD of DataFrame with a empty first partition => OK
Codes below reproduce this error.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 100000000,1000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
StructField("COL"+i,IntegerType, true)
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) Iterator[Row]()
else iter)
val df1 = sqlContext.createDataFrame(rdd,schema)
df1.take(1000) // OK
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message