spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5387) parquet writer runs into OOM during writing when number of rows is large
Date Thu, 19 Mar 2015 12:31:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369259#comment-14369259 ]

Apache Spark commented on SPARK-5387:
-------------------------------------

User 'debugger87' has created a pull request for this issue:
https://github.com/apache/spark/pull/5089

> parquet writer runs into OOM during writing when number of rows is large
> ------------------------------------------------------------------------
>
>                 Key: SPARK-5387
>                 URL: https://issues.apache.org/jira/browse/SPARK-5387
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.1.1
>            Reporter: Shirley Wu
>
> When an RDD contains a large number of records, saveAsParquet runs into an OOM error.
> Here is the stack trace:
>     15/01/23 10:00:02 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, hdc2-s3.niara.com): java.lang.OutOfMemoryError: Java heap space
>         parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
>         parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
>         parquet.column.values.rle.RunLengthBitPackingHybridEncoder.<init>(RunLengthBitPackingHybridEncoder.java:125)
>         parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.<init>(RunLengthBitPackingHybridValuesWriter.java:36)
>         parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61)
>         parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:73)
>         parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
>         parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
>         parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:124)
>         parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:315)
>         parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:106)
>         parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:126)
>         parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:117)
>         parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>         parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>         org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:303)
>         org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>         org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:745)
> It seems the writeShard() API needs to flush to disk periodically.
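
A minimal sketch of the periodic-flush idea suggested above (not necessarily what the
linked pull request does). RecordWriterLike, openWriter and rowsPerFile are hypothetical
stand-ins for illustration, not actual Spark or Parquet APIs; the point is only to show
rolling over to a new part file every N rows so buffered column data is flushed to disk
instead of growing until the heap is exhausted:

    // Hypothetical interface standing in for a Parquet-style record writer.
    trait RecordWriterLike[T] {
      def write(record: T): Unit
      def close(): Unit   // closing flushes buffered column data to disk
    }

    // Write one partition's records, rolling over to a new part file every
    // `rowsPerFile` rows so no single writer buffers an unbounded amount of data.
    def writeWithPeriodicFlush[T](
        records: Iterator[T],
        openWriter: Int => RecordWriterLike[T],   // opens the part file for a split index
        rowsPerFile: Long = 1000000L): Unit = {
      var split = 0
      var rowsInFile = 0L
      var writer = openWriter(split)
      try {
        records.foreach { record =>
          if (rowsInFile >= rowsPerFile) {
            writer.close()          // flush the current part file
            split += 1
            writer = openWriter(split)
            rowsInFile = 0L
          }
          writer.write(record)
          rowsInFile += 1
        }
      } finally {
        writer.close()
      }
    }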



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

