spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)"
Subject [jira] [Created] (SPARK-6120) uses too much Java heap space for default spark shell settings
Date Mon, 02 Mar 2015 22:28:05 GMT
Joseph K. Bradley created SPARK-6120:

             Summary: uses too much Java heap space for default spark shell
                 Key: SPARK-6120
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.3.0
            Reporter: Joseph K. Bradley

When the Python DecisionTree example in the programming guide is run, it runs out of Java
heap space:

scala>, "myModelPath")
[Stage 12:>                                                          (0 + 8) / 8]
15/03/02 14:19:16 ERROR Executor: Exception in task 1.0 in stage 12.0 (TID 22)
java.lang.OutOfMemoryError: Java heap space
	at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(
	at parquet.bytes.CapacityByteArrayOutputStream.<init>(
	at parquet.column.values.plain.PlainValuesWriter.<init>(
	at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(
	at parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(
	at parquet.column.ParquetProperties.getValuesWriter(
	at parquet.column.impl.ColumnWriterImpl.<init>(
	at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(
	at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(
	at parquet.hadoop.InternalParquetRecordWriter.initStore(
	at parquet.hadoop.InternalParquetRecordWriter.<init>(
	at parquet.hadoop.ParquetRecordWriter.<init>(
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(
	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.executor.Executor$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$

Saving in JSON format instead of Parquet works.  The failure seems to be caused by Parquet
requiring a lot of metadata to describe the schema.

I'm labeling this a bug since it should succeed with the default spark-shell settings.  Potential
fixes are:
* increasing spark-shell default heap space settings (This is probably too hard to agree on.)
* not using Parquet for storage (This would be good for small examples but probably worse
for large models, where Parquet would be more efficient than other formats.)
* compressing the schema (The various values in the DecisionTree model could be flattened
into a single Seq of Double.  This may be the best option for now.)
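
To illustrate the last option, here is a minimal sketch in plain Python of flattening
tree nodes into a single sequence of doubles and reading them back.  The Node fields,
the WIDTH constant and the flatten/unflatten helpers are hypothetical names made up for
this example; they are not MLlib's actual classes or its save format.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical node layout for illustration only (not MLlib's Node class).
@dataclass
class Node:
    node_id: int
    prediction: float
    impurity: float
    is_leaf: bool
    feature: int      # -1 for leaves
    threshold: float  # 0.0 for leaves
    left_id: int      # -1 for leaves
    right_id: int     # -1 for leaves


WIDTH = 8  # number of doubles stored per node


def flatten(nodes: List[Node]) -> List[float]:
    """Encode every node as WIDTH consecutive doubles in one flat list."""
    out: List[float] = []
    for n in nodes:
        out.extend([
            float(n.node_id), n.prediction, n.impurity,
            1.0 if n.is_leaf else 0.0,
            float(n.feature), n.threshold,
            float(n.left_id), float(n.right_id),
        ])
    return out


def unflatten(values: List[float]) -> List[Node]:
    """Decode the flat list produced by flatten() back into nodes."""
    if len(values) % WIDTH != 0:
        raise ValueError("length must be a multiple of WIDTH")
    nodes = []
    for i in range(0, len(values), WIDTH):
        v = values[i:i + WIDTH]
        nodes.append(Node(int(v[0]), v[1], v[2], v[3] == 1.0,
                          int(v[4]), v[5], int(v[6]), int(v[7])))
    return nodes
```

With a layout like this, the stored schema would have a single column of doubles rather
than one column per field, which is the point of the proposal; whether that is enough to
avoid the error above is not verified here.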
