spark-user mailing list archives

From <Saif.A.Ell...@wellsfargo.com>
Subject RE: Parquet without hadoop: Possible?
Date Tue, 11 Aug 2015 15:01:23 GMT
Sorry, I provided bad information. This example worked fine with reduced parallelism.

It seems my problem has to do with something specific to the real data frame at the point where
it is read.
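
For anyone hitting the same thing, here is a minimal sketch of what "reduced parallelism" means in this case (the partition count of 4 below is arbitrary, just an illustration):

scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF
scala> data.coalesce(4).write.parquet("/var/data/Saif/pq")   // fewer partitions => fewer concurrent Parquet writers, less heap pressure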

Saif


From: Saif.A.Ellafi@wellsfargo.com [mailto:Saif.A.Ellafi@wellsfargo.com]
Sent: Tuesday, August 11, 2015 11:49 AM
To: deanwampler@gmail.com
Cc: user@spark.apache.org
Subject: RE: Parquet without hadoop: Possible?

I am launching my spark-shell
spark-1.4.1-bin-hadoop2.6/bin/spark-shell

15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF
scala> data.write.parquet("/var/data/Saif/pq")

Then I get a million errors:
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
        at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
        at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
        at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:68)
        at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:48)
        at parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
        at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
        at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
        at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
        at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
        at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
        at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
        at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
        at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
        at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
...
...
.
15/08/11 09:46:10 ERROR DefaultWriterContainer: Task attempt attempt_201508110946_0000_m_000011_0 aborted.
15/08/11 09:46:10 ERROR Executor: Exception in task 31.0 in stage 0.0 (TID 31)
org.apache.spark.SparkException: Task failed while writing rows.
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:191)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
...
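
Side note, in case it helps others: in local mode the write tasks all run inside the driver JVM, and each open Parquet writer allocates per-column buffers, so high parallelism with the default heap can run out of memory even on tiny data. One sketch of a workaround (the 4g value is only an example) is to launch the shell with more driver memory:

spark-1.4.1-bin-hadoop2.6/bin/spark-shell --driver-memory 4g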



From: Dean Wampler [mailto:deanwampler@gmail.com]
Sent: Tuesday, August 11, 2015 11:39 AM
To: Ellafi, Saif A.
Cc: user@spark.apache.org
Subject: Re: Parquet without hadoop: Possible?

It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala
 (Spark 1.4.X)
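
A quick self-contained sketch of the same idea (not the script itself; the path and column name below are made up), which writes to the plain local filesystem with no Hadoop installation:

scala> val df = sc.parallelize(1 to 100).toDF("n")
scala> df.write.parquet("file:///tmp/parquet-demo")              // plain local path, no HDFS needed
scala> sqlContext.read.parquet("file:///tmp/parquet-demo").show()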

What does "I am failing to do so" mean?

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe<http://typesafe.com>
@deanwampler<http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, Aug 11, 2015 at 9:28 AM, <Saif.A.Ellafi@wellsfargo.com> wrote:
Hi all,

I don’t have any Hadoop FS installed in my environment, but I would like to store DataFrames
in Parquet files. I am failing to do so; if this is possible, does anyone have any pointers?

Thank you,
Saif

