spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Convert raw data files to Parquet format
Date Wed, 23 Jul 2014 18:55:02 GMT
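Looks like a bug in your lambda function.  Some of the lines you are
processing must have fewer than 6 tab-separated elements, so indexing past
the end of the split array (p(5), for example) fails.  The index 1 in the
exception means at least one line splits into a single element.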
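
One way to guard against this is to validate each line before building the
case class. The following is only a minimal sketch, not a tested fix for
your data: PointParser and parse are hypothetical names, and the date
pattern is an assumption, since your message does not show how df was
defined. Malformed lines are dropped rather than failing the task.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Try

case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String)

// Hypothetical helper, not part of Spark.
object PointParser {
  // Assumed pattern; the original `df` is not shown in the thread.
  // A new formatter per call avoids sharing a non-thread-safe SimpleDateFormat.
  private def newFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  // Returns None when the line has fewer than 6 tab-separated fields
  // or when a numeric field does not parse.
  def parse(line: String): Option[Point] = {
    // A limit of -1 keeps trailing empty fields, which split("\t") would drop.
    val p = line.split("\t", -1)
    if (p.length < 6) None
    else Try {
      Point(newFormat.format(new Date(p(0).trim.toLong * 1000L)),
        p(1), p(2), p(3).trim.toInt, p(4).trim.toInt, p(5))
    }.toOption
  }
}
```

In the shell this could replace the two map calls with something like
sc.textFile("data/raw_data_*").flatMap(line => PointParser.parse(line).toList).
Counting the lines for which parse returns None first would show how many
records are malformed before deciding to discard them.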


On Wed, Jul 23, 2014 at 11:44 AM, buntu <buntudev@gmail.com> wrote:

> Thanks Michael.
>
> If I read in multiple files and call saveAsParquetFile(), I get an
> ArrayIndexOutOfBoundsException. I don't see this if I try the same with a
> single file:
>
> > case class Point(dt: String, uid: String, kw: String, tz: Int, success: Int, code: String)
>
> > val point = sc.textFile("data/raw_data_*").map(_.split("\t")).map(p =>
> >   Point(df.format(new Date(p(0).trim.toLong * 1000L)), p(1), p(2),
> >     p(3).trim.toInt, p(4).trim.toInt, p(5)))
>
> > point.saveAsParquetFile("point.parquet")
>
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> 14/07/23 11:30:54 ERROR Executor: Exception in task ID 18
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>         at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:248)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Is this due to the amount of data (about 5M rows) being processed? I've set
> SPARK_DRIVER_MEMORY to 8g.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data-files-to-Parquet-format-tp10526p10536.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
