spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Convert raw data files to Parquet format
Date Wed, 23 Jul 2014 18:56:14 GMT
BTW, I knew this because the top line was "<console>:21".  Anytime you see
"<console>", that means the code is something you typed into the REPL.
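For illustration, a hypothetical spark-shell snippet (sc is predefined in the
shell) that fails the same way; because the lambda is compiled from REPL input
rather than a source file, the resulting stack trace shows <console> frames:

val bad = sc.parallelize(Seq("a\tb", "only-one-field"))
// The second record splits into a single element, so p(1) throws
// java.lang.ArrayIndexOutOfBoundsException: 1 from a frame like $anonfun$1.apply(<console>:N)
bad.map(_.split("\t")).map(p => (p(0), p(1))).collect()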


On Wed, Jul 23, 2014 at 11:55 AM, Michael Armbrust <michael@databricks.com>
wrote:

> Looks like a bug in your lambda function.  Some of the lines you are
> processing must have fewer than 6 elements, so doing p(5) is failing; see
> the defensive-parsing sketch after the quoted thread below.
>
>
> On Wed, Jul 23, 2014 at 11:44 AM, buntu <buntudev@gmail.com> wrote:
>
>> Thanks Michael.
>>
>> If I read in multiple files and attempt to saveAsParquetFile() I get the
>> ArrayIndexOutOfBoundsException. I don't see this if I try the same with a
>> single file:
>>
>> > case class Point(dt: String, uid: String, kw: String, tz: Int, success:
>> > Int, code: String )
>>
>> > val point = sc.textFile("data/raw_data_*").map(_.split("\t")).map(p =>
>> > Point(df.format(new Date( p(0).trim.toLong*1000L )), p(1), p(2),
>> > p(3).trim.toInt, p(4).trim.toInt ,p(5)))
>>
>> > point.saveAsParquetFile("point.parquet")
>>
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
>> details.
>> 14/07/23 11:30:54 ERROR Executor: Exception in task ID 18
>> java.lang.ArrayIndexOutOfBoundsException: 1
>>         at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>>         at $line17.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:21)
>>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>         at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
>>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>         at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:248)
>>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>>         at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:264)
>>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> Is this due to the amount of data (about 5M rows) being processed? I've
>> set
>> the SPARK_DRIVER_MEMORY to 8g.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data-files-to-Parquet-format-tp10526p10536.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
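
Regarding the fewer-than-6-fields diagnosis above, a minimal defensive-parsing
sketch. It reuses the Point case class, the df date formatter, and
java.util.Date from the quoted code, assumes tab-delimited input, and relies on
the same SQLContext implicits that made point.saveAsParquetFile work in the
original session; the output filename is made up:

val rows = sc.textFile("data/raw_data_*")
  .map(_.split("\t", -1))   // a limit of -1 keeps trailing empty fields
  .filter(_.length >= 6)    // skip short/malformed lines instead of crashing the task
  .map(p => Point(df.format(new Date(p(0).trim.toLong * 1000L)),
                  p(1), p(2), p(3).trim.toInt, p(4).trim.toInt, p(5)))
rows.saveAsParquetFile("point_clean.parquet")

Counting or logging the rows that fail the length check keeps bad input visible
instead of silently dropping it.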
