spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Terry Siu <Terry....@smartfocus.com>
Subject Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Date Wed, 15 Oct 2014 16:08:00 GMT
Hi Yin,

pqt_rdt_snappy has 76 columns. These two parquet tables were created via Hive 0.12 from existing
Avro data using CREATE TABLE following by an INSERT OVERWRITE. These are partitioned tables
- pqt_rdt_snappy has one partition while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy,
I noticed that when I populated it with a single INSERT OVERWRITE over all the partitions
and then executed the Spark code, it would report an illegal index value of 29.  However,
if I manually did INSERT OVERWRITE for every single partition, I would get an illegal index
value of 21. I don’t know if this will help in debugging, but here’s the DESCRIBE output
for pqt_segcust_snappy:


OK

col_name        data_type       comment

customer_id             string                  from deserializer

age_range               string                  from deserializer

gender                  string                  from deserializer

last_tx_date            bigint                  from deserializer

last_tx_date_ts         string                  from deserializer

last_tx_date_dt         string                  from deserializer

first_tx_date           bigint                  from deserializer

first_tx_date_ts        string                  from deserializer

first_tx_date_dt        string                  from deserializer

second_tx_date          bigint                  from deserializer

second_tx_date_ts       string                  from deserializer

second_tx_date_dt       string                  from deserializer

third_tx_date           bigint                  from deserializer

third_tx_date_ts        string                  from deserializer

third_tx_date_dt        string                  from deserializer

frequency               double                  from deserializer

tx_size                 double                  from deserializer

recency                 double                  from deserializer

rfm                     double                  from deserializer

tx_count                bigint                  from deserializer

sales                   double                  from deserializer

coll_def_id             string                  None

seg_def_id              string                  None



# Partition Information

# col_name              data_type               comment



coll_def_id             string                  None

seg_def_id              string                  None

Time taken: 0.788 seconds, Fetched: 29 row(s)


As you can see, I have 21 data columns, followed by the 2 partition columns, coll_def_id and
seg_def_id. Output shows 29 rows, but that looks like it’s just counting the rows in the
console output. Let me know if you need more information.


Thanks

-Terry


From: Yin Huai <huaiyin.thu@gmail.com<mailto:huaiyin.thu@gmail.com>>
Date: Tuesday, October 14, 2014 at 6:29 PM
To: Terry Siu <terry.siu@smartfocus.com<mailto:terry.siu@smartfocus.com>>
Cc: Michael Armbrust <michael@databricks.com<mailto:michael@databricks.com>>,
"user@spark.apache.org<mailto:user@spark.apache.org>" <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,

Yin

On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <Terry.Siu@smartfocus.com<mailto:Terry.Siu@smartfocus.com>>
wrote:
Hi Michael,

That worked for me. At least I’m now further than I was. Thanks for the tip!

-Terry

From: Michael Armbrust <michael@databricks.com<mailto:michael@databricks.com>>
Date: Monday, October 13, 2014 at 5:05 PM
To: Terry Siu <terry.siu@smartfocus.com<mailto:terry.siu@smartfocus.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

There are some known bug with the parquet serde and spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause spark sql to use
built in parquet support when the serde looks like parquet.

On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu <Terry.Siu@smartfocus.com<mailto:Terry.Siu@smartfocus.com>>
wrote:
I am currently using Spark 1.1.0 that has been compiled against Hadoop 2.3. Our cluster is
CDH5.1.2 which is runs Hive 0.12. I have two external Hive tables that point to Parquet (compressed
with Snappy), which were converted over from Avro if that matters.

I am trying to perform a join with these two Hive tables, but am encountering an exception.
In a nutshell, I launch a spark shell, create my HiveContext (pointing to the correct metastore
on our cluster), and then proceed to do the following:

scala> val hc = new HiveContext(sc)

scala> val txn = hc.sql(“select * from pqt_rdt_snappy where transdate >= 1325376000000
and translate <= 1340063999999”)

scala> val segcust = hc.sql(“select * from pqt_segcust_snappy where coll_def_id=‘abcd’”)

scala> txn.registerAsTable(“segTxns”)

scala> segcust.registerAsTable(“segCusts”)

scala> val joined = hc.sql(“select t.transid, c.customer_id from segTxns t join segCusts
c on t.customerid=c.customer_id”)

Straight forward enough, but I get the following exception:


14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)

java.lang.IndexOutOfBoundsException: Index: 21, Size: 21

        at java.util.ArrayList.rangeCheck(ArrayList.java:635)

        at java.util.ArrayList.get(ArrayList.java:411)

        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)

        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)

        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)

        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)

        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)

        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)

        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)

        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

        at org.apache.spark.scheduler.Task.run(Task.scala:54)

        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


The number of columns in my table, pqt_segcust_snappy, has 21 columns and two partitions defined.
Does this error look familiar to anyone? Could my usage of SparkSQL with Hive be incorrect
or is support with Hive/Parquet/partitioning still buggy at this point in Spark 1.1.0?


Thanks,

-Terry





Mime
View raw message