spark-user mailing list archives

From "Chandra Mohan, Ananda Vel Murugan" <Ananda.Muru...@honeywell.com>
Subject Spark sql error while writing Parquet file- Trying to write more fields than contained in row
Date Mon, 18 May 2015 10:29:54 GMT
Hi,

I am using spark-sql to read a CSV file and write it out as a Parquet file. I am building the schema
using the following code.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.MetadataBuilder;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;
    // Build a three-column schema (a, b, c), all DoubleType and nullable.
    String schemaString = "a b c";
    List<StructField> fields = new ArrayList<StructField>();
    MetadataBuilder mb = new MetadataBuilder();
    mb.putBoolean("nullable", true);
    Metadata m = mb.build();
    for (String fieldName : schemaString.split(" ")) {
        fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
    }
    StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input CSV do not contain three columns. After building my JavaRDD<Row>,
I create a DataFrame from the RDD and the schema as shown below.

DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
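For reference, the rowRDD above is built from the raw CSV lines roughly along these lines (a
simplified sketch rather than my exact code; the input path is a placeholder and sc is the
JavaSparkContext):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;

    JavaRDD<String> lines = sc.textFile("/home/anand/input.csv");
    JavaRDD<Row> rowRDD = lines.map(new Function<String, Row>() {
        @Override
        public Row call(String line) {
            String[] parts = line.split(",");
            Object[] values = new Object[parts.length];
            for (int i = 0; i < parts.length; i++) {
                values[i] = Double.parseDouble(parts[i]);
            }
            // A line with only two columns produces a Row with two fields,
            // even though the schema declares three.
            return RowFactory.create(values);
        }
    });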

Finally, I try to save it as a Parquet file:

darDataFrame.saveAsParquetFile("/home/anand/output.parquet");

I get this error when saving it as a Parquet file:

java.lang.IndexOutOfBoundsException: Trying to write more fields than contained in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not contain three
elements, because some rows in my input CSV do not contain three columns. But while building the
schema, I am specifying every field as nullable, so I believe it should not throw this error.
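To illustrate what I mean, I would expect both of the rows below to be accepted, since the third
field is nullable, but only the first one actually writes (a minimal, hypothetical example, not my
real data):

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;

    // Hypothetical rows against the three-field schema (a, b, c) above.
    // An explicit null in the missing position writes fine:
    Row padded = RowFactory.create(1.0, 2.0, null);
    // A row that simply has fewer fields than the schema is what triggers
    // "Trying to write more fields than contained in row (3 > 2)":
    Row shortRow = RowFactory.create(1.0, 2.0);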
Can anyone help me fix this error? Thank you.

Regards,
Anand.C


