spark-user mailing list archives

From Dean Arnold <renodino...@gmail.com>
Subject Inconsistent dataset behavior between file and in-memory versions
Date Thu, 12 Sep 2019 18:41:59 GMT
I have some code to recover a complex structured row from a dataset.
The row contains several ARRAY fields (mostly ArrayType(IntegerType)),
which are populated with Array[java.lang.Integer], as that seems to be
the only way the Spark row serializer will accept them.

If the dataset is written out to a file (parquet in this case) and
then read back in from the file, Row.getList() (either Scala or Java)
works fine, and I get a List. But if I simply feed the created dataset
directly into another dataset iterator, Row.getList() throws an
exception:

java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to scala.collection.Seq
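
One plausible reading of the exception above, assuming the standard Row
implementation, is that getList() casts the stored value to
scala.collection.Seq, so a row that still holds the original Java array
fails the cast while a row decoded from a file holds a Seq. A minimal
defensive sketch follows. arrayField is a hypothetical helper, not part
of the Spark API; it takes the untyped value returned by Row.get(i) and
accepts either representation:

```scala
// Hypothetical helper (not part of Spark): normalizes the untyped value
// of an ARRAY field, as returned by Row.get(i), into Option[Seq[Any]].
def arrayField(value: Any): Option[Seq[Any]] = value match {
  // A null field is reported as None rather than as an empty Seq.
  case null => None
  // Representation a Row usually holds after being decoded by Spark.
  case s: scala.collection.Seq[_] => Some(s.toSeq)
  // Representation when the Row still holds the array it was built with.
  case a: Array[_] => Some(a.toSeq)
  case other =>
    throw new IllegalArgumentException(
      s"Not an array-like value: ${other.getClass.getName}")
}
```

It would be called as arrayField(row.get(i)) in place of
row.getList(i). It only reports what the row currently holds, so it
cannot recover a null that has already been turned into an empty array
upstream.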

On top of that mess, the array fields of the row which were assigned
null show up as non-null empty arrays, yet when written out to a file
and then read back, they are actually null.

Why isn't the behavior consistent? And why isn't there a
Row.getArray()? Will any of this nonsense be fixed in 3.0?
