spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Navin Viswanath <>
Subject Custom file datasource implementations
Date Sun, 22 Nov 2020 18:36:13 GMT
I’m looking into upgrading our Spark version from 2.4 to 3.0 and noticed that there was
an API change that seems unavoidable. I wanted to check whether our implementation is the
ideal way to do this and if this change is necessary.
We have files on HDFS in thrift format and implement custom datasources in V1 by implementing
a PartitionFile => InternalRow conversion(based on the FileFormat <>
trait). Here is how we do that currently:

Given a thrift type T,
val schema: StructType = ... // infer thrift schema for an object of type T
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema) // converts a thrift object
to a GenericRow

Once we have a GenericRow and an ExpressionEncoder, we used to do the following to produce
an InternalRow:

val internalRow: InternalRow = encoder.toRow(genericRow)

With the change introduced in

we now need:

val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow)

I have a couple of questions about this:
1. Do we have to make this change in our sources going from 2.4 to 3.0? Since this change
is marked as internal in the PR, I was wondering if this is not the ideal way to implement
2. Does moving to Datasource V2 avoid this change? From looking at the V2 API, it looks like
I would still need to generate an InternalRow and an ExpressionEncoder appears to be the only
way to do this, which implies I would still need this change.

My goal is to try and keep our source code the same going from 2.4 to 3.0 if possible.

Any help is appreciated!

View raw message