spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "assaf.mendelson" <>
Subject Re: Data source V2 in spark 2.4.0
Date Thu, 04 Oct 2018 18:31:54 GMT
Thanks for the info.

I have been converting an internal data source to V2 and am now preparing it
for 2.4.0.

I have a couple of suggestions from my experience so far.

First I believe we are missing documentation on this. I am currently writing
an internal tutorial based on what I am learning, I would be happy to share
it once it gets a little better (not sure where it should go though).

The change from using Row to using InternalRow is a little confusing. 
For generic row we can do Row.fromSeq(values) where values are regular java
types (matching the schema). This even includes more complex types like
Array[String] and everything just works.

For IntrenalRow, this doesn't work for non trivial types. I figured out how
to convert strings and timestamps (hopefully I even did it correctly)  but I
couldn't figure Array[String].

Beyond the fact that I would love to learn how to do the conversion
correctly for various types (such as array), I would suggest we should add
some method to create the internal row from base types. In the 2.3.0
version, the row we got from Get would be encoded via an encoder which was
provided. I managed to get it to work by doing:

val encoder = RowEncoder.apply(schema).resolveAndBind() in the constructor
and then encoder.toRow(Row.fromSeq(values))

this simply feels a little weird to me.

Another issue that I encountered is handling bad data. In our legacy source
we have cases where a specific row is bad. What we would do in non spark
code is simply skip it.

The problem is that in spark, if we put next to be true we must have some
row for the get function. This means we always need to read records ahead to
figure out if we actually ha something or not.

Might we instead be allowed to return null from get in which case the line
would just be skipped?

Lastly I would be happy for a means to return metrics from the reading (how
many records we read, how many bad records we have). Perhaps by allowing to
use accumulators in the data source?

Sorry for the long winded message, I will probably have more as I continue
to explore this.


Sent from:

To unsubscribe e-mail:

View raw message