While moving the new data source API to InternalRow, I noticed a few odd things:
- Spark scans always produce UnsafeRow, but that data is passed around as InternalRow with explicit casts.
- Operators expect InternalRow, and nearly all codegen works with InternalRow (I've tested this with quite a few queries).
- Spark uses unchecked casts to UnsafeRow in places, assuming that data will be unsafe, even though that isn't what the type system guarantees.
To me, it looks like the idea was to code SQL operators against the abstract InternalRow so the implementation could be swapped out, but we ended up with a general assumption that rows will always be unsafe. This is the worst of both options: we can't actually rely on everything working with InternalRow, yet code must still use it, right up until that becomes inconvenient and an unchecked cast gets inserted.
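Schematically, the problem looks like the sketch below. The names here are illustrative stand-ins, not the actual Spark classes: an operator is declared against the abstract row type, but it immediately downcasts to one concrete implementation, so any other implementation that is perfectly valid per the interface fails at runtime.

```java
// Illustrative sketch only: simplified stand-ins for InternalRow / UnsafeRow.
interface Row {
    int getInt(int ordinal);
}

// Stand-in for the "unsafe" implementation that scans happen to produce.
class UnsafeLikeRow implements Row {
    private final int[] values;
    UnsafeLikeRow(int[] values) { this.values = values; }
    public int getInt(int ordinal) { return values[ordinal]; }
}

// A different, equally valid implementation of the same interface.
class GenericLikeRow implements Row {
    private final int[] values;
    GenericLikeRow(int[] values) { this.values = values; }
    public int getInt(int ordinal) { return values[ordinal]; }
}

public class CastDemo {
    // Declared against the abstract type, but silently assumes one implementation.
    static int firstColumn(Row row) {
        UnsafeLikeRow unsafe = (UnsafeLikeRow) row; // unchecked assumption
        return unsafe.getInt(0);
    }

    public static void main(String[] args) {
        // Works as long as the assumption holds...
        System.out.println(firstColumn(new UnsafeLikeRow(new int[]{42})));
        try {
            // ...but a different Row implementation blows up at runtime,
            // even though the type system says it should be accepted.
            firstColumn(new GenericLikeRow(new int[]{7}));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");
        }
    }
}
```

The compiler can't help here: the cast moves the contract from the type system into a runtime assumption about which implementation is in flight.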
The main question I want to answer is this: what data format should SQL use internally? What was the intent when building Catalyst?
The v2 data source API depends on the answer, but I also found that this introduces a significant performance penalty in Parquet (and probably in other formats). A quick check on one of our tables showed a 6% performance hit caused by unnecessary copies to UnsafeRow. So if we can guarantee that all operators support InternalRow, then there is an easy performance win that also simplifies the v2 data source API.
rb

--
Ryan Blue
Software Engineer
Netflix