Thanks for the thoughts!  We've gone back and forth quite a bit about local linear algebra support in Spark.  For reference, there have been some discussions here:

Overall, I like the idea of improving linear algebra support, especially given the rise of Python numerical processing & deep learning.  But some considerations I'd list include:
* There are great linear algebra libraries out there, and it would be ideal to reuse those as much as possible.
* SQL support for linear algebra can be a separate effort from expanding linear algebra primitives.
* It would be valuable to discuss external types as UDTs (which can be hacked with numpy and scipy types now) vs. adding linear algebra types to native Spark SQL.

On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <> wrote:
Hi all,

I’ve been playing around with the Vector and Matrix UDTs in and I’ve found myself wanting more. 

There is a minor issue in that with the arrow serialization enabled, these types don’t serialize properly in python UDF calls or in toPandas. There’s a natural representation for them in numpy.ndarray, and I’ve started a conversation with the arrow community about supporting tensor-valued columns, but that might be a ways out. In the meantime, I think we can fix this by using the FixedSizeBinary column type in arrow, together with some metadata describing the tensor shape (list of dimension sizes). 

The larger issue, for which I intend to submit an SPIP soon, is that these types could be better supported at the API layer, regardless of serialization. In the limit, we could consider the entire numpy ndarray surface area as a target. At the minimum, what I’m thinking is that these types should support column operations like matrix multiply, transpose, inner and outer product, etc., and maybe have a more ergonomic construction API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the VectorAssembler API is kind of clunky. 

One possibility here is to restrict the tensor column types such that every value must have the same shape, e.g. a 2x2 matrix. This would allow for operations to check validity before execution, for example, a matrix multiply could check dimension match and fail fast. However, there might be use cases for a column to contain variable shape tensors, I’m open to discussion here. 

What do you all think? 


Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.