spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Leif Walsh <>
Subject Re: Possible SPIP to improve matrix and vector column type support
Date Sat, 12 May 2018 22:44:51 GMT
I filed an SPIP for this at Let’s discuss!

On Wed, Apr 18, 2018 at 23:33 Leif Walsh <> wrote:

> I agree we should reuse as much as possible. For PySpark, I think the
> obvious choices of Breeze and numpy arrays already made make a lot of
> sense, I’m not sure about the other language bindings and would defer to
> others.
> I was under the impression that UDTs were gone and (probably?) not coming
> back. Did I miss something and they’re actually going to be better
> supported in the future? I think your second point (about separating
> expanding the primitives from expanding SQL support) is only really true if
> we’re getting UDTs back.
> You’ve obviously seen more of the history here than me. Do you have a
> sense of why the efforts you mentioned never went anywhere? I don’t think
> this is strictly about “mllib local”, it’s more about generic linalg, so
> 19653 feels like the closest to what I’m after, but it looks to me like
> that one just fizzled out, rather than a real back and forth.
> Does this just need something like a persistent product manager to scope
> out the effort, champion it, and push it forward?
> On Wed, Apr 18, 2018 at 20:02 Joseph Bradley <>
> wrote:
>> Thanks for the thoughts!  We've gone back and forth quite a bit about
>> local linear algebra support in Spark.  For reference, there have been some
>> discussions here:
>> Overall, I like the idea of improving linear algebra support, especially
>> given the rise of Python numerical processing & deep learning.  But some
>> considerations I'd list include:
>> * There are great linear algebra libraries out there, and it would be
>> ideal to reuse those as much as possible.
>> * SQL support for linear algebra can be a separate effort from expanding
>> linear algebra primitives.
>> * It would be valuable to discuss external types as UDTs (which can be
>> hacked with numpy and scipy types now) vs. adding linear algebra types to
>> native Spark SQL.
>> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <> wrote:
>>> Hi all,
>>> I’ve been playing around with the Vector and Matrix UDTs in and
>>> I’ve found myself wanting more.
>>> There is a minor issue in that with the arrow serialization enabled,
>>> these types don’t serialize properly in python UDF calls or in toPandas.
>>> There’s a natural representation for them in numpy.ndarray, and I’ve
>>> started a conversation with the arrow community about supporting
>>> tensor-valued columns, but that might be a ways out. In the meantime, I
>>> think we can fix this by using the FixedSizeBinary column type in arrow,
>>> together with some metadata describing the tensor shape (list of dimension
>>> sizes).
>>> The larger issue, for which I intend to submit an SPIP soon, is that
>>> these types could be better supported at the API layer, regardless of
>>> serialization. In the limit, we could consider the entire numpy ndarray
>>> surface area as a target. At the minimum, what I’m thinking is that these
>>> types should support column operations like matrix multiply, transpose,
>>> inner and outer product, etc., and maybe have a more ergonomic construction
>>> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)),
>>> VectorAssembler API is kind of clunky.
>>> One possibility here is to restrict the tensor column types such that
>>> every value must have the same shape, e.g. a 2x2 matrix. This would allow
>>> for operations to check validity before execution, for example, a matrix
>>> multiply could check dimension match and fail fast. However, there might be
>>> use cases for a column to contain variable shape tensors, I’m open to
>>> discussion here.
>>> What do you all think?
>>> --
>>> --
>>> Cheers,
>>> Leif
>> --
>> Joseph Bradley
>> Software Engineer - Machine Learning
>> Databricks, Inc.
>> [image:] <>
> --
> --
> Cheers,
> Leif

View raw message