spark-dev mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?
Date Thu, 23 Feb 2017 10:36:06 GMT

On 22 Feb 2017, at 20:51, Shouheng Yi <shouyi@microsoft.com.INVALID> wrote:

Hi Spark developers,

Currently my team at Microsoft is extending Spark’s machine learning functionalities to
include new learners and transformers. We would like users to use these within spark pipelines
so that they can mix and match with existing Spark learners/transformers, and overall have
a native spark experience. We cannot accomplish this using a non-“org.apache” namespace
with the current implementation, and we don’t want to release code inside the apache namespace
because it’s confusing and there could be naming rights issues.

This isn't actually something the ASF has a strong stance against; it's more left to the projects themselves.
After all: the source is licensed by the ASF, and the license doesn't say you can't.

Indeed, there's a bit of org.apache.hive in the Spark codebase where the hive team kept stuff
package private. Though that's really a sign that things could be improved there.

Where it is problematic is that stack traces end up blaming the wrong group; nobody likes getting
a bug report for a bug which doesn't actually exist in their codebase, not least because you have to
waste time to even work that out.

You also have to expect absolutely no stability guarantees, so you'd better set your nightly
build to work against trunk.

Apache Bahir does put some stuff into org.apache.spark.stream, but they've sort of inherited
that right when they picked up the code from Spark. New stuff is going into org.apache.bahir.


We need to extend several classes from Spark which happen to be marked “private[spark]”. For
example, one of our classes extends VectorUDT[0], which is declared as private[spark] class VectorUDT.
This unfortunately puts us in a strange scenario that forces us to work
under the namespace org.apache.spark.

To be specific, currently the private classes/traits we need to use to create new Spark learners
& Transformers are HasInputCol, VectorUDT and Logging. We will expand this list as we
develop more.

I do think it's a shame that Logging went from public to private.

One thing that could be done there is to copy the logging into Bahir, under an org.apache.bahir
package, for yourself and others to use. That'd be beneficial to me too.

For the ML stuff, that might be a place to work too, if you are going to open source the code.



Is there a way to avoid this namespace issue? What do other people/companies do in this scenario?
Thank you for your help!

I've hit this problem in the past.  Scala code tends to force your hand here precisely because
of that (very nice) private feature. While it offers the ability of a project to guarantee
that implementation details aren't picked up where they weren't intended to be, in OSS dev,
all that implementation is visible and for lower level integration,

What I tend to do is keep my own code in its own package and try to build as thin a bridge over
to it from the [private] scope as possible. It's also important to name things obviously, say, org.apache.spark.microsoft,
so stack traces in bug reports can be dealt with more easily.
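
A minimal sketch of that thin-bridge layout, kept self-contained so it compiles without Spark on the classpath. The Logging trait here is a simplified stand-in written for this example, not Spark's real trait, and the names PublicLogging, com.example.ml and MyTransformer are hypothetical. Only the package name org.apache.spark.microsoft comes from the suggestion above.

```scala
// Stand-in for Spark's own code (an assumption for this sketch):
// a trait that is only visible inside the org.apache.spark package tree.
package org.apache.spark.internal {
  private[spark] trait Logging {
    protected def logInfo(msg: => String): Unit = println(s"INFO $msg")
  }
}

// The thin bridge. It sits under org.apache.spark so it can see
// private[spark] members, but in an obviously named sub-package so that
// stack frames from it are easy to attribute. It only re-exposes what is needed.
package org.apache.spark.microsoft {
  import org.apache.spark.internal.Logging

  trait PublicLogging extends Logging {
    def info(msg: => String): Unit = logInfo(msg)
  }
}

// The bulk of the code stays in its own namespace (hypothetical package name)
// and depends only on the bridge, never on the private trait directly.
package com.example.ml {
  import org.apache.spark.microsoft.PublicLogging

  class MyTransformer extends PublicLogging {
    def run(): String = {
      info("running")
      "done"
    }
  }
}
```

With this split, only the bridge has to track changes to the private API, which matters given the lack of stability guarantees, and anything it throws shows up under org.apache.spark.microsoft rather than in a core Spark package.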


[0]: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala

Best,
Shouheng
