spark-dev mailing list archives

From Joseph Bradley <>
Subject Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?
Date Thu, 23 Feb 2017 23:42:35 GMT
+1 for Nick's comment about discussing APIs which need to be made public in !

On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran <>

> On 22 Feb 2017, at 20:51, Shouheng Yi <>
> wrote:
> Hi Spark developers,
> Currently my team at Microsoft is extending Spark’s machine learning
> functionalities to include new learners and transformers. We would like
> users to use these within spark pipelines so that they can mix and match
> with existing Spark learners/transformers, and overall have a native spark
> experience. We cannot accomplish this using a non-“org.apache” namespace
> with the current implementation, and we don’t want to release code inside
> the apache namespace because it’s confusing and there could be naming
> rights issues.
> This isn't actually something the ASF has a strong stance against; it's
> more left to projects themselves. After all: the source is licensed by the
> ASF, and the license doesn't say you can't.
> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
> hive team kept stuff package private. Though that's really a sign that
> things could be improved there.
> Where it is problematic is that stack traces end up blaming the wrong
> group; nobody likes getting a bug report for code which doesn't actually
> exist in their codebase, not least because you have to waste time to even
> work that out.
> You also have to expect absolutely no stability guarantees, so you'd
> better set your nightly build to work against trunk.
> Apache Bahir does put some stuff into, but they've
> sort of inherited that right when they picked up the code from Spark. New
> stuff is going into org.apache.bahir.
> We need to extend several classes from Spark which happen to be marked
> “private[spark]”. For example, one of our classes extends VectorUDT[0],
> which is declared as private[spark] class VectorUDT. This unfortunately
> puts us in a strange scenario that forces us to work under the
> namespace org.apache.spark.
> To be specific, currently the private classes/traits we need to use to
> create new Spark learners & Transformers are HasInputCol, VectorUDT and
> Logging. We will expand this list as we develop more.
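
To make the constraint above concrete, here is a minimal sketch of how a package-qualified private modifier scopes a class. It does not use Spark at all: org.toyengine, InternalUDT, MyUDT and com.vendor.ml are hypothetical stand-ins chosen for illustration, with private[toyengine] playing the role that private[spark] plays for VectorUDT.

```scala
// Toy stand-in packages; none of these names come from Spark itself.
package org.toyengine {
  // Visible only to code inside org.toyengine and its subpackages,
  // in the same way private[spark] scopes a class to org.apache.spark.
  private[toyengine] class InternalUDT {
    def typeName: String = "vector"
  }
}

package org.toyengine.contrib {
  // Compiles: this package is nested under org.toyengine.
  class MyUDT extends org.toyengine.InternalUDT {
    override def typeName: String = "my-" + super.typeName
  }
}

package com.vendor.ml {
  // Would not compile if uncommented: InternalUDT is not accessible here.
  // class VendorUDT extends org.toyengine.InternalUDT

  object ScopeDemo {
    def main(args: Array[String]): Unit =
      println(new org.toyengine.contrib.MyUDT().typeName) // prints my-vector
  }
}
```

The commented-out VendorUDT line is the situation described above: the subclass only compiles when it is placed somewhere under the package named in the qualifier.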
> I do think it's a shame that logging went from public to private.
> One thing that could be done there is to copy the logging into Bahir,
> under an org.apache.bahir package, for yourself and others to use. That'd
> be beneficial to me too.
> For the ML stuff, that might be a place to work too, if you are going to
> open source the code.
> Is there a way to avoid this namespace issue? What do other
> people/companies do in this scenario? Thank you for your help!
> I've hit this problem in the past.  Scala code tends to force your hand
> here precisely because of that (very nice) private feature. While it offers
> the ability of a project to guarantee that implementation details aren't
> picked up where they weren't intended to be, in OSS dev, all that
> implementation is visible and for lower level integration,
> What I tend to do is keep my own code in its own package and try to build
> as thin a bridge over to it from the [private] scope as I can. It's also
> important to name things obviously, say, , so stack traces
> in bug reports can be dealt with more easily.
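
One way to read the "thin bridge" advice above is sketched below. This is an illustration under stated assumptions rather than anything taken from Spark or from the thread: org.toyengine, InternalLogging, vendorbridge, VendorLoggingBridge and com.vendor.ml are all hypothetical names. Only the small bridge trait sits inside the foreign namespace; everything else stays in the vendor's own package.

```scala
// Toy stand-in for an upstream project with a package-private trait.
package org.toyengine {
  private[toyengine] trait InternalLogging {
    def logInfo(msg: String): String = s"[INFO] $msg"
  }
}

// The only code placed in the upstream namespace. The package and trait
// names are deliberately vendor-flavoured (hypothetical) so a stack trace
// makes it obvious who owns this code.
package org.toyengine.vendorbridge {
  trait VendorLoggingBridge extends org.toyengine.InternalLogging {
    // Public entry point that forwards to the package-private trait.
    def info(msg: String): String = logInfo(msg)
  }
}

// All real logic lives in the vendor's own namespace.
package com.vendor.ml {
  class VendorTransformer
      extends org.toyengine.vendorbridge.VendorLoggingBridge {
    def transform(x: Int): String = info(s"transformed ${x * 2}")
  }

  object BridgeDemo {
    def main(args: Array[String]): Unit =
      println(new VendorTransformer().transform(21)) // prints [INFO] transformed 42
  }
}
```

The trade-off is the one raised earlier in the thread: the bridge still depends on internals that carry no stability guarantees, so it is the piece most likely to break against a new upstream build.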
> [0]:
> apache/spark/ml/linalg/VectorUDT.scala
> Best,
> Shouheng


Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

