spark-dev mailing list archives

From Tim Hunter <timhun...@databricks.com>
Subject Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?
Date Fri, 24 Feb 2017 17:08:27 GMT
Regarding logging, Graphframes makes a simple wrapper this way:

https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala
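
For illustration, here is a minimal sketch of what such a wrapper can look
like. It is not a copy of the linked file: it assumes java.util.logging as
the backend purely to stay dependency-free, and the names Logging and
MyLearner are hypothetical.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical stand-alone substitute for Spark's private Logging trait.
// Assumption: java.util.logging is used only to avoid external dependencies.
trait Logging {
  // One logger per concrete class, created lazily on first use.
  @transient private lazy val logger: Logger =
    Logger.getLogger(getClass.getName.stripSuffix("$"))

  // Messages are passed by name, so they are only built if the level is enabled.
  protected def logDebug(msg: => String): Unit =
    if (logger.isLoggable(Level.FINE)) logger.fine(msg)

  protected def logInfo(msg: => String): Unit =
    if (logger.isLoggable(Level.INFO)) logger.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (logger.isLoggable(Level.WARNING)) logger.warning(msg)
}

// Hypothetical learner living outside org.apache.spark that mixes the trait in.
class MyLearner extends Logging {
  def fit(n: Int): Int = {
    logInfo(s"fitting on $n rows")
    n * 2
  }
}
```

Because the trait lives in your own package, it does not depend on anything
marked private[spark].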

Regarding the UDTs, they have been hidden so that they can be reworked for
Datasets; the reasons are detailed in [1]. Can you describe your use case in
more detail? You may be better off copy/pasting the UDT code outside of
Spark, depending on your use case.

[1] https://issues.apache.org/jira/browse/SPARK-14155

On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley <joseph@databricks.com>
wrote:

> +1 for Nick's comment about discussing APIs which need to be made public
> in https://issues.apache.org/jira/browse/SPARK-19498 !
>
> On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran <stevel@hortonworks.com>
> wrote:
>
>>
>> On 22 Feb 2017, at 20:51, Shouheng Yi <shouyi@microsoft.com.INVALID>
>> wrote:
>>
>> Hi Spark developers,
>>
>> Currently my team at Microsoft is extending Spark’s machine learning
>> functionalities to include new learners and transformers. We would like
>> users to use these within spark pipelines so that they can mix and match
>> with existing Spark learners/transformers, and overall have a native spark
>> experience. We cannot accomplish this using a non-“org.apache” namespace
>> with the current implementation, and we don’t want to release code inside
>> the apache namespace because it’s confusing and there could be naming
>> rights issues.
>>
>>
>> This isn't actually something the ASF has a strong stance against; it's
>> left more to the projects themselves. After all, the source is licensed by
>> the ASF, and the license doesn't say you can't.
>>
>> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
>> hive team kept stuff package private. Though that's really a sign that
>> things could be improved there.
>>
>> Where it is problematic is that stack traces end up blaming the wrong
>> group; nobody likes getting a report of a bug which doesn't actually exist
>> in their codebase, not least because they have to waste time to even work
>> that out.
>>
>> You also have to expect absolutely no stability guarantees, so you'd
>> better set your nightly build to work against trunk.
>>
>> Apache Bahir does put some stuff into org.apache.spark.stream, but
>> they've sort of inherited that right when they picked up the code from
>> Spark. New stuff is going into org.apache.bahir.
>>
>>
>> We need to extend several classes from Spark which happen to be marked
>> “private[spark]”. For example, one of our classes extends VectorUDT[0],
>> which is declared as private[spark] class VectorUDT. This unfortunately
>> puts us in a strange scenario that forces us to work under the namespace
>> org.apache.spark.
>>
>> To be specific, currently the private classes/traits we need to use to
>> create new Spark learners & Transformers are HasInputCol, VectorUDT and
>> Logging. We will expand this list as we develop more.
>>
>>
>> I do think it's a shame that logging went from public to private.
>>
>> One thing that could be done there is to copy the logging into Bahir,
>> under an org.apache.bahir package, for yourself and others to use. That'd
>> be beneficial to me too.
>>
>> For the ML stuff, that might be the place to work too, if you are going to
>> open source the code.
>>
>>
>>
>> Is there a way to avoid this namespace issue? What do other
>> people/companies do in this scenario? Thank you for your help!
>>
>>
>> I've hit this problem in the past.  Scala code tends to force your hand
>> here precisely because of that (very nice) private feature. While it offers
>> the ability of a project to guarantee that implementation details aren't
>> picked up where they weren't intended to be, in OSS dev, all that
>> implementation is visible and for lower level integration,
>>
>> What I tend to do is keep my own code in its own package and try to build
>> as thin a bridge as possible over to it from the [private] scope. It's also
>> important to name things obviously, say, org.apache.spark.microsoft, so
>> stack traces in bug reports can be dealt with more easily.
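
A rough sketch of the bridge approach described above, written so that it
compiles without Spark on the classpath. The Logging trait here is only a
stand-in for a private[spark] Spark internal, not the real source, and
LoggingBridge, com.example.ml and MyTransformer are hypothetical names.

```scala
// Stand-in for a Spark-internal trait; NOT the real Spark source.
package org.apache.spark.internal {
  private[spark] trait Logging {
    protected def logInfo(msg: => String): Unit = println(s"INFO $msg")
  }
}

// The thin bridge: the only code placed under org.apache.spark, in an
// obviously named sub-package so stack traces point at the right group.
package org.apache.spark.microsoft {
  import org.apache.spark.internal.Logging

  // Public trait that re-exposes the package-private one.
  trait LoggingBridge extends Logging
}

// Everything else stays in the project's own (hypothetical) namespace.
package com.example.ml {
  import org.apache.spark.microsoft.LoggingBridge

  class MyTransformer extends LoggingBridge {
    def run(): String = {
      logInfo("running outside org.apache.spark")
      "done"
    }
  }
}
```

Only the bridge depends on the private scope, so when an internal API moves
between Spark releases the breakage is confined to that one small file.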
>>
>>
>> [0]: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala
>>
>> Best,
>> Shouheng
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>
