spark-issues mailing list archives

From "Maciej Szymkiewicz (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
Date Wed, 02 Oct 2019 13:26:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942809#comment-16942809 ]

Maciej Szymkiewicz commented on SPARK-29212:
--------------------------------------------

[~podongfeng] That sounds about right. I would also argue that, conditional on 1., we should remove the Java-specific mixins if they serve no practical value, i.e. they provide no implementation whatsoever (like {{JavaPredictorParams}}) or no JVM-wrapper-specific implementation (like {{JavaPredictor}}).
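
For illustration, a minimal sketch of the pattern in question (class names follow pyspark.ml conventions but are simplified here, not the exact current code):

{code:python}
# Sketch only: shows the kind of empty Java* mixin this argues can be
# removed. Names are illustrative, not the exact pyspark.ml source.
from pyspark.ml.param.shared import HasFeaturesCol, HasLabelCol, HasPredictionCol


class PredictorParams(HasFeaturesCol, HasLabelCol, HasPredictionCol):
    """Pure-Python mixin carrying the shared Param definitions."""


class JavaPredictorParams(PredictorParams):
    """Adds no implementation beyond PredictorParams -- the kind of
    Java-specific mixin that can be dropped without losing anything."""
{code}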


As for the second point, there is an additional consideration: some {{Java*}} classes are considered part of the public API, and these should stay as they are (they provide crucial information to the end user). However, the deeper we go, the less useful they are (once again, conditional on 1.).

On a side note, the current approach to the ML API requires a lot of boilerplate code. Lately I've been playing with [some ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611] that wouldn't require code generation. They have some caveats, but maybe there is something there.
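
As a rough illustration of the direction (not necessarily what the gist above does), a descriptor could synthesize the per-Param getter/setter boilerplate. Everything below is hypothetical, not the pyspark.ml API:

{code:python}
class ParamDescriptor:
    """Hypothetical descriptor replacing hand-written getX/setX pairs."""

    def __init__(self, default=None, doc=""):
        self.default = default
        self.doc = doc

    def __set_name__(self, owner, name):
        # Captures the attribute name automatically, so no per-Param
        # getter/setter code has to be written or generated.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance._paramMap.get(self.name, self.default)

    def __set__(self, instance, value):
        instance._paramMap[self.name] = value


class HasPredictionCol:
    predictionCol = ParamDescriptor(default="prediction", doc="prediction column name")

    def __init__(self):
        self._paramMap = {}


m = HasPredictionCol()
m.predictionCol = "raw"  # plain attribute access, no setPredictionCol boilerplate
assert m.predictionCol == "raw"
{code}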

> Add common classes without using JVM backend
> --------------------------------------------
>
>                 Key: SPARK-29212
>                 URL: https://issues.apache.org/jira/browse/SPARK-29212
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> copied from [https://github.com/apache/spark/pull/25776].
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am a Python user and want to implement a pure Python ML classifier without providing a JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type hierarchy.
>  ** However, I have access only to the high-level classes ({{Estimator}}, {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run-time parameter validation for both user-defined (see above) and existing class hierarchies:
>  ** I am a library developer who provides functions that are meaningful only for specific categories of {{Estimators}} - here, classifiers.
>  ** I want to validate that a user-passed argument is indeed a classifier:
>  *** For built-in objects, using the "private" type hierarchy is not really satisfying (actually, what is the rationale behind making it "private"? If the goal is Scala API parity, and the Scala counterparts are public, shouldn't these be too?).
>  ** For user-defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for a classifier, on {{numClasses}} for a classification model), but that is hardly satisfying.
>  *** Provide a parallel non-abstract type hierarchy ({{Classifier}} or {{PythonClassifier}} and so on) and require users to implement such interfaces. That, however, would require separate logic for checking built-in and user-provided classes.
>  *** Provide a parallel abstract type hierarchy, register all existing built-in classes, and require users to do the same.
> Clearly these are not satisfying solutions, as they require either defensive programming or reinventing the same functionality for different 3rd-party APIs.
>  * Static type checking
>  ** I am either an end user or a library developer and want to use PEP 484 annotations to indicate components that require a classifier or a classification model.
>  ** Currently I can provide only imprecise annotations, [such as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> {code:python}
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> {code}
> or try to narrow things down using structural subtyping:
> {code:python}
> class Classifier(Protocol, Estimator[M]):
>     def setRawPredictionCol(self, value: str) -> Classifier: ...
>
> class ClassifierModel(Protocol, Model):
>     def setRawPredictionCol(self, value: str) -> Model: ...
>     def numClasses(self) -> int: ...
> {code}
>  
> Maciej's *Proposal*:
> Python ML hierarchy should reflect the Scala hierarchy first (@srowen), i.e.
> {code:python}
> class ClassifierParams: ...
>
> class Predictor(Estimator, PredictorParams):
>     def setLabelCol(self, value): ...
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
>
> class Classifier(Predictor, ClassifierParams):
>     def setRawPredictionCol(self, value): ...
>
> class PredictionModel(Model, PredictorParams):
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
>     def numFeatures(self): ...
>     def predict(self, value): ...
> {code}
> and JVM interop should extend from this hierarchy, i.e.
> {code:python}
> class JavaPredictionModel(PredictionModel): ...
> {code}
> In other words, it should be consistent with the existing approach, where we have ABCs reflecting the Scala API ({{Transformer}}, {{Estimator}}, {{Model}}, and so on), and the Java* variants are their subclasses.
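
To make the proposal above concrete, here is a minimal sketch of what such a public hierarchy would enable; class names mirror the proposal, but this is an illustration, not the final API:

{code:python}
from pyspark.ml import Estimator

# Proposed public ABC (illustrative; mirrors the hierarchy above).
class Classifier(Estimator):
    def setRawPredictionCol(self, value): ...

# Use case 1: a pure Python classifier with no JVM backend, yet
# meaningfully positioned in the type hierarchy.
class MyPythonClassifier(Classifier):
    def _fit(self, dataset):
        ...  # pure Python training logic

# Use case 2: run-time validation in library code without duck typing.
def require_classifier(estimator):
    if not isinstance(estimator, Classifier):
        raise TypeError("expected a Classifier, got %s" % type(estimator))
    return estimator
{code}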



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

