systemml-dev mailing list archives

From 281165...@qq.com
Subject Refactor ML code logic to reduce duplicate codes
Date Sat, 23 Apr 2016 07:59:42 GMT
Hi,
I found a lot of duplicated code across different packages. For example, the readScript logic
exists in several places. DMLScript.java also contains many static methods, which does not seem
like good design. To me the whole ML process feels more like a stream or workflow. Would it make
sense to create a DSL such as:
MLStreams.from(input).withCompiler().withParser().execute().to(output)?
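
As a rough illustration of the fluent style proposed above, here is a minimal Java sketch. Every name in it (MLStreams, Result, the function-typed stages) is hypothetical and not part of SystemML, and the parser and compiler stages are stand-ins that only transform strings. It assumes the parser should always run before the compiler, regardless of the order in which the stages are configured.

```java
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical fluent pipeline; not an existing SystemML class. */
final class MLStreams {
    private final String source;
    // Stages default to no-ops so a pipeline can be built incrementally.
    private Function<String, String> parser = Function.identity();
    private Function<String, String> compiler = Function.identity();

    private MLStreams(String source) {
        this.source = source;
    }

    /** Starts a pipeline from the given script text. */
    static MLStreams from(String source) {
        return new MLStreams(source);
    }

    MLStreams withParser(Function<String, String> parser) {
        this.parser = parser;
        return this;
    }

    MLStreams withCompiler(Function<String, String> compiler) {
        this.compiler = compiler;
        return this;
    }

    /** Runs the parser first, then the compiler, whatever the call order was. */
    Result execute() {
        return new Result(compiler.apply(parser.apply(source)));
    }

    /** Holds the pipeline output until it is handed to a sink. */
    static final class Result {
        private final String value;

        Result(String value) {
            this.value = value;
        }

        void to(Consumer<String> sink) {
            sink.accept(value);
        }
    }

    public static void main(String[] args) {
        MLStreams.from("print(1)")
                .withCompiler(s -> "compiled(" + s + ")")
                .withParser(s -> "parsed(" + s + ")")
                .execute()
                .to(System.out::println);
    }
}
```

Keeping each stage as a plain function would let the script-reading logic live in one place and be shared by every entry point, instead of being repeated in static methods.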




------------------ Original ------------------
From:  "Deron Eriksson";<deroneriksson@gmail.com>;
Date:  Tue, Apr 12, 2016 01:45 AM
To:  "dev"<dev@systemml.incubator.apache.org>; 

Subject:  Re: Fw: Updating documentation for notebook



Hi Niketan,

I think a separate section for Notebooks is a great idea since, as you
point out, they are hidden under the MLContext section. Also, I really like
the idea of making it as easy as possible for a new user to try out
SystemML in a Notebook. Very good points.

Tutorials for all the algorithms using real-world data would be fantastic.
I would also like to see single-line algorithm invocations (possibly
with generated data) that can be copy/pasted and work with no
modifications needed by the user. This would probably mean either including
small sets of example data in the project, or allowing the reading of data
from URLs.

It would be nice to take something like these 5 commands:
---
$ wget
https://raw.githubusercontent.com/apache/incubator-systemml/master/scripts/datagen/genRandData4Univariate.dml
$ $SPARK_HOME/bin/spark-submit $SYSTEMML_HOME/SystemML.jar -f
genRandData4Univariate.dml -exec hybrid_spark -args 1000000 100 10 1 2 3 4
uni.mtx
$ echo '1' > uni-types.csv
$ echo '{"rows": 1, "cols": 1, "format": "csv"}' > uni-types.csv.mtd
$ $SPARK_HOME/bin/spark-submit $SYSTEMML_HOME/SystemML.jar -f
$SYSTEMML_HOME/algorithms/Univar-Stats.dml -exec hybrid_spark -nvargs
X=uni.mtx TYPES=uni-types.csv STATS=uni-stats.txt
---
and reduce this to 1 command (in the documentation) that the user can
copy/paste and the algorithm runs without any additional work needed by the
user:
---
$ $SPARK_HOME/bin/spark-submit $SYSTEMML_HOME/SystemML.jar -f
$SYSTEMML_HOME/algorithms/Univar-Stats.dml -exec hybrid_spark -nvargs X=
http://www.example.com/uni.mtx TYPES=http://www.example.com/uni-types.csv
STATS=uni-stats.txt
---
If we had this for each of the main algorithms, this would give the users
working examples to start with, which is easier than trying to figure out
this kind of thing by reading the comments in the DML algorithm files.
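
One way the URL idea could work, sketched in Java under stated assumptions: the helper below is hypothetical (RemoteInputs and localize are not SystemML names) and simply downloads an argument to a working directory when it carries a URL scheme, leaving plain local paths untouched. It is only an illustration of the pre-fetch approach, not a description of how SystemML reads its inputs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of SystemML. */
final class RemoteInputs {
    private RemoteInputs() {
    }

    /**
     * Returns a local path for the argument. If the argument has a URL
     * scheme (http, https, file, ...), its content is first copied into
     * workDir under the last segment of the URL path.
     */
    static Path localize(String arg, Path workDir) throws IOException {
        URI uri = URI.create(arg);
        if (uri.getScheme() == null) {
            // No scheme: treat the argument as an ordinary local path.
            return Paths.get(arg);
        }
        String fileName = Paths.get(uri.getPath()).getFileName().toString();
        Path target = workDir.resolve(fileName);
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

With something like this, X=http://www.example.com/uni.mtx could be resolved to a local uni.mtx before the script runs, while X=uni.mtx would pass through unchanged.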

Deron


On Fri, Apr 8, 2016 at 4:51 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi all,
>
> As per Luciano's suggestion, I have created a PR with bluemix/datascientist
> tutorial and have flagged it with "Please DONOT push this PR until the
> discussion on dev mailing list is complete." :)
>
> Also, I apologize for the incorrect indentation in my last email. Here is another
> attempt:
> - How do you want to try SystemML?
> --+ Notebook on cloud
> ----* Bluemix
> ------ + Zeppelin
> ----------- Using Python Kernel
> ------------ + Learn how to write DML program--(something along the lines
> of
> http://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html
> )
> ------------ + Try out pre-packaged algorithms on real-world dataset
> -------------- * Linear Regression
> -------------- * GLM
> -------------- * ALS
> -------------- * ...
> ------------ + Learn how to pass RDD/DataFrame to SystemML
> ------------ + Learn how to use SystemML as MLPipeline
> estimator/transformer
> ------------ + Learn how to use SystemML with existing Python packages
> ----------- Using Scala Kernel
> ------------ + ... similar to Python kernel
> ----------- Using DML Kernel
> ------------ + Learn how to write DML program
> ------ + Jupyter
> --------- Using Python Kernel
> --------- Using Scala Kernel
> --------- Using DML Kernel
> ----* Data scientist's work bench
> ----* Databricks cloud
> ----* ...
> --+ Notebook on laptop/cluster
> ----* Zeppelin
> ----* Jupyter
> --+ Laptop
> ----* Run SystemML as Standalone jar:
> http://apache.github.io/incubator-systemml/quick-start-guide.html
> ----* Embed SystemML into other Java program:
> http://apache.github.io/incubator-systemml/jmlc.html
> ----* Debug a DML script:
> http://apache.github.io/incubator-systemml/debugger-guide.html
> ----* Spark local mode
> --+ Spark Cluster
> ----* Batch invocation
> ----* Using Spark REPL
> ------+ Learn how to pass RDD/DataFrame to SystemML
> ------+ Learn how to use SystemML as MLPipeline estimator/transformer
> ----* Using PySpark REPL
> ------+ Learn how to pass RDD/DataFrame to SystemML
> ------+ Learn how to use SystemML as MLPipeline estimator/transformer
> --+ Hadoop Cluster
> --+ Spark Cluster on EC2
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> ----- Forwarded by Niketan Pansare/Almaden/IBM on 04/08/2016 04:48 PM
> -----
>
>
>
> *Fw: Updating documentation for notebook*
>
> *Niketan Pansare *
> to:
> dev
> 04/08/2016 01:11 PM
>
>
>
>
> From:
> Niketan Pansare/Almaden/IBM
>
>
>
>
> To:
> dev <dev@systemml.incubator.apache.org>
>
> Hi all,
>
> Here are a few suggestions to get things started:
> 1. Have a "Quick Start" (or "Get Started") button beside "Get SystemML"
> on http://systemml.apache.org/.
>
> 2. The user can then go through the following questionnaire/bulleted list, which
> points people to the appropriate link:
> - How do you want to try SystemML?
> + Notebook on cloud
> * Bluemix
> + Zeppelin
> - Using Python Kernel
> + Learn how to write DML program (something along the lines of
> http://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html
> )
> + Try out pre-packaged algorithms on real-world dataset
> * Linear Regression
> * GLM
> * ALS
> * ...
> + Learn how to pass RDD/DataFrame to SystemML (for example:
> http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html
> )
> + Learn how to use SystemML as MLPipeline estimator/transformer
> + Learn how to use SystemML with existing Python packages
> - Using Scala Kernel
> + ... similar to Python kernel
> - Using DML Kernel
> + Learn how to write DML program
> + Jupyter
> - Using Python Kernel
> - Using Scala Kernel
> - Using DML Kernel
> * Data scientist's work bench
> * Databricks cloud
> * ...
>
> + Notebook on laptop/cluster
> * Zeppelin using docker images (for example:
> http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#zeppelin-notebook-example---linear-regression-algorithm
> )
> * Jupyter (for example:
> http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization
> )
>
> + Laptop
> * Run SystemML as Standalone jar:
> http://apache.github.io/incubator-systemml/quick-start-guide.html
> * Embed SystemML into other Java program:
> http://apache.github.io/incubator-systemml/jmlc.html
> * Debug a DML script:
> http://apache.github.io/incubator-systemml/debugger-guide.html
> * Spark local mode
>
> + Spark Cluster
> * Batch invocation
> * Using Spark REPL
> + Learn how to pass RDD/DataFrame to SystemML
> + Learn how to use SystemML as MLPipeline estimator/transformer
> * Using PySpark REPL
> + Learn how to pass RDD/DataFrame to SystemML
> + Learn how to use SystemML as MLPipeline estimator/transformer
>
> + Hadoop Cluster
> + Spark Cluster on EC2
>
> 3. Add links to SystemML presentations:
> https://www.youtube.com/watch?v=n3JJP6UbH6Q
> https://www.youtube.com/watch?v=6VpiJK8Jydw
> https://www.youtube.com/watch?v=PV-5pZboo4A
> https://www.youtube.com/watch?v=7Zrc5EzOTjg
> https://www.youtube.com/watch?v=3T32lweGxOA
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> ----- Forwarded by Niketan Pansare/Almaden/IBM on 04/08/2016 01:03 PM -----
>
>
>
> *Re: Updating documentation for notebook*
>
> *Niketan Pansare *
> to:
> dev
> 04/08/2016 10:47 AM
>
> *Please respond to dev*
>
> Thanks Abhishek. I am glad it was helpful :)
>
> Luciano: I agree with you about having a central place for documentation.
> Before cleaning up the tutorial and putting it into our documentation, I
> wanted to:
> 1. Have a discussion about which setup we should use to introduce
> SystemML: command-line standalone, command-line spark/pyspark REPL
> (yarn/standalone), command-line hadoop, scala/python notebook (online
> notebook or require user to setup jupyter/zeppelin).
> 2. Encourage other contributors to come up with intellectually stimulating
> tutorials using real-world datasets and our existing DML algorithms. This
> means creating JIRAs that people can work on. My repository is only a POC
> to facilitate discussion and will be deleted after that.
> 3. If we do decide to go with online notebook based tutorial, have a
> discussion on how to structure the tutorial:
> - so as to support a variety of hosting sites (bluemix / datascientist
> workbench / databricks cloud / azureml / aws / ...).
> - Python or Scala as primary language.
> - Jupyter or Zeppelin as primary notebook.
> - DML kernel or MLContext-based or JMLC-based example.
> - Any standard tutorial (or textbook) we should use as example for
> choosing the dataset.
> - Whether the emphasis should be on learning DML or on building larger
> data pipeline (for example: our MLPipeline-wrapper).
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Abhishek Srivastava <abhisheksrivastava3@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 04/08/2016 08:55 AM
> Subject: Re: Updating documentation for notebook
> ------------------------------
>
>
>
> Great job Niketan, I have been searching for such a document of late.
>
> Regards,
> Abhishek Srivastava
> Fellowship Scholar , IIM Ranchi
> Skype : abhi.sri3
>
> On Fri, Apr 8, 2016 at 6:34 AM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> >
> >
> > Hi all,
> >
> > Here is a suggestion for reducing the barrier to entry for SystemML:
> > "Have a detailed quickstart guide/video using Notebook on free (or
> > trial-based) hosting solution like IBM Bluemix or Data Scientist Workbench".
> >
> > I have created a sample tutorial:
> > https://github.com/niketanpansare/systemml_tutorial
> >
> > Missing items in above tutorial:
> > 1. Create a separate section for Notebook rather than have it hidden under
> > MLContext Programming guide (
> > http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html
> > ).
> > 2. Add Python Notebooks (This requires attaching both jars and python
> > MLContext to Zeppelin or Jupyter context).
> > 3. Allow users to use jars from our nightly build (see my jupyter example)
> > as well as released version (see my zeppelin example).
> > 4. Tutorials for all our algorithms using real world dataset. Example:
> > https://www.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Mod_BigR.html
> > .
> > 5. DML Kernel for Zeppelin (see
> > https://issues.apache.org/jira/browse/SYSTEMML-542).
> > 6. Other hosting services such as AzureML.
> > 7. Tutorial that shows SystemML's integration with MLPipeline.
> >
> > These missing items can be broken down into relatively small tasks with
> > detailed specification that external contributors can work on. Any
> > thoughts ?
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
>
>
>
>
>
>