systemml-dev mailing list archives

From Mike Dusenberry <dusenberr...@gmail.com>
Subject Re: Updating documentation for notebook
Date Tue, 12 Apr 2016 00:25:04 GMT
I'm very much in favor of getting a good set of simple tutorials focused
specifically on notebook integration.  I would suggest that we aim more
towards open-source Jupyter & Zeppelin rather than Bluemix and other hosted
versions, unless users can quickly and easily access a test notebook on the
latter options.

+1 on including the DML scripts in the JAR specifically for use with the
MLPipeline wrappers.  Additionally, we should promote Scala for these
future wrappers now that we have the infrastructure in place and one
example wrapper written in Scala. :)
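For readers new to the MLPipeline estimator/transformer contract these wrappers follow, here is a minimal stand-in sketch. It is plain Python for brevity (the real wrappers are Scala classes extending Spark ML's Estimator/Model), and every name and the placeholder "training" logic here are illustrative, not the actual SystemML API:

```python
# Minimal sketch of the estimator/transformer pattern the MLPipeline
# wrappers follow. Plain-Python stand-in; in the real wrappers, fit() and
# transform() would invoke the DML training and scoring scripts.

class LogisticRegressionEstimator:
    """Estimator: fit(data) trains and returns a transformer (model)."""
    def __init__(self, max_iter=100):
        self.max_iter = max_iter  # hyperparameter, passed through to the script

    def fit(self, rows):
        # Placeholder "training": a real wrapper runs the DML training script.
        weights = [0.0] * len(rows[0][0])
        return LogisticRegressionModel(weights)


class LogisticRegressionModel:
    """Transformer: transform(rows) appends predictions."""
    def __init__(self, weights):
        self.weights = weights

    def transform(self, rows):
        # Placeholder scoring: a real wrapper runs the DML scoring script.
        return [(x, 1.0 if sum(w * xi for w, xi in zip(self.weights, x)) > 0 else 0.0)
                for x, _label in rows]


model = LogisticRegressionEstimator().fit([([1.0, 2.0], 1.0)])
print(model.transform([([1.0, 2.0], 1.0)]))  # → [([1.0, 2.0], 0.0)]
```

The key property is that fit() returns a new object rather than mutating the estimator, which is what lets these wrappers compose inside a Pipeline.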

As part of this push to focus on notebook integration, we really need to
continue the discussion Deron is leading on the MLContext redesign so that
it will be easier for users to understand and use, particularly with
regard to input to and output from DML scripts.

- Mike

--

Michael W. Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

On Apr 11, 2016, at 11:59 AM, Niketan Pansare <npansar@us.ibm.com> wrote:

Hi Deron,

I too like the idea of having a single command, but rather than supporting
web datasets in read(), how about having a Java/Scala wrapper (see point 1
below)?

1. Let's have a wrapper, org.apache.sysml.api.Datasets, with the following
methods:
a. load_*(), similar to http://scikit-learn.org/stable/datasets/#toy-datasets.
These methods download the toy dataset (if not already downloaded), put it
in a configurable tmp directory, and push it to the underlying FS.
b. make_*(), similar to
http://scikit-learn.org/stable/datasets/#sample-generators. These methods
call the DML scripts in the folder
https://github.com/apache/incubator-systemml/tree/master/scripts/datagen
using MLContext/JMLC.

The load_*() methods help create interesting demos (which will likely
run in CP), whereas make_*() will test the scalability of SystemML :)
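To make the proposed shape concrete, here is a hedged sketch of the two method families. All names, the cache directory, and the generation logic are assumptions for illustration, not the actual SystemML API; a real load_*() would download from a hosted URL and push to the underlying FS, and a real make_*() would invoke a datagen DML script via MLContext/JMLC:

```python
import os
import random
import tempfile

# Hypothetical sketch of the proposed Datasets wrapper's two method families.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "systemml_datasets")


def load_toy(name, fetch):
    """Fetch toy dataset `name` only if it is not already cached locally."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(path):
        with open(path, "w") as f:
            # Real code would download here, then push to the underlying FS.
            f.write(fetch())
    return path


def make_random(rows, cols, seed=42):
    """Generate a rows x cols matrix; real code would call a datagen DML script."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


path = load_toy("uni.mtx", fetch=lambda: "1.0\n2.0\n")
X = make_random(1000, 10)
```

The cache-then-reuse behavior of load_*() is what makes demos repeatable, while make_*() can be parameterized to arbitrary sizes for scalability tests.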

2. We need to embed all our existing DML scripts into the jar, with an
option for the user to provide a custom script directory. This allows the
user to simply import the jar (without downloading the scripts) and run one
of our wrappers, e.g. org.apache.sysml.api.ml.LogisticRegression.
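The lookup logic this implies is small; here is a hedged sketch (names and the default directory are placeholders, and the real mechanism would read classpath resources out of the SystemML jar rather than a filesystem default):

```python
import os

# Stand-in for the jar-embedded copy of the scripts (illustrative path).
DEFAULT_SCRIPT_DIR = "/opt/systemml/scripts"


def resolve_script(name, custom_dir=None):
    """Prefer the user's custom script directory; fall back to the packaged default."""
    if custom_dir:
        candidate = os.path.join(custom_dir, name)
        if os.path.exists(candidate):
            return candidate
    # No override (or no match there): use the copy shipped with the jar.
    return os.path.join(DEFAULT_SCRIPT_DIR, name)


print(resolve_script("LogisticRegression.dml"))
# → /opt/systemml/scripts/LogisticRegression.dml
```

Falling back silently to the packaged copy keeps the common case (no custom directory) zero-configuration.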

3. MLPipeline wrappers need to be implemented for the scripts in
https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms.
A sample implementation is available at
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegression.java

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar


From: Deron Eriksson <deroneriksson@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 04/11/2016 10:46 AM
Subject: Re: Fw: Updating documentation for notebook
------------------------------



Hi Niketan,

I think a separate section for Notebooks is a great idea since, as you
point out, they are currently hidden under the MLContext section. Also, I
really like the idea of making it as easy as possible for a new user to try
out SystemML in a notebook. Very good points.

Tutorials for all the algorithms using real-world data would be fantastic.
I would also like to see single-line algorithm invocations (possibly
with generated data) that can be copy/pasted and run with no modifications
by the user. This would probably mean either including small sets of
example data in the project, or allowing data to be read from URLs.

It would be nice to take something like these 5 commands:
---
$ wget
https://raw.githubusercontent.com/apache/incubator-systemml/master/scripts/datagen/genRandData4Univariate.dml
$ $SPARK_HOME/bin/spark-submit $SYSTEMML_HOME/SystemML.jar -f
genRandData4Univariate.dml -exec hybrid_spark -args 1000000 100 10 1 2 3 4
uni.mtx
$ echo '1' > uni-types.csv
$ echo '{"rows": 1, "cols": 1, "format": "csv"}' > uni-types.csv.mtd
$ $SPARK_HOME/bin/spark-submit $SYSTEMML_HOME/SystemML.jar -f
$SYSTEMML_HOME/algorithms/Univar-Stats.dml -exec hybrid_spark -nvargs
X=uni.mtx TYPES=uni-types.csv STATS=uni-stats.txt
---
and reduce them to 1 command (in the documentation) that the user can
copy/paste to run the algorithm without any additional work:
---
$ $SPARK_HOME/bin/spark-submit $SYSTEMML_HOME/SystemML.jar -f
$SYSTEMML_HOME/algorithms/Univar-Stats.dml -exec hybrid_spark -nvargs X=
http://www.example.com/uni.mtx TYPES=http://www.example.com/uni-types.csv
STATS=uni-stats.txt
---
If we had this for each of the main algorithms, it would give users
working examples to start from, which is easier than trying to figure this
kind of thing out by reading the comments in the DML algorithm files.
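The URL-input idea above amounts to a small resolution step before the script runs. Here is a hedged sketch of that step (names are illustrative, and SystemML's read() does not do this today, which is exactly what is being proposed): if an -nvargs value such as X= is an http(s) URL, download it to a temp file first; otherwise treat it as a local/HDFS path.

```python
import os
import tempfile
import urllib.parse
import urllib.request


def resolve_input(spec):
    """Return a local path for `spec`, downloading it first when it is a URL."""
    scheme = urllib.parse.urlparse(spec).scheme
    if scheme in ("http", "https"):
        local = os.path.join(tempfile.gettempdir(), os.path.basename(spec))
        # Fetch once, then hand the local copy to the script.
        urllib.request.urlretrieve(spec, local)
        return local
    return spec  # already a path the engine can read (local, hdfs, ...)


print(resolve_input("uni.mtx"))  # → uni.mtx (no download needed)
```

With something like this in front of argument parsing, the documented one-liner with X=http://... inputs would work unchanged on any machine with network access.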

Deron


On Fri, Apr 8, 2016 at 4:51 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi all,
>
> As per Luciano's suggestion, I have created a PR with the bluemix/datascientist
> tutorial and have flagged it with "Please DONOT push this PR until the
> discussion on dev mailing list is complete." :)
>
> Also, I apologize for the incorrect indentation in my last email. Here is
> another attempt:
> - How do you want to try SystemML?
> --+ Notebook on cloud
> ----* Bluemix
> ------ + Zeppelin
> ----------- Using Python Kernel
> ------------ + Learn how to write DML program (something along the lines of
> http://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html )
> ------------ + Try out pre-packaged algorithms on real-world dataset
> -------------- * Linear Regression
> -------------- * GLM
> -------------- * ALS
> -------------- * ...
> ------------ + Learn how to pass RDD/DataFrame to SystemML
> ------------ + Learn how to use SystemML as MLPipeline
> estimator/transformer
> ------------ + Learn how to use SystemML with existing Python packages
> ----------- Using Scala Kernel
> ------------ + ... similar to Python kernel
> ----------- Using DML Kernel
> ------------ + Learn how to write DML program
> ------ + Jupyter
> --------- Using Python Kernel
> --------- Using Scala Kernel
> --------- Using DML Kernel
> ----* Data Scientist Workbench
> ----* Databricks cloud
> ----* ...
> --+ Notebook on laptop/cluster
> ----* Zeppelin
> ----* Jupyter
> --+ Laptop
> ----* Run SystemML as Standalone jar:
> http://apache.github.io/incubator-systemml/quick-start-guide.html
> ----* Embed SystemML into other Java program:
> http://apache.github.io/incubator-systemml/jmlc.html
> ----* Debug a DML script:
> http://apache.github.io/incubator-systemml/debugger-guide.html
> ----* Spark local mode
> --+ Spark Cluster
> ----* Batch invocation
> ----* Using Spark REPL
> ------+ Learn how to pass RDD/DataFrame to SystemML
> ------+ Learn how to use SystemML as MLPipeline estimator/transformer
> ----* Using PySpark REPL
> ------+ Learn how to pass RDD/DataFrame to SystemML
> ------+ Learn how to use SystemML as MLPipeline estimator/transformer
> --+ Hadoop Cluster
> --+ Spark Cluster on EC2
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> ----- Forwarded by Niketan Pansare/Almaden/IBM on 04/08/2016 04:48 PM -----
>
>
>
> *Fw: Updating documentation for notebook*
>
> *Niketan Pansare *
> to:
> dev
> 04/08/2016 01:11 PM
>
>
>
>
> From:
> Niketan Pansare/Almaden/IBM
>
>
>
>
> To:
> dev <dev@systemml.incubator.apache.org>
>
> Hi all,
>
> Here are a few suggestions to get things started:
> 1. Have a "Quick Start" (or "Get Started") button beside "Get SystemML"
> on http://systemml.apache.org/.
>
> 2. Then the user can go through the following questionnaire/bulleted list,
> which points people to the appropriate links:
> - How do you want to try SystemML?
> + Notebook on cloud
>   * Bluemix
>     + Zeppelin
>       - Using Python Kernel
>         + Learn how to write DML program (something along the lines of
> http://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html )
>         + Try out pre-packaged algorithms on real-world dataset
>           * Linear Regression
>           * GLM
>           * ALS
>           * ...
>         + Learn how to pass RDD/DataFrame to SystemML (for example:
> http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html )
>         + Learn how to use SystemML as MLPipeline estimator/transformer
>         + Learn how to use SystemML with existing Python packages
>       - Using Scala Kernel
>         + ... similar to Python kernel
>       - Using DML Kernel
>         + Learn how to write DML program
>     + Jupyter
>       - Using Python Kernel
>       - Using Scala Kernel
>       - Using DML Kernel
>   * Data Scientist Workbench
>   * Databricks cloud
>   * ...
>
> + Notebook on laptop/cluster
>   * Zeppelin using docker images (for example:
> http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#zeppelin-notebook-example---linear-regression-algorithm )
>   * Jupyter (for example:
> http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization )
>
> + Laptop
>   * Run SystemML as standalone jar:
> http://apache.github.io/incubator-systemml/quick-start-guide.html
>   * Embed SystemML into other Java programs:
> http://apache.github.io/incubator-systemml/jmlc.html
>   * Debug a DML script:
> http://apache.github.io/incubator-systemml/debugger-guide.html
>   * Spark local mode
>
> + Spark Cluster
>   * Batch invocation
>   * Using Spark REPL
>     + Learn how to pass RDD/DataFrame to SystemML
>     + Learn how to use SystemML as MLPipeline estimator/transformer
>   * Using PySpark REPL
>     + Learn how to pass RDD/DataFrame to SystemML
>     + Learn how to use SystemML as MLPipeline estimator/transformer
>
> + Hadoop Cluster
> + Spark Cluster on EC2
>
> 3. Add links to SystemML presentations:
> https://www.youtube.com/watch?v=n3JJP6UbH6Q
> https://www.youtube.com/watch?v=6VpiJK8Jydw
> https://www.youtube.com/watch?v=PV-5pZboo4A
> https://www.youtube.com/watch?v=7Zrc5EzOTjg
> https://www.youtube.com/watch?v=3T32lweGxOA
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> ----- Forwarded by Niketan Pansare/Almaden/IBM on 04/08/2016 01:03 PM -----
>
>
>
> *Re: Updating documentation for notebook*
>
> *Niketan Pansare *
> to:
> dev
> 04/08/2016 10:47 AM
>
> *Please respond to dev*
>
> Thanks Abhishek. I am glad it was helpful :)
>
> Luciano: I agree with you about having a central place for documentation.
> Before cleaning up the tutorial and putting it into our documentation, I
> wanted to:
> 1. Have a discussion about which setup we should use to introduce
> SystemML: command-line standalone, command-line spark/pyspark REPL
> (yarn/standalone), command-line hadoop, or a scala/python notebook (online
> notebook, or requiring the user to set up jupyter/zeppelin).
> 2. Encourage other contributors to come up with intellectually stimulating
> tutorials using real-world datasets and our existing DML algorithms. This
> means creating JIRAs that people can work on. My repository is only a POC
> to facilitate discussion and will be deleted after that.
> 3. If we do decide to go with an online-notebook-based tutorial, have a
> discussion on how to structure it:
> - so as to support a variety of hosting sites (bluemix / datascientist
> workbench / databricks cloud / azureml / aws / ...).
> - Python or Scala as the primary language.
> - Jupyter or Zeppelin as the primary notebook.
> - DML-kernel, MLContext-based, or JMLC-based examples.
> - Any standard tutorial (or textbook) we should use as a guide for
> choosing the dataset.
> - Whether the emphasis should be on learning DML or on building a larger
> data pipeline (for example: our MLPipeline wrapper).
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> Abhishek Srivastava ---04/08/2016 08:55:58 AM---Great job Niketan , I had
> been searching for such document off late. Regards,
>
> From: Abhishek Srivastava <abhisheksrivastava3@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 04/08/2016 08:55 AM
> Subject: Re: Updating documentation for notebook
> ------------------------------
>
>
>
> Great job Niketan, I had been searching for such a document of late.
>
> Regards,
> Abhishek Srivastava
> Fellowship Scholar , IIM Ranchi
> Skype : abhi.sri3
>
> On Fri, Apr 8, 2016 at 6:34 AM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> >
> >
> > Hi all,
> >
> > Here is a suggestion for reducing the barrier to entry for SystemML:
> > "Have a detailed quickstart guide/video using a Notebook on a free (or
> > trial-based) hosting solution like IBM Bluemix or Data Scientist Workbench".
> >
> > I have created a sample tutorial:
> > https://github.com/niketanpansare/systemml_tutorial
> >
> > Missing items in the above tutorial:
> > 1. Create a separate section for Notebooks rather than having it hidden
> > under the MLContext Programming guide (
> >
> >
> > http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html ).
> > 2. Add Python Notebooks (this requires attaching both the jars and the
> > Python MLContext to the Zeppelin or Jupyter context).
> > 3. Allow users to use jars from our nightly build (see my Jupyter example)
> > as well as the released version (see my Zeppelin example).
> > 4. Tutorials for all our algorithms using real-world datasets. Example:
> > https://www.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_Mod_BigR.html .
> > 5. DML Kernel for Zeppelin (see
> > https://issues.apache.org/jira/browse/SYSTEMML-542).
> > 6. Other hosting services such as AzureML.
> > 7. Tutorial that shows SystemML's integration with MLPipeline.
> >
> > These missing items can be broken down into relatively small tasks with
> > detailed specifications that external contributors can work on. Any
> > thoughts?
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
>
