spark-issues mailing list archives

From "Yanbo Liang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-17904) Add a wrapper function to install R packages on each executor.
Date Thu, 13 Oct 2016 14:41:21 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-17904:
--------------------------------
    Description: 
SparkR provides {{spark.lapply}} to run local R functions in a distributed environment, and {{dapply}} to run a UDF on a SparkDataFrame.
If users use third-party libraries inside the function passed to {{spark.lapply}} or {{dapply}}, they must install the required R packages on each executor in advance.
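For example, a call like the following (only an illustrative sketch; the {{Matrix}} usage is a placeholder) fails on any executor where the package has not been installed, because the body of the function is evaluated on the executors:
{code}
# Illustrative sketch: library(Matrix) runs on the executors, so this fails
# wherever the Matrix package is not installed.
result <- spark.lapply(1:4, function(x) {
  library(Matrix)          # third-party package used inside the function
  nnzero(Diagonal(x))      # placeholder Matrix-specific computation
})
{code}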
To install dependent R packages on each executor and check that the installation succeeded, we can run code like the following:
(Note: the code is just an example, not a prototype of this proposal. The detailed implementation should be discussed.)
{code}
# Install the Matrix package on every partition; collect to force evaluation.
rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), function(part) install.packages("Matrix"))
SparkR:::collectRDD(rdd)
# Check that the package is now available on each executor.
test <- function(part) { "Matrix" %in% rownames(installed.packages()) }
rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
SparkR:::collectRDD(rdd)
{code}
It's cumbersome to run this code snippet every time a third-party library is needed: SparkR is an interactive analytics tool, and users may load many libraries during an analytics session. In native R, users can simply run {{install.packages()}} and {{library()}} at any point in the interactive session.
Should we provide a single API that wraps the work mentioned above, so that users can install dependent R packages on each executor easily?
I propose the following API (a usage sketch follows the parameter descriptions):
{{spark.installPackages(pkgs, repos)}}
* pkgs: the names of the packages to install. If {{repos = NULL}}, this can be a local or HDFS path, so that SparkR can install packages from local package archives.
* repos: the base URL(s) of the repositories to use. It can be NULL to install from local directories.
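A rough usage sketch of the proposed API (hypothetical, since nothing is implemented yet; names, defaults, and the archive path below are only placeholders):
{code}
# Hypothetical usage of the proposed wrapper (not implemented yet):
# 1) install from a CRAN mirror on every executor
spark.installPackages("Matrix", repos = "https://cran.r-project.org")

# 2) install from a local/HDFS package archive, without a repository lookup
#    (the path is a made-up example)
spark.installPackages("hdfs:///tmp/Matrix_1.2-8.tar.gz", repos = NULL)

# afterwards the package can be used directly inside distributed functions
spark.lapply(1:4, function(x) { library(Matrix); nnzero(Diagonal(x)) })
{code}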

Since SparkR has its own library directories in which to install the packages on each executor, I think it will not pollute the native R environment (a quick way to check the executor-side library paths is sketched below).
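A minimal sketch, assuming an active SparkR session:
{code}
# Sketch: show the library search paths that the R worker on an executor sees.
# The exact directories depend on the deployment, but they need not match the
# driver's native R library setup.
spark.lapply(1, function(x) .libPaths())
{code}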
I'd like to know whether this makes sense, and feel free to correct me if there is any misunderstanding.


> Add a wrapper function to install R packages on each executor.
> --------------------------------------------------------------
>
>                 Key: SPARK-17904
>                 URL: https://issues.apache.org/jira/browse/SPARK-17904
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
>





