spark-issues mailing list archives

From "Yanbo Liang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-13010) Survival analysis in SparkR
Date Wed, 27 Jan 2016 11:10:39 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119054#comment-15119054 ]

Yanbo Liang edited comment on SPARK-13010 at 1/27/16 11:09 AM:
---------------------------------------------------------------

There are two issues that we should discuss:
1. Should AFTSurvivalRegression be supported under the SparkR::glm interface or not?
I vote no; we can add a new function named “survreg” (R has a function of the same name).
“survreg” would also return a PipelineModel like SparkR::glm, and it could be used for prediction via SparkR::predict.
We should first reorganize SparkRWrappers so that it supports more models, although
that is simple.
2. The response variable of the R formula should be a pair for survival analysis.
Take these R survival analysis calls as examples:
{code}
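# Parametric survival regression with an exponential distribution
survreg(Surv(futime, fustat) ~ ecog.ps + rx, ovarian, dist = "exponential")
# Survival curve from a Cox model; the response is again a Surv pair
survfit(coxph(Surv(time, censor) ~ 1), type = "aalen")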
{code}
Surv wraps the pair of “labelCol” and “censorCol” as the response variable of the R formula.
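
To make the pair structure concrete, base R can already split such a formula into its wrapper, response columns, and feature terms without evaluating it. This is only a sketch of the parsing idea, not how RFormula is or would be implemented; the variable names (wrapper, response, features) are chosen for illustration.

```r
# Inspect the formula structure only. Surv() is never evaluated here,
# so the survival package is not needed for this sketch.
f <- Surv(futime, fustat) ~ ecog.ps + rx

lhs <- f[[2]]                              # the call Surv(futime, fustat)
wrapper <- as.character(lhs[[1]])          # name of the wrapping function
response <- all.vars(lhs)                  # columns forming the response pair
features <- attr(terms(f), "term.labels")  # right-hand-side terms

print(wrapper)
print(response)
print(features)
```

An RFormula that accepts a pair as label would need an equivalent of the response step, mapping the two names to “labelCol” and “censorCol”.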

So the first step is to make RFormula support a pair as the label.
One possible way is to support “cbind” in SparkR: it would return a Scala Tuple2/Vector column,
and the label of RFormula would then support the Tuple2/Vector type.
GLM with the binomial family could also benefit from this feature. But we should also consider
whether “cbind” conflicts with other SparkR functions, and we need to keep the semantics
consistent.
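
For reference, base R's glm already accepts a cbind(successes, failures) response for the binomial family, which is presumably the behaviour a SparkR “cbind” would need to stay consistent with. A minimal sketch in plain R (not SparkR), on toy data invented for illustration:

```r
# Toy data made up for illustration: successes and failures per dose level.
d <- data.frame(
  dose = c(1, 2, 3, 4, 5),
  successes = c(2, 4, 6, 8, 9),
  failures = c(8, 6, 4, 2, 1)
)

# cbind() builds a two-column matrix response (successes, failures).
fit <- glm(cbind(successes, failures) ~ dose, family = binomial, data = d)

print(coef(fit))
# The response stored in the model frame is the two-column matrix.
print(dim(model.response(model.frame(fit))))
```

Here the label is a two-column matrix rather than a single column, which is the same shape of problem as the (label, censor) pair in survival analysis.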

Looking forward to hearing your thoughts. [~mengxr]


> Survival analysis in SparkR
> ---------------------------
>
>                 Key: SPARK-13010
>                 URL: https://issues.apache.org/jira/browse/SPARK-13010
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Yanbo Liang
>
> Implement a simple wrapper of AFTSurvivalRegression in SparkR to support survival analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

