spark-issues mailing list archives

From "Martin Zapletal (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-3278) Isotonic regression
Date Mon, 16 Mar 2015 09:37:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362986#comment-14362986 ]

Martin Zapletal edited comment on SPARK-3278 at 3/16/15 9:37 AM:
-----------------------------------------------------------------

Vladimir,

just a quick update on the progress. I was able to complete the isotonic regression with 100M
records (data similar to what you requested, using NASDAQ index prices), but failed with an
insufficient memory error at 150M records on my machine. You may be able to run larger amounts
of data on better hardware.
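For reference, the runs above had roughly the following shape. The IsotonicRegression API is the one added by this ticket; the generated input here is only illustrative (the actual runs used NASDAQ index prices and scaled n up to the 100M-150M range):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.IsotonicRegression

object IsotonicBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("isotonic-benchmark"))

    // Illustrative (label, feature, weight) triples; noisy but monotone
    // in expectation. The real benchmark used NASDAQ index prices.
    val n = 1000000
    val input = sc.parallelize(1 to n).map { i =>
      (i.toDouble + scala.util.Random.nextGaussian(), i.toDouble, 1.0)
    }

    // Fit a non-decreasing (isotonic) model and query it at one point.
    val model = new IsotonicRegression().setIsotonic(true).run(input)
    println(model.predict(n / 2.0))
    sc.stop()
  }
}
{code}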

The pool adjacent violators algorithm can theoretically run in linear time, but although
I have used the best algorithm I could find, I am not convinced this implementation reaches
that bound. I will work on providing evidence.
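To make the linear-time claim concrete, here is a sketch of sequential pool adjacent violators over points already sorted by feature (the textbook pooling scheme, not the exact MLlib code). Each point enters a pool exactly once and every merge removes a pool, so there are at most n merges in total:

{code:scala}
// Sequential pool adjacent violators over labels y with weights w,
// assumed already sorted by feature. Each merge removes one pool,
// so the loop performs at most n merges overall: O(n) after sorting.
def pav(y: Array[Double], w: Array[Double]): Array[Double] = {
  case class Pool(var sum: Double, var weight: Double, var size: Int) {
    def mean: Double = sum / weight
  }
  val pools = scala.collection.mutable.ArrayBuffer.empty[Pool]
  for (i <- y.indices) {
    pools += Pool(y(i) * w(i), w(i), 1)
    // Merge backwards while the last two pools violate monotonicity.
    while (pools.length > 1 && pools(pools.length - 2).mean > pools.last.mean) {
      val last = pools.remove(pools.length - 1)
      pools.last.sum += last.sum
      pools.last.weight += last.weight
      pools.last.size += last.size
    }
  }
  // Expand pools back to one fitted value per input point.
  pools.flatMap(p => Array.fill(p.size)(p.mean)).toArray
}
{code}

For example, pav(Array(1.0, 3.0, 2.0, 4.0), Array.fill(4)(1.0)) pools the violating pair (3, 2) and returns (1.0, 2.5, 2.5, 4.0).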

The biggest issue with the current algorithm, however, is the parallelization approach.
Its scaling is unfortunately nowhere near linear (solution time decreasing linearly as
parallelism increases, or staying constant as parallelism and problem size increase together).
This was expected and is caused by the algorithm itself, for the following reasons:

1) The algorithm works in two steps. First the computation is distributed across all partitions,
but the partial results are then gathered and the algorithm is run again over the whole data set.
This approach may leave most of the work for the last, sequential step, gaining very little over
a purely sequential implementation or even performing worse. That happens in cases where the
parallel isotonic regressions return locally optimal solutions that still have to change to reach
the global solution in the last step. Another performance drawback compared to sequential
processing is the potential need to copy data to each process. (See the sketch after this list.)
2) It requires the whole dataset to fit into one process's memory in the last step (or repeated
disk access).
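The two steps can be sketched as follows, reusing the pav helper above. Weights are simplified to 1.0 and features are dropped after sorting; this shows the structure, not the exact MLlib code. The collect() in the middle is exactly where issues 1) and 2) arise:

{code:scala}
import org.apache.spark.rdd.RDD

// Two-step scheme: PAV runs independently inside each partition in
// parallel, then everything is gathered and PAV runs once more over
// the whole sequence. Assumes the pav(y, w) helper above is in scope.
def parallelPav(points: RDD[(Double, Double)]): Array[Double] = {
  val pooled = points
    .sortByKey()                   // range-partitioned global sort by feature
    .mapPartitions { iter =>
      val labels = iter.map(_._2).toArray
      pav(labels, Array.fill(labels.length)(1.0)).iterator
    }
    .collect()                     // whole dataset in one process (issue 2)
  pav(pooled, Array.fill(pooled.length)(1.0))  // sequential final pass (issue 1)
}
{code}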

I started looking into the issue and was able to design an iterative algorithm that addresses
both of the above issues and performs very close to linear scalability. It still has
correctness (rounding) issues, however, and will require further research.

Let me know if that helps. In the meantime I will continue working on benchmarks and performance
quantification of the current algorithm, as well as researching potentially more efficient
solutions.



> Isotonic regression
> -------------------
>
>                 Key: SPARK-3278
>                 URL: https://issues.apache.org/jira/browse/SPARK-3278
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Martin Zapletal
>             Fix For: 1.3.0
>
>
> Add isotonic regression for score calibration.




