spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-2012) PySpark StatCounter with numpy arrays
Date Sat, 02 Aug 2014 19:35:12 GMT

     [ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-2012.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1.0
         Assignee: Jeremy Freeman

> PySpark StatCounter with numpy arrays
> -------------------------------------
>
>                 Key: SPARK-2012
>                 URL: https://issues.apache.org/jira/browse/SPARK-2012
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.0
>            Reporter: Jeremy Freeman
>            Assignee: Jeremy Freeman
>            Priority: Minor
>             Fix For: 1.1.0
>
>
> In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke because the added functionality for computing the minimum and maximum, as implemented, doesn't work on arrays.
> I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions "maximum" and "minimum", which work on both numpy arrays and scalars (and I've added new tests for this capability).
> However, I realize this adds a dependency on NumPy outside of MLLib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLLib?
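A minimal sketch of the idea described above (not the actual PySpark StatCounter implementation; class and attribute names here are illustrative): unlike Python's built-in max/min, numpy's element-wise np.maximum/np.minimum accept both scalars and arrays, so the same running min/max code covers RDDs of either.

```python
import numpy as np

class MiniStatCounter:
    """Toy running-stats accumulator; tracks count, mean, min, and max."""

    def __init__(self, values=()):
        self.n = 0
        self.mu = 0.0
        self.maxValue = float("-inf")
        self.minValue = float("inf")
        for v in values:
            self.merge(v)

    def merge(self, value):
        self.n += 1
        # Welford-style running mean; broadcasts if value is an array.
        delta = value - self.mu
        self.mu = self.mu + delta / self.n
        # np.maximum/np.minimum work element-wise on arrays and
        # degenerate to plain max/min on scalars -- built-in max/min
        # would raise on arrays ("truth value is ambiguous").
        self.maxValue = np.maximum(self.maxValue, value)
        self.minValue = np.minimum(self.minValue, value)
        return self

# Scalars:
s = MiniStatCounter([1.0, 3.0, 2.0])
# s.minValue == 1.0, s.maxValue == 3.0

# Arrays of equal length, handled element-wise:
a = MiniStatCounter([np.array([1.0, 5.0]), np.array([4.0, 2.0])])
# a.minValue == [1.0, 2.0], a.maxValue == [4.0, 5.0]
```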



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

