spark-issues mailing list archives

From "AbderRahman Sobh (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-17950) Match SparseVector behavior with DenseVector
Date Tue, 25 Oct 2016 00:13:59 GMT

     [ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

AbderRahman Sobh updated SPARK-17950:
-------------------------------------
    Description: 
What changes were proposed in this pull request?

Added a __getattr__ method to SparseVector, mirroring the one DenseVector has, but it delegates to a SciPy sparse
representation instead of keeping a dense array in self.array at all times.

This lets users call functions on the values of an entire SparseVector in the same direct
way that they interact with DenseVectors.
For example, you can call SparseVector.mean() to average the values in the entire vector.
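
The delegation described above could look roughly like the following. This is a minimal sketch, not the actual patch: SparseVectorSketch and its _to_scipy helper are hypothetical names, and the real pyspark SparseVector has more fields and validation. It only assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy.sparse import csr_matrix


class SparseVectorSketch:
    """Hypothetical stand-in for a sparse vector (not the pyspark class)."""

    def __init__(self, size, indices, values):
        self.size = int(size)
        self.indices = np.asarray(indices, dtype=np.int32)
        self.values = np.asarray(values, dtype=np.float64)

    def _to_scipy(self):
        # Build a 1 x size CSR matrix on demand; no dense array is stored.
        indptr = np.array([0, len(self.indices)], dtype=np.int32)
        return csr_matrix(
            (self.values, self.indices, indptr), shape=(1, self.size)
        )

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        # Refuse our own fields and dunder names to avoid infinite recursion
        # (for example while the object is being copied or unpickled).
        if name in ("size", "indices", "values") or name.startswith("__"):
            raise AttributeError(name)
        # Forward everything else to the SciPy sparse representation.
        return getattr(self._to_scipy(), name)


v = SparseVectorSketch(4, [1, 3], [1.0, 3.0])
mean_value = v.mean()  # averages over all 4 slots, including implicit zeros
total = v.sum()
largest = v.max()
```

Because __getattr__ is only a fallback, attributes defined on the class itself (size, indices, values) are never forwarded to SciPy.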

Note: Some functions behave slightly differently because they come from SciPy rather than NumPy. However,
most of the commonly used functions (sum, mean, max, etc.) are available in both packages.
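
One way to see the kind of difference the note refers to is to compare which method names a NumPy array and a SciPy sparse matrix both expose. This is an illustrative check only; the candidate names below are chosen for the example and are not an exhaustive list from the patch.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([0.0, 1.0, 0.0, 3.0])
sparse = csr_matrix(dense.reshape(1, -1))

# Reductions that exist on both the NumPy array and the SciPy sparse matrix.
shared = [
    name for name in ("sum", "mean", "max", "min")
    if hasattr(dense, name) and hasattr(sparse, name)
]

# Methods that NumPy arrays have but SciPy sparse matrices do not provide.
numpy_only = [
    name for name in ("std", "var")
    if hasattr(dense, name) and not hasattr(sparse, name)
]

# For a shared reduction over the whole vector, the values agree.
same_mean = float(dense.mean()) == float(sparse.mean())
```

A vector that forwards to SciPy would therefore raise AttributeError for a name such as std, while the same call on a dense vector backed by NumPy would succeed.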

How was this patch tested?

Manual testing on local machine.
Passed ./python/run-tests
No UI changes.

  was:
Added a `__getattr__` method to SparseVector, mirroring the one DenseVector has, but it calls self.toArray()
instead of keeping a dense array in self.array at all times.

This lets users call numpy functions on the values of a SparseVector in the same direct
way that they interact with DenseVectors.
For example, you can call SparseVector.mean() to average the values in the entire vector.

    Component/s: ML

> Match SparseVector behavior with DenseVector
> --------------------------------------------
>
>                 Key: SPARK-17950
>                 URL: https://issues.apache.org/jira/browse/SPARK-17950
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib, PySpark
>    Affects Versions: 2.0.1
>            Reporter: AbderRahman Sobh
>            Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

