spark-user mailing list archives

From DB Tsai <dbt...@stanford.edu>
Subject Re: Normalizations in MLBase
Date Thu, 12 Jun 2014 21:34:21 GMT
Hi Aslan,

I'm not sure whether the mlbase code is maintained against the current
Spark master. The following is the code we use for standardization at my
company. I intend to clean it up and submit a PR. You could use it
for now.

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  def standardize(data: RDD[Vector]): RDD[Vector] = {
    // Compute the per-column mean and variance over the whole data set.
    val summarizer = new RowMatrix(data).computeColumnSummaryStatistics()
    val mean = summarizer.mean
    val variance = summarizer.variance

    // Standardization always densifies the output, so the result is
    // stored as a dense vector.
    data.map { x =>
      val n = x.size
      val output = new Array[Double](n)
      var i = 0
      while (i < n) {
        if (variance(i) == 0) {
          // A constant column has no defined z-score.
          output(i) = Double.NaN
        } else {
          output(i) = (x(i) - mean(i)) / math.sqrt(variance(i))
        }
        i += 1
      }
      Vectors.dense(output)
    }
  }
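
To sanity-check the arithmetic on a few rows without a cluster, a local
sketch along these lines might help. It is not the code above:
`LocalStandardize` and `zscore` are hypothetical names, it works on plain
Scala collections rather than RDDs, and it assumes the sample variance
(dividing by n - 1), which is what I understand the column summary to
report.

```scala
object LocalStandardize {
  // Hypothetical helper: z-score of one value given its column's mean and
  // standard deviation. A zero standard deviation yields NaN.
  def zscore(x: Double, mean: Double, stddev: Double): Double =
    if (stddev == 0.0) Double.NaN else (x - mean) / stddev

  // Standardize every column of a small in-memory data set.
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    require(rows.size > 1, "need at least two rows for a sample variance")
    val n = rows.size
    val dims = rows.head.length

    // Per-column mean.
    val mean = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)

    // Per-column sample standard deviation (assumed n - 1 denominator).
    val stddev = Array.tabulate(dims) { j =>
      val sumSq = rows.map(r => (r(j) - mean(j)) * (r(j) - mean(j))).sum
      math.sqrt(sumSq / (n - 1))
    }

    rows.map(r => Array.tabulate(dims)(j => zscore(r(j), mean(j), stddev(j))))
  }
}
```

For the rows (1, 5), (2, 5), (3, 5), the first column comes out as
-1, 0, 1 and the constant second column as NaN.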

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Jun 12, 2014 at 1:49 AM, Aslan Bekirov <aslanbekirov@gmail.com> wrote:
> Hi DB,
>
> I found a piece of code that uses znorm to normalize data.
>
>
> /**
>  * build training data set from sample and summary data
>  */
>  val train_data = sample_data.map( v =>
>    Array.tabulate[Double](field_cnt)(
>      i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
>    )
>  ).cache
>
> Please comment if you see anything wrong.
>
> BR,
> Aslan
>
>
>
> On Thu, Jun 12, 2014 at 11:13 AM, Aslan Bekirov <aslanbekirov@gmail.com>
> wrote:
>>
>> Thanks a lot DB.
>>
>> I will try to do z-norm normalization using a map transformation.
>>
>>
>> BR,
>> Aslan
>>
>>
>> On Thu, Jun 12, 2014 at 12:16 AM, DB Tsai <dbtsai@stanford.edu> wrote:
>>>
>>> Hi Aslan,
>>>
>>> Currently, we don't have a utility function to do so. However, you
>>> can easily implement this with another map transformation. I'm working
>>> on this feature now, and there will be a couple of different
>>> normalization options that users can choose from.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Wed, Jun 11, 2014 at 6:25 AM, Aslan Bekirov <aslanbekirov@gmail.com>
>>> wrote:
>>> > Hi All,
>>> >
>>> > I have to normalize a set of values in the range 0-500 to the [0, 1]
>>> > range.
>>> >
>>> > Is there any util method in MLBase to normalize a large set of data?
>>> >
>>> > BR,
>>> > Aslan
>>
>>
>
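
The question at the bottom of the thread asks for rescaling values from
0-500 into [0, 1]. That is min-max scaling rather than standardization.
A minimal sketch follows, assuming the bounds are known up front;
`MinMaxScale` and `rescale` are hypothetical names and not part of MLlib
or MLBase.

```scala
object MinMaxScale {
  // Linearly map x from the known range [lo, hi] onto [0, 1].
  // Values outside [lo, hi] map outside [0, 1]; no clamping is done.
  def rescale(x: Double, lo: Double, hi: Double): Double = {
    require(hi > lo, "hi must be greater than lo")
    (x - lo) / (hi - lo)
  }

  // Apply the rescaling to a local collection. The same function can be
  // passed to map on an RDD[Double].
  def rescaleAll(xs: Seq[Double], lo: Double, hi: Double): Seq[Double] =
    xs.map(x => rescale(x, lo, hi))
}
```

With lo = 0 and hi = 500, the values 0, 250 and 500 map to 0.0, 0.5
and 1.0.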
