spark-issues mailing list archives

From 雷文昌 (JIRA) <j...@apache.org>
Subject [jira] [Created] (SPARK-15399) Wrong equation in the method of org.apache.spark.mllib.clustering.KMeans
Date Thu, 19 May 2016 06:45:13 GMT
雷文昌 created SPARK-15399:
---------------------------

             Summary: Wrong equation in the method of org.apache.spark.mllib.clustering.KMeans
                 Key: SPARK-15399
                 URL: https://issues.apache.org/jira/browse/SPARK-15399
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.1
         Environment: Windows 64-bit
            Reporter: 雷文昌


The equation |a - b| = ||a| - |b|| does not hold in general when a and b are vectors, but it is used in spark-1.6.1:
private[mllib] def findClosest(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
      // distance computation.
      var lowerBoundOfSqDist = center.norm - point.norm
      lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
      if (lowerBoundOfSqDist < bestDistance) {
        val distance: Double = fastSquaredDistance(center, point)
        if (distance < bestDistance) {
          bestDistance = distance
          bestIndex = i
        }
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }
The center and the point in the source code are vectors, so I suggest changing the code to:
private[mllib] def findClosest(
      centers: TraversableOnce[VectorWithNorm],
      point: VectorWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      // Compute the squared distance to every center, without the lower-bound check.
      val distance: Double = fastSquaredDistance(center, point)
      if (distance < bestDistance) {
        bestDistance = distance
        bestIndex = i
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }
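
The two quantities discussed above can be compared directly on a small pair of vectors. The following is only a minimal sketch: `LowerBoundCheck`, `norm`, `sqDist` and `normDiffSq` are hypothetical helpers written for this illustration and are not part of Spark, and plain `Array[Double]` stands in for `VectorWithNorm`. Note that the comment in the quoted 1.6.1 code states the relation with `\geq`, not with `=`.

```scala
// Hypothetical helpers for illustration only; none of these are Spark APIs.
object LowerBoundCheck {
  // Euclidean norm of a dense vector.
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Squared Euclidean distance between two vectors of equal length.
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Square of the difference of the norms, computed the same way as
  // lowerBoundOfSqDist in the quoted findClosest.
  def normDiffSq(a: Array[Double], b: Array[Double]): Double = {
    val d = norm(a) - norm(b)
    d * d
  }

  def main(args: Array[String]): Unit = {
    // Two orthogonal unit vectors: equal norms, but not the same point.
    val a = Array(1.0, 0.0)
    val b = Array(0.0, 1.0)
    println(s"squared distance = ${sqDist(a, b)}")
    println(s"(|a| - |b|)^2    = ${normDiffSq(a, b)}")
  }
}
```

For this pair the squared distance is 2.0 while the squared norm difference is 0.0, so the two values are not equal, and the norm-based value is the smaller of the two.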



