spark-user mailing list archives

From Soumya Simanta <soumya.sima...@gmail.com>
Subject Creating a feature vector from text before using with MLLib
Date Wed, 01 Oct 2014 21:18:32 GMT
I'm trying to understand the intuition behind the featurize method that
Aaron used in one of his demos. I believe these features will only work for
detecting the character set (i.e., the language used).

Can someone help?


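import org.apache.spark.mllib.linalg.{Vector, Vectors}

def featurize(s: String): Vector = {
  val n = 1000                         // number of hash buckets (vector size)
  val result = new Array[Double](n)
  val bigrams = s.sliding(2).toArray   // overlapping 2-character windows

  // Hash each bigram into a bucket and accumulate its relative frequency.
  for (h <- bigrams.map(_.hashCode % n)) {
    result(h) += 1.0 / bigrams.length
  }

  // Keep only the non-zero buckets as (index, value) pairs.
  Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
}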
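
To see what actually ends up in the vector, here is a minimal sketch of the
same hashing step in plain Scala, without MLlib. The helper name
hashedBigramFrequencies and the sample string "abab" are assumptions for
illustration only; they are not part of Aaron's demo. It returns the non-zero
buckets as a map from bucket index to relative bigram frequency.

```scala
// Plain-Scala sketch of the hashing idea; no Spark dependency.
// `hashedBigramFrequencies` is a hypothetical helper, not an MLlib API.
def hashedBigramFrequencies(s: String, n: Int = 1000): Map[Int, Double] = {
  // Overlapping 2-character windows: "abab" -> "ab", "ba", "ab"
  val bigrams = s.sliding(2).toVector
  bigrams
    .groupBy(b => ((b.hashCode % n) + n) % n) // bucket index in [0, n)
    .map { case (bucket, hits) => bucket -> hits.size.toDouble / bigrams.size }
}

// "ab" occurs twice and "ba" once, so two buckets are populated
// and their weights sum to 1.
val features = hashedBigramFrequencies("abab")
```

If this reading is right, each non-zero entry is the share of the string's
character bigrams that hash to that bucket, so two strings get similar vectors
when they use similar character pairs.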
