spark-user mailing list archives

From Pierce Lamb <>
Subject MLlib/kmeans newbie question(s)
Date Sat, 07 Mar 2015 23:20:32 GMT
Hi all,

I'm very new to machine learning algorithms and Spark. I'm following the
Twitter Streaming Language Classifier found here:

Specifically this code:

Except I'm trying to run it in batch mode on some tweets it pulls out
of Cassandra, in this case 200 total tweets.

As the example shows, I am using this object for "vectorizing" a set of tweets:

object Utils{
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
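
As I understand that comment, the idea can be sketched in plain Scala without Spark. This is only my illustration of the hashing trick on character bigrams, not MLlib's HashingTF: the names HashingSketch and hashedBigrams are made up, and the bucket a given bigram lands in will not match what MLlib computes.

```scala
// Plain-Scala illustration of hashing character bigrams into a fixed-length vector.
// Hypothetical helper for illustration only; this is NOT MLlib's HashingTF.
object HashingSketch {
  val numFeatures = 1000

  def hashedBigrams(s: String): Array[Double] = {
    val counts = new Array[Double](numFeatures)
    // sliding(2) yields overlapping two-character windows: "spark" -> sp, pa, ar, rk
    s.sliding(2).foreach { bigram =>
      // Map the bigram's hash code to a non-negative bucket index.
      val idx = ((bigram.hashCode % numFeatures) + numFeatures) % numFeatures
      counts(idx) += 1.0
    }
    counts
  }
}
```

Every tweet ends up as a vector of the same length (1000 here), whatever its text, and different bigrams can collide in the same bucket.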

Here is my code, which is modified from ExamineAndTrain.scala:

 val noSets = => set.mkString("\n"))

    val vectors =

    val numClusters = 5
    val numIterations = 30

    val model = KMeans.train(vectors, numClusters, numIterations)

      for (i <- 0 until numClusters) {
        println(s"\nCLUSTER $i")
        noSets.foreach {
            t => if (model.predict(Utils.featurize(t)) == 1) {

This code runs, and each cluster prints its header ("CLUSTER 0", "CLUSTER 1", etc.)
with nothing printed beneath it. If I flip

model.predict(Utils.featurize(t)) == 1 to
model.predict(Utils.featurize(t)) == 0

the same thing happens, except every tweet is printed beneath every cluster.

Here is what I intuitively think is happening (please correct my
thinking if it's wrong): this code turns each tweet into a vector,
randomly picks some clusters, then runs k-means to group the tweets (at
a really high level, the clusters, I assume, would be common
"topics"). As such, when it checks each tweet to see whether model.predict
== 1, different sets of tweets should appear under each cluster (and
because it's checking the training set against itself, every tweet
should be in a cluster). Why isn't it doing this? Either my
understanding of what k-means does is wrong, my training set is too
small, or I'm missing a step.
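
To make that expectation concrete, here is a minimal plain-Scala sketch of the assignment step as I picture it. It does not use Spark, and AssignmentSketch, nearest and groupByCluster are made-up names. It assumes the cluster centers are already known and that predicting a point means returning the index of its closest center.

```scala
// Illustration only: assign each point to its nearest center and group by that index.
// All names here are hypothetical; this is not the MLlib implementation.
object AssignmentSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest center, which is what I assume a k-means predict call returns.
  def nearest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Every point gets exactly one cluster index, so the groups partition the input.
  def groupByCluster(centers: Seq[Point], points: Seq[Point]): Map[Int, Seq[Point]] =
    points.groupBy(p => nearest(centers, p))
}
```

Under that assumption, each tweet would show up under exactly one cluster index, and the group sizes would add up to the size of the training set.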

Any help is greatly appreciated.
