spark-user mailing list archives

From sooraj <soora...@gmail.com>
Subject Re: mllib kmeans produce 1 large and many extremely small clusters
Date Tue, 11 Aug 2015 06:13:19 GMT
Hi,

The issue is most likely in the data or in the transformations you apply,
rather than in the Spark KMeans API itself. I'd start debugging with some
exploratory analysis of the TF-IDF vectors: for instance, plot a histogram
of the TF-IDF values for each word across the documents. It's quite
possible that, for most words, the TF-IDF values are nearly the same in
most documents, causing all 5000 points to crowd into the same region of
the n-dimensional space they live in.
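
A rough sketch of that check, in plain Python rather than Spark. It assumes
the TF-IDF vectors have already been collected to the driver as lists of
floats (for example with tfidf.map(lambda v: v.toArray().tolist()).collect()),
and the helper names histogram, word_values and norm_summary are made up for
illustration. The norm summary is an extra check, not something from the
original thread: if a few documents have much larger norms than the rest,
they can end up as the tiny clusters.

```python
import math
from collections import Counter


def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins; returns [(bin_start, count), ...]."""
    values = list(values)
    if not values:
        return []
    lo, hi = min(values), max(values)
    # If every value is identical the range is zero; fall back to width 1.0
    # so that everything lands in the first bin.
    width = (hi - lo) / n_bins or 1.0
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    return [(lo + i * width, counts.get(i, 0)) for i in range(n_bins)]


def word_values(rows, j):
    """TF-IDF values of word (column) j across all documents."""
    return [row[j] for row in rows]


def norm_summary(rows):
    """(min, median, max) of the Euclidean norms of the document vectors."""
    norms = sorted(math.sqrt(sum(v * v for v in row)) for row in rows)
    return norms[0], norms[len(norms) // 2], norms[-1]


# Toy stand-in for the collected TF-IDF rows (one list per document).
rows = [[0.0, 1.0, 2.0], [0.0, 0.0, 10.0], [3.0, 4.0, 0.0]]
for j in range(len(rows[0])):
    print(j, histogram(word_values(rows, j), n_bins=3))
print(norm_summary(rows))
```

If almost every word shows all of its mass in a single bin, or the norms
differ by orders of magnitude between documents, that would support the
"everything crowds together" explanation.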



On 10 August 2015 at 10:28, farhan <farhan_siddiqui@hotmail.com> wrote:

> I tried running MLlib k-means on the 20newsgroups data set from sklearn. On
> a 5000-document sample I get one cluster with most of the documents, while
> the other clusters have just a handful of documents each.
>
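> # code
> # (sc, tokenize_line and vocabulary are defined elsewhere and not shown)
> import random
> from collections import Counter
> from sklearn.datasets import fetch_20newsgroups
> from pyspark.mllib.feature import IDF
> from pyspark.mllib.clustering import KMeans
>
> newsgroups_train = fetch_20newsgroups(
>     subset='train', random_state=1,
>     remove=('headers', 'footers', 'quotes'))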
> small_list = random.sample(newsgroups_train.data,5000)
>
> def get_word_vec(text,vocabulary):
>     word_lst = tokenize_line(text)
>     word_counter = Counter(word_lst)
>     lst = []
>     for v in vocabulary:
>         if v in word_counter:
>             lst.append(word_counter[v])
>         else:
>             lst.append(0)
>     return lst
>
> docsrdd = sc.parallelize(small_list)
> tf = docsrdd.map(lambda x : get_word_vec(x,vocabulary))
> idf = IDF().fit(tf)
> tfidf = idf.transform(tf)
> clusters = KMeans.train(tfidf, 20)
>
> #documents in each cluster, using clusters.predict(x)
> Counter({0: 4978, 11: 3, 9: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8:
> 1, 10: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1})
>
>
> Please help!
