spark-user mailing list archives

From Jatin Puri <purija...@gmail.com>
Subject Ability to have CountVectorizerModel vocab as empty
Date Wed, 19 Aug 2020 08:11:16 GMT
Hello,

This is with regard to
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244

require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

Currently, training `CountVectorizer` on an empty dataset results in the
exception shown below. But it is a perfectly valid use case to send it
empty data (or for minDF to filter out every term).
HashingTF works fine in such scenarios. CountVectorizer doesn't.

Can we remove this constraint? Happy to send a pull request.

java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
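
To make the requested behavior concrete, here is a toy sketch in plain Scala with no Spark dependency. It is not Spark's implementation: `EmptyVocabSketch`, `buildVocab` and `transform` are hypothetical names, minDF is simplified to an integer document count, and the output is a dense array rather than a Spark vector. It only illustrates that an empty vocabulary can map every document to a zero-length count vector instead of failing.

```scala
// Toy model of vocabulary selection and counting; NOT Spark's CountVectorizer.
object EmptyVocabSketch {

  /** Keep terms that appear in at least `minDF` documents (sorted for determinism). */
  def buildVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    docs.flatMap(_.distinct)            // count each term at most once per document
      .groupBy(identity)
      .collect { case (term, hits) if hits.size >= minDF => term }
      .toArray
      .sorted

  /** Dense term counts over `vocab`; an empty vocab yields zero-length vectors. */
  def transform(docs: Seq[Seq[String]], vocab: Array[String]): Seq[Array[Int]] = {
    val index = vocab.zipWithIndex.toMap
    docs.map { doc =>
      val counts = new Array[Int](vocab.length)
      doc.foreach(term => index.get(term).foreach(i => counts(i) += 1))
      counts
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b"), Seq("c"))
    val vocab = buildVocab(docs, minDF = 2) // no term occurs in two documents
    println(s"vocab size = ${vocab.length}")
    println(transform(docs, vocab).map(_.length))
  }
}
```

With the `require` in place, the equivalent of `buildVocab` returning an empty array is where the exception above is thrown. In this sketch the empty array simply flows through to `transform`.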
