spark-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Ability to have CountVectorizerModel vocab as empty
Date Wed, 19 Aug 2020 13:28:11 GMT
I think that's true. You're welcome to open a pull request / JIRA to
remove that requirement.

On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri <purijatin@gmail.com> wrote:
>
> Hello,
>
> This is wrt https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244
>
> require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
>
> Currently, training `CountVectorizer` on an empty dataset results in the following
> exception. But sending it empty data is a perfectly valid use case (as is minDF filtering out everything).
> HashingTF works fine in such scenarios. CountVectorizer doesn't.
>
> Can we remove this constraint? Happy to send a pull request.
>
> java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
> at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
> at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
> at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
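
For anyone wanting to reproduce the behaviour discussed above, here is a minimal sketch in Scala. It assumes spark-sql and spark-mllib are on the classpath and that a local SparkSession is acceptable. The object name EmptyVocabRepro, the helper fitVocabSize, and the column names "words" and "features" are made up for illustration and do not come from the thread. The fit call is wrapped in Try because whether the requirement is still enforced depends on the Spark version in use, so the sketch prints whatever happens rather than assuming an outcome.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

object EmptyVocabRepro {
  // Hypothetical helper: fit a CountVectorizer and return the vocabulary size,
  // capturing any exception (such as the failed requirement) instead of throwing.
  def fitVocabSize(df: DataFrame, minDF: Double): Try[Int] = Try {
    new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setMinDF(minDF)
      .fit(df)
      .vocabulary
      .length
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-vocab-repro")
      .getOrCreate()
    import spark.implicits._

    // Case 1: no documents at all.
    val empty = Seq.empty[Seq[String]].toDF("words")
    // Case 2: every term occurs in only one document, so minDF = 2 filters out all of them.
    val sparse = Seq(Seq("a", "b"), Seq("c", "d")).toDF("words")

    println(s"empty dataset: ${fitVocabSize(empty, 1.0)}")
    println(s"minDF filters everything: ${fitVocabSize(sparse, 2.0)}")

    // HashingTF has no fitted vocabulary, so it can transform the empty input directly.
    val hashed = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .transform(empty)
    println(s"HashingTF rows on empty input: ${hashed.count()}")

    spark.stop()
  }
}
```

A Failure wrapping an IllegalArgumentException in either of the first two lines of output corresponds to the stack trace quoted above; a Success(0) would indicate that the Spark version being run no longer enforces the check.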
