spark-user mailing list archives

From Jatin Puri <purija...@gmail.com>
Subject Re: Ability to have CountVectorizerModel vocab as empty
Date Wed, 19 Aug 2020 16:49:01 GMT
Thanks Sean for the quick response.

Logged a Jira: https://issues.apache.org/jira/browse/SPARK-32662

Will send a pull request shortly.

Regards,
Jatin

On Wed, Aug 19, 2020 at 6:58 PM Sean Owen <srowen@gmail.com> wrote:

> I think that's true. You're welcome to open a pull request / JIRA to
> remove that requirement.
>
> On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri <purijatin@gmail.com> wrote:
> >
> > Hello,
> >
> > This is wrt
> > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244
> >
> > require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
> >
> > Currently, training `CountVectorizer` on an empty dataset results in the
> > following exception. But sending it empty data (or having minDF filter out
> > everything) is a perfectly valid use case.
> > HashingTF works fine in such scenarios; CountVectorizer doesn't.
> >
> > Can we remove this constraint? Happy to send a pull request.
> >
> > java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
> > at scala.Predef$.require(Predef.scala:224)
> > at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
> > at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
> > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
> > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>
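
For illustration, the behaviour requested in the quoted message can be sketched in plain Scala, without Spark. This is not the code in CountVectorizer.scala: `VocabSketch`, `buildVocab` and `transform` are hypothetical names, minDF is simplified to an absolute document count, and the vocabulary ordering is an assumption. The point is only that an empty vocabulary can be treated as a valid result that maps every document to a zero-length count vector, rather than failing a `require`.

```scala
// Hypothetical, simplified stand-in for vocabulary building; not Spark's implementation.
object VocabSketch {

  // Keep terms that appear in at least minDF documents.
  // Assumption: minDF is an absolute document count, ordering is by document frequency.
  def buildVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] = {
    val docFreq: Map[String, Int] = docs
      .flatMap(_.distinct)                 // count each term once per document
      .groupBy(identity)
      .map { case (term, hits) => term -> hits.size }

    docFreq.toSeq
      .filter { case (_, df) => df >= minDF }
      .sortBy { case (term, df) => (-df, term) }
      .map(_._1)
      .toArray                             // may be empty; no require(...) here
  }

  // Count vocabulary terms in one document.
  // With an empty vocabulary the result is simply an empty array.
  def transform(vocab: Array[String], doc: Seq[String]): Array[Int] = {
    val index = vocab.zipWithIndex.toMap
    val counts = new Array[Int](vocab.length)
    doc.foreach(term => index.get(term).foreach(i => counts(i) += 1))
    counts
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "a"), Seq("b", "c"))
    println(buildVocab(docs, minDF = 2).mkString(","))        // only "b" is in both documents
    println(buildVocab(Seq.empty, minDF = 1).length)          // empty dataset
    println(buildVocab(docs, minDF = 3).length)               // minDF filters everything
    println(transform(Array.empty[String], Seq("a")).length)  // zero-length vector
  }
}
```

Both cases from the quoted message (an empty dataset, and a minDF that filters out every term) end in the same place in this sketch: a zero-length vocabulary that downstream code can still use.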


-- 
Jatin Puri
http://jatinpuri.com <http://www.jatinpuri.com>
