spark-dev mailing list archives

From Jatin Puri <purija...@gmail.com>
Subject Re: [mllib] Document frequency
Date Mon, 14 Jan 2019 15:40:06 GMT
Thanks for the response. Should I go ahead and create a JIRA ticket?
I can then send a pull request with the changes.
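
For context, the reverse computation mentioned in the original message can be sketched as below. This is only an illustration in plain Python, not Spark code. It assumes the smoothed formula documented for Spark's IDF, idf = log((m + 1) / (df + 1)), where m is the number of documents. The helper names `idf_from_df` and `df_from_idf` are hypothetical and exist only for this example.

```python
import math


def idf_from_df(df, num_docs):
    """Smoothed IDF, assuming idf = log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1.0) / (df + 1.0))


def df_from_idf(idf, num_docs):
    """Invert the formula above: df = (m + 1) / exp(idf) - 1.

    The result is rounded because df is an integer count and the
    floating-point round trip is not exact.
    """
    return int(round((num_docs + 1.0) / math.exp(idf) - 1.0))


# Hypothetical corpus size and per-term document frequencies.
num_docs = 1000
doc_freqs = [1, 10, 250, 1000]

idf_values = [idf_from_df(df, num_docs) for df in doc_freqs]
recovered = [df_from_idf(value, num_docs) for value in idf_values]
```

Note that the inversion needs the number of documents, which is itself one of the two values proposed for exposure. It also cannot recover df for terms whose idf was forced to 0 by a minimum document frequency filter, which is another reason to expose the values directly.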

On Mon, Jan 14, 2019 at 8:18 PM Sean Owen <srowen@gmail.com> wrote:

> I think that's reasonable. The caller probably has the number of docs
> already but sure, it's one long and is already computed. This would
> have to be added to PySpark too.
>
> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri <purijatin@gmail.com> wrote:
> >
> > Hello.
> >
> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good idea to also expose:
> >
> > 1. Document frequency vector
> > 2. Number of documents
> >
> > We currently get both for free; they just need to be exposed as public `val`s.
> >
> > This avoids re-implementation for anyone who needs the document frequency (df) of terms. Currently, anyone who needs df has to reverse-compute it from the idf values obtained.
> >
> > AFAIK, we don't explicitly provide such functionality in mllib, and we don't need a separate class if we can expose it in `IDFModel` itself.
> >
> > Does it sound alright?
> >
> > Regards,
> > Jatin
> >
>


-- 
Jatin Puri
http://jatinpuri.com <http://www.jatinpuri.com>
