spark-dev mailing list archives

From Jatin Puri <purija...@gmail.com>
Subject Re: [mllib] Document frequency
Date Mon, 14 Jan 2019 16:01:24 GMT
Thanks. Created: https://issues.apache.org/jira/browse/SPARK-26616

On Mon, Jan 14, 2019 at 9:19 PM Sean Owen <srowen@gmail.com> wrote:

> Yes that seems OK to me.
>
> On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri <purijatin@gmail.com> wrote:
> >
> > Thanks for the response. So should I go ahead and create a JIRA ticket?
> > I can then send a pull request with the changes.
> >
> > On Mon, Jan 14, 2019 at 8:18 PM Sean Owen <srowen@gmail.com> wrote:
> >>
> >> I think that's reasonable. The caller probably has the number of docs
> >> already, but sure, it's one long and is already computed. This would
> >> have to be added to PySpark too.
> >>
> >> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri <purijatin@gmail.com> wrote:
> >> >
> >> > Hello.
> >> >
> >> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good idea to also expose:
> >> >
> >> > 1. Document frequency vector
> >> > 2. Number of documents
> >> >
> >> > We currently get both of the above for free; they just need to be exposed as public vals.
> >> >
> >> > This avoids re-implementation for anyone who needs the document frequency (df) of terms. Currently, anyone who needs df has to reverse-compute it from the idf values obtained.
> >> >
> >> > AFAIK, we don't explicitly provide such functionality in mllib, and we don't need a separate class if we can expose it in `IDFModel` itself.
> >> >
> >> > Does it sound alright?
> >> >
> >> > Regards,
> >> > Jatin
> >> >
> >
> >
> >
> > --
> > Jatin Puri
> > http://jatinpuri.com
> >
>


-- 
Jatin Puri
http://jatinpuri.com <http://www.jatinpuri.com>
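
For readers following the thread, here is a rough sketch of the "reverse compute" workaround mentioned in the original message. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents. It is plain Python with no Spark dependency; the helper names `idf_from_df` and `df_from_idf` are hypothetical and are not part of any Spark API.

```python
import math


def idf_from_df(doc_freq, num_docs):
    """Smoothed IDF as documented for Spark MLlib: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1.0) / (doc_freq + 1.0))


def df_from_idf(idf, num_docs):
    """Invert the formula above: df = (m + 1) / exp(idf) - 1.

    The result is rounded because floating-point error can leave it
    slightly off an integer. The caller must already know num_docs.
    """
    return round((num_docs + 1.0) / math.exp(idf) - 1.0)


# Toy corpus statistics (made-up numbers for illustration only).
num_docs = 4
doc_freqs = [4, 2, 1]

idf_values = [idf_from_df(df, num_docs) for df in doc_freqs]
recovered = [df_from_idf(v, num_docs) for v in idf_values]

print(idf_values)
print(recovered)
```

Note that the inversion only works when the number of documents is known, and it presumably cannot recover df for terms whose idf was forced to 0 by a minimum-document-frequency filter. Both points are reasons to expose the values directly rather than recompute them.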
