spark-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: [mllib] Document frequency
Date Mon, 14 Jan 2019 15:49:36 GMT
Yes that seems OK to me.

On Mon, Jan 14, 2019 at 9:40 AM Jatin Puri <purijatin@gmail.com> wrote:
>
> Thanks for the response. Shall I go ahead and create a JIRA ticket?
> I can then send a pull request with the changes.
>
> On Mon, Jan 14, 2019 at 8:18 PM Sean Owen <srowen@gmail.com> wrote:
>>
>> I think that's reasonable. The caller probably has the number of docs
>> already but sure, it's one long and is already computed. This would
>> have to be added to PySpark too.
>>
>> On Mon, Jan 14, 2019 at 7:56 AM Jatin Puri <purijatin@gmail.com> wrote:
>> >
>> > Hello.
>> >
>> > As part of `org.apache.spark.ml.feature.IDFModel`, I think it is a good idea
>> > to also expose:
>> >
>> > 1. Document frequency vector
>> > 2. Number of documents
>> >
>> > We currently get both of the above for free; they just need to be exposed
>> > as public vals.
>> >
>> > This avoids re-implementation for anyone who needs the document frequency
>> > (df) of terms. Currently, anyone who needs df has to reverse-compute it
>> > from the idf values obtained.
>> >
>> > AFAIK, we don't explicitly provide such functionality in mllib. And we
>> > don't need a separate class if we can expose it in `IDFModel` itself.
>> >
>> > Does it sound alright?
>> >
>> > Regards,
>> > Jatin
>> >
>
>
>
> --
> Jatin Puri
> http://jatinpuri.com
>
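
For context on the "reverse compute" step mentioned in the original message, here is a minimal sketch in plain Python (no Spark required). It assumes the smoothed formula idf = log((m + 1) / (df + 1)) that MLlib documents for IDF, where m is the number of documents. The function names and sample numbers are hypothetical and chosen only for illustration; this is not the code inside `IDFModel`.

```python
import math


def idf_from_df(df, num_docs):
    # Assumed smoothed IDF formula: log((m + 1) / (df + 1)).
    return math.log((num_docs + 1.0) / (df + 1.0))


def df_from_idf(idf, num_docs):
    # Invert the formula above: df = (m + 1) / exp(idf) - 1.
    # round() absorbs floating-point error, since df is an integer count.
    return round((num_docs + 1.0) / math.exp(idf) - 1.0)


# Hypothetical corpus: 4 documents, three terms with known document frequencies.
num_docs = 4
doc_freqs = [1, 2, 4]

idf_values = [idf_from_df(df, num_docs) for df in doc_freqs]
recovered = [df_from_idf(v, num_docs) for v in idf_values]
```

Note that the inversion needs the number of documents as well as the idf values, which is one reason to expose both. If a minimum document frequency filter has set some idf entries to 0, the original df of those terms cannot be recovered this way.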


