spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <>
Subject [jira] [Commented] (SPARK-14503) Scala API for FPGrowth
Date Tue, 14 Feb 2017 02:00:44 GMT


Nick Pentreath commented on SPARK-14503:

It seems {{PrefixSpan}} even takes a different input type: {{Array[Array[T]]}}, versus {{Array[T]}} for {{FPGrowth}}.
So it may be tricky to unify.

However we do have the case where e.g. {{QuantileDiscretizer}} returns a {{Bucketizer}} as
{{Model}} from {{fit}}. In that case {{Bucketizer}} can be instantiated directly and independently,
but it could in theory be the case that some other estimator returns a {{Bucketizer}} as its {{Model}}.
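
To make the {{QuantileDiscretizer}} / {{Bucketizer}} relationship concrete, here is a minimal sketch against the existing spark.ml feature API. The object name, column names, toy data and split points are made up for illustration, and it assumes Spark is on the classpath with a local master available.

```scala
import org.apache.spark.ml.feature.{Bucketizer, QuantileDiscretizer}
import org.apache.spark.sql.SparkSession

// Illustrative only: object name, column names and data are arbitrary.
object BucketizerFromFit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bucketizer-from-fit")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0).toDF("value")

    // fit returns a Bucketizer; QuantileDiscretizer has no model class of its own.
    val fitted: Bucketizer = new QuantileDiscretizer()
      .setInputCol("value")
      .setOutputCol("bucket")
      .setNumBuckets(3)
      .fit(df)

    // The same class can also be built by hand, with no estimator involved.
    val manual: Bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("bucket")
      .setSplits(Array(Double.NegativeInfinity, 3.0, 5.0, Double.PositiveInfinity))

    println(fitted.getSplits.mkString(", "))
    manual.transform(df).show()

    spark.stop()
  }
}
```

Both values have the same type, which is the property that would matter if {{FPGrowth}} and {{PrefixSpan}} were to share one model class.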

So we could perhaps think about both {{FPGrowth}} and {{PrefixSpan}} returning an {{AssociationRuleModel}}
from {{fit}}. It could work if the input can be generalized to {{Seq[T]}} where for {{FPGrowth}}
it would be {{Seq[Item]}} and for {{PrefixSpan}} it would be {{Seq[Seq[Item]]}}. The output
of {{transform}} for the model would be the predicted items as above. It would expose {{getFreqItems}}
and {{getAssociationRules}} both returning a {{DataFrame}}.
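
A rough sketch of what that shared shape might look like, in plain Scala with collections standing in for {{DataFrame}} columns. Everything here is hypothetical: {{AssociationRuleSketch}}, {{PatternEstimator}}, {{FreqItemset}}, {{Rule}} and the {{transform}} semantics are assumptions for illustration, not existing Spark classes or a proposed final design.

```scala
// Hypothetical sketch: none of these types exist in Spark, and plain Scala
// collections stand in for DataFrame columns.
object AssociationRuleSketch {
  final case class FreqItemset[T](items: Seq[T], freq: Long)
  final case class Rule[T](antecedent: Seq[T], consequent: Seq[T], confidence: Double)

  // One model type shared by both estimators. T is the element type of an input row.
  final case class AssociationRuleModel[T](
      freqItems: Seq[FreqItemset[T]],
      rules: Seq[Rule[T]]) {

    def getFreqItems: Seq[FreqItemset[T]] = freqItems
    def getAssociationRules: Seq[Rule[T]] = rules

    // Assumed semantics: fire every rule whose antecedent is contained in the
    // row, and return the consequents the row does not already contain.
    def transform(row: Seq[T]): Seq[T] = {
      val present = row.toSet
      rules
        .filter(_.antecedent.forall(present.contains))
        .flatMap(_.consequent)
        .filterNot(present.contains)
        .distinct
    }
  }

  // Both estimators would share this signature and differ only in T.
  trait PatternEstimator[T] {
    def fit(rows: Seq[Seq[T]]): AssociationRuleModel[T]
  }

  type Item = String
  type FPGrowthLike = PatternEstimator[Item]        // each row is Seq[Item]
  type PrefixSpanLike = PatternEstimator[Seq[Item]] // each row is Seq[Seq[Item]]
}
```

The sketch treats a row as an unordered set when matching antecedents, which is a simplification: sequential patterns are order-sensitive, so a real {{PrefixSpan}} model would likely need different matching logic behind the same interface.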

Is there something in the nature of {{PrefixSpan}} vs {{FPGrowth}} that makes this too difficult?
(I'll have to go read the papers when I get some time!)

But having said that, it could be pretty complex to try to support this. If so, unless there's
a compelling argument I'd go for [~josephkb]'s suggestion above, and hide the association
rule class for now (can expose later as needed). Then {{PrefixSpan}} will be totally independent
and return its own {{PrefixSpanModel}} (that may also expose a {{transform}} method that has
similar semantics but different internals).

> Scala API for FPGrowth
> -------------------------------
>                 Key: SPARK-14503
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
> This task is the first port of spark.mllib.fpm functionality to spark.ml (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based API, with
> details for this class.  The doc could also look ahead to the other fpm classes, especially
> if their API decisions will affect FPGrowth.
