spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth
Date Tue, 14 Feb 2017 02:00:44 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864872#comment-15864872 ]

Nick Pentreath commented on SPARK-14503:
----------------------------------------

It seems {{PrefixSpan}} even takes a different input type: {{Array[Array[T]]}}, versus
{{Array[T]}} for {{FPGrowth}}. So it may be tricky to unify the two.

However, we do already have a case where an estimator returns a general-purpose class as its
model: {{QuantileDiscretizer}} returns a {{Bucketizer}} as the {{Model}} from {{fit}}. There,
{{Bucketizer}} can also be instantiated directly and independently, and in theory some other
estimator could return a {{Bucketizer}} as its model too.

So we could perhaps have both {{FPGrowth}} and {{PrefixSpan}} return an {{AssociationRuleModel}}
from {{fit}}. This could work if the input can be generalized to {{Seq[T]}}, where it would be
{{Seq[Item]}} for {{FPGrowth}} and {{Seq[Seq[Item]]}} for {{PrefixSpan}}. The output of
{{transform}} for the model would be the predicted items, as above. The model would expose
{{getFreqItems}} and {{getAssociationRules}}, both returning a {{DataFrame}}.
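
To make the shape of this idea concrete, here is a minimal, dependency-free Scala sketch of two estimators returning one shared model type. None of these types are Spark APIs: {{FreqItemset}}, {{Rule}}, {{PatternEstimator}}, {{NaiveItemsetMiner}} and {{FlattenedSequenceMiner}} are hypothetical names, plain Scala collections stand in for {{DataFrame}}, and the miner is a brute-force subset counter rather than FP-growth. The sequence variant only shows that the types can line up; it flattens each sequence and so discards ordering.

```scala
// Hypothetical sketch: these types are illustrative and are not Spark APIs.
// Plain Scala collections stand in for DataFrames.
final case class FreqItemset[Item](items: Set[Item], freq: Long)
final case class Rule[Item](antecedent: Set[Item], consequent: Item, confidence: Double)

// The shared model that both estimators would return from fit.
final class AssociationRuleModel[Item](
    freqItemsets: Seq[FreqItemset[Item]],
    minConfidence: Double) {

  def getFreqItems: Seq[FreqItemset[Item]] = freqItemsets

  // Single-consequent rules X => y with confidence freq(X + y) / freq(X).
  def getAssociationRules: Seq[Rule[Item]] = {
    val freqOf = freqItemsets.map(f => f.items -> f.freq).toMap
    for {
      f <- freqItemsets
      if f.items.size > 1
      consequent <- f.items.toSeq
      antecedent = f.items - consequent
      antecedentFreq <- freqOf.get(antecedent).toSeq
      confidence = f.freq.toDouble / antecedentFreq
      if confidence >= minConfidence
    } yield Rule(antecedent, consequent, confidence)
  }

  // Predicted items: consequents of rules whose antecedent is contained in the basket.
  def transform(basket: Set[Item]): Set[Item] =
    getAssociationRules
      .filter(r => r.antecedent.subsetOf(basket) && !basket.contains(r.consequent))
      .map(_.consequent)
      .toSet
}

// Row is Seq[Item] for itemset mining and Seq[Seq[Item]] for sequence mining.
trait PatternEstimator[Row, Item] {
  def fit(data: Seq[Row]): AssociationRuleModel[Item]
}

// Brute-force stand-in for FPGrowth: counts every non-empty subset of every
// transaction. Exponential in transaction size, so only usable on tiny inputs.
final class NaiveItemsetMiner[Item](minCount: Long, minConfidence: Double)
    extends PatternEstimator[Seq[Item], Item] {
  def fit(data: Seq[Seq[Item]]): AssociationRuleModel[Item] = {
    val frequent = data
      .flatMap { transaction =>
        val items: Set[Item] = transaction.toSet
        items.subsets().filter(_.nonEmpty)
      }
      .groupBy(identity)
      .map { case (items, occurrences) => FreqItemset(items, occurrences.size.toLong) }
      .filter(_.freq >= minCount)
      .toSeq
    new AssociationRuleModel(frequent, minConfidence)
  }
}

// Stand-in for PrefixSpan that only shows the input type lining up: it flattens
// each sequence of itemsets into one transaction, discarding the ordering that
// sequential pattern mining is meant to capture.
final class FlattenedSequenceMiner[Item](minCount: Long, minConfidence: Double)
    extends PatternEstimator[Seq[Seq[Item]], Item] {
  def fit(data: Seq[Seq[Seq[Item]]]): AssociationRuleModel[Item] =
    new NaiveItemsetMiner[Item](minCount, minConfidence).fit(data.map(_.flatten))
}

object AssociationRuleSketch {
  def main(args: Array[String]): Unit = {
    val transactions = Seq(Seq("a", "b"), Seq("a", "b", "c"), Seq("a"))
    val model = new NaiveItemsetMiner[String](minCount = 2, minConfidence = 0.6).fit(transactions)
    println(model.transform(Set("a"))) // prints Set(b)
  }
}
```

The fact that the sequence variant has to throw away ordering to reuse the same model is one hint at where a real unification might get difficult.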

Is there something in the nature of {{PrefixSpan}} vs {{FPGrowth}} that makes this too difficult?
(I'll have to go read the papers when I get some time!)

That said, it could be pretty complex to support this. If so, unless there's a compelling
argument, I'd go with [~josephkb]'s suggestion above and hide the association rule class for
now (it can be exposed later as needed). {{PrefixSpan}} would then be totally independent
and return its own {{PrefixSpanModel}} (which may also expose a {{transform}} method with
similar semantics but different internals).

> spark.ml Scala API for FPGrowth
> -------------------------------
>
>                 Key: SPARK-14503
>                 URL: https://issues.apache.org/jira/browse/SPARK-14503
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based API, with
> details for this class.  The doc could also look ahead to the other fpm classes, especially
> if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

