spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <>
Subject [jira] [Commented] (SPARK-8999) Support non-temporal sequence in PrefixSpan
Date Sat, 01 Aug 2015 16:05:05 GMT


Xiangrui Meng commented on SPARK-8999:

[~srowen] Thanks for your feedback! The PrefixSpan paper has ~2k citations, and I can find
implementations in many libraries, e.g., SPMF and R. I think it is fair to say the algorithm is
popular in data mining. The question I had is whether we want to support sequences of itemsets
instead of sequences of items. The former complicates both the API and the implementation. I
asked the author of SPMF for advice. He said that without itemset support the problem is called
string mining, which is better handled by other algorithms. So it seems that we should
implement PrefixSpan as in the paper, which supports itemsets.
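
To make the distinction concrete: in a sequence of itemsets, a pattern matches when each of its
itemsets is a subset of a distinct itemset of the sequence, in order. When every itemset holds
a single item, this reduces to ordinary subsequence (string) matching. The following is a
minimal Scala sketch of that containment test. The object and method names are hypothetical and
are not part of MLlib.

```scala
// Hypothetical helper for illustration only; not an MLlib API.
object ItemsetContainment {
  type Itemset = Set[Int]

  // Returns true if `pattern` is a subsequence of `sequence`, i.e. each pattern
  // itemset is a subset of a distinct sequence itemset and the order is preserved.
  def contains(sequence: Seq[Itemset], pattern: Seq[Itemset]): Boolean = {
    var from = 0 // index of the first sequence itemset still available for matching
    pattern.forall { p =>
      // Greedily take the earliest itemset at or after `from` that covers `p`.
      val idx = sequence.indexWhere(s => p.subsetOf(s), from)
      if (idx < 0) false
      else { from = idx + 1; true }
    }
  }
}
```

For the sequence <(1 2)(3)(2 4)>, the pattern <(1)(2 4)> is contained, while <(3)(1)> is not,
because no itemset containing 1 occurs after the one containing 3.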

> Support non-temporal sequence in PrefixSpan
> -------------------------------------------
>                 Key: SPARK-8999
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>            Assignee: Zhang JiaJin
>            Priority: Critical
>             Fix For: 1.5.0
> In SPARK-6487, we assume that all items are ordered. However, we should support non-temporal
> sequences in PrefixSpan. This should be done before 1.5 because it changes PrefixSpan APIs.
> We can use `Array[Array[Int]]` or follow SPMF to use `Array[Int]` and use -1 to mark
> itemset boundaries. The latter is more efficient for storage. If we support generic item type,
> we can use null.
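
The two candidate encodings from the description can be compared with a small Scala sketch that
converts between the nested `Array[Array[Int]]` form and a flat `Array[Int]` with -1 as the
boundary marker. This is an illustration under stated assumptions, not the MLlib
implementation: the object and method names are hypothetical, the sketch writes a -1 after
every itemset (including the last), and it assumes no item is itself equal to -1.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper for illustration only; not an MLlib API.
object SequenceEncoding {
  // Assumed boundary marker, following the -1 convention mentioned in the issue.
  val Delimiter: Int = -1

  // Nested form -> flat form: append the delimiter after each itemset.
  def flatten(sequence: Array[Array[Int]]): Array[Int] = {
    require(sequence.forall(_.forall(_ != Delimiter)), "items must not equal the delimiter")
    sequence.flatMap(itemset => itemset :+ Delimiter)
  }

  // Flat form -> nested form: split on the delimiter.
  def unflatten(flat: Array[Int]): Array[Array[Int]] = {
    val result = ArrayBuffer.empty[Array[Int]]
    val current = ArrayBuffer.empty[Int]
    flat.foreach { x =>
      if (x == Delimiter) {
        result += current.toArray
        current.clear()
      } else {
        current += x
      }
    }
    // Tolerate a missing trailing delimiter by closing the last itemset.
    if (current.nonEmpty) result += current.toArray
    result.toArray
  }
}
```

With this convention, the sequence <(1 2)(3)> is stored as `Array(1, 2, -1, 3, -1)`, a single
primitive array instead of one array object per itemset.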

This message was sent by Atlassian JIRA
