spark-user mailing list archives

From diplomatic Guru <diplomaticg...@gmail.com>
Subject Re: [MLlib] What is the best way to forecast the next month page visit?
Date Tue, 02 Feb 2016 19:44:52 GMT
Hi Jorge,

Unfortunately, I couldn't transform the data as you suggested.

This is what I get:

+---+---------+-------------+
| id|pageIndex|      pageVec|
+---+---------+-------------+
|0.0|      3.0|    (3,[],[])|
|1.0|      0.0|(3,[0],[1.0])|
|2.0|      2.0|(3,[2],[1.0])|
|3.0|      1.0|(3,[1],[1.0])|
+---+---------+-------------+
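
A note on reading the table: each pageVec is a sparse vector printed as (size, [indices], [values]). Assuming the encoder is left at its default dropLast = true, four categories give vectors of size 3 and the last index (3.0) becomes the all-zero vector, which is why it shows as (3,[],[]). A minimal plain-Java sketch of that behaviour, without Spark; the class and the encode helper are hypothetical names used only for illustration:

```java
import java.util.Arrays;

public class OneHotDropLast {
    // Hypothetical helper: dense one-hot encoding of `index` among
    // `numCategories` categories. With dropLast the vector has
    // numCategories - 1 slots and the last index maps to all zeros.
    static double[] encode(int index, int numCategories, boolean dropLast) {
        int size = dropLast ? numCategories - 1 : numCategories;
        double[] vec = new double[size];
        if (index < size) {
            vec[index] = 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        // Same pageIndex values as in the table above, in row order.
        int[] pageIndex = {3, 0, 2, 1};
        for (int idx : pageIndex) {
            System.out.println(idx + " -> " + Arrays.toString(encode(idx, 4, true)));
        }
    }
}
```

So the empty vector in the first row is a valid encoding of index 3.0 rather than a missing value.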


This is the snippet:

JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
    RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
    RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
    RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
    RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)
));

StructType schema = new StructType(new StructField[] {
    new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("page", DataTypes.StringType, false, Metadata.empty()),
    new StructField("Nov", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("Dec", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("Jan", DataTypes.DoubleType, false, Metadata.empty())
});

DataFrame df = sqlContext.createDataFrame(jrdd, schema);

StringIndexerModel indexer = new StringIndexer()
    .setInputCol("page")
    .setInputCol("Nov")
    .setInputCol("Dec")
    .setInputCol("Jan")
    .setOutputCol("pageIndex")
    .fit(df);

OneHotEncoder encoder = new OneHotEncoder()
    .setInputCol("pageIndex")
    .setOutputCol("pageVec");

DataFrame indexed = indexer.transform(df);

DataFrame encoded = encoder.transform(indexed);
encoded.select("id", "pageIndex", "pageVec").show();


Could you please let me know what I'm doing wrong?


PS: My cluster is running Spark 1.3.0, which doesn't support StringIndexer
or OneHotEncoder, but for testing this I've installed 1.6.0 on my local
machine.
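
For reference, the layout that Jorge's suggestion (quoted below) appears to aim for is one row per (page, month) pair: the label is the unique-view count and the features are a one-hot page block followed by a one-hot month block. A minimal plain-Java sketch of that reshaping, without Spark. The class names, the slot order, and the use of full one-hot vectors (no dropped category) are assumptions made for illustration; the numbers are the sample values from the original question:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WideToLong {
    // Hypothetical container for one training example.
    static final class Example {
        final double label;       // unique views for one page in one month
        final double[] features;  // one-hot page slots, then one-hot month slots

        Example(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // views[p][m] holds the unique views of page p in month m.
    static List<Example> reshape(double[][] views) {
        int numPages = views.length;
        int numMonths = views[0].length;
        List<Example> out = new ArrayList<>();
        for (int p = 0; p < numPages; p++) {
            for (int m = 0; m < numMonths; m++) {
                double[] f = new double[numPages + numMonths];
                f[p] = 1.0;             // page slot
                f[numPages + m] = 1.0;  // month slot
                out.add(new Example(views[p][m], f));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows: pageA, pageB, pageZ; columns: UV_NOV, UV_DEC, UV_JAN.
        double[][] views = {
            {10000, 9989, 11000},
            {999, 500, 700},
            {200, 50, 34}
        };
        for (Example e : reshape(views)) {
            System.out.println(e.label + " -> " + Arrays.toString(e.features));
        }
    }
}
```

Which slot stands for which page or month is arbitrary, so the vectors here need not match the (0,0,1,0,0,1) example position for position.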

Cheers.


On 2 February 2016 at 10:25, Jorge Machado <jomach@me.com> wrote:

> Hi Guru,
>
> Any results ? :)
>
> On 01/02/2016, at 14:34, diplomatic Guru <diplomaticguru@gmail.com> wrote:
>
> Hi Jorge,
>
> Thank you for the reply and your example. I'll try your suggestion and
> will let you know the outcome.
>
> Cheers
>
>
> On 1 February 2016 at 13:17, Jorge Machado <jomach@me.com> wrote:
>
>> Hi Guru,
>>
>> First, transform your page names with OneHotEncoder (
>> https://spark.apache.org/docs/latest/ml-features.html#onehotencoder),
>> then do the same for the months.
>>
>> You will end up with something like this
>> (the first three are the page name, the others the month):
>> (0,0,1,0,0,1)
>>
>> Then you have your label, which is what you want to predict. At the end you
>> will have a LabeledPoint with (10000 -> (0,0,1,0,0,1)), which represents
>> (10000 -> (PageA, UV_NOV)).
>> After that, try a regression tree with
>>
>> val model = DecisionTree.trainRegressor(trainingData,
>> categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>>
>>
>> Regards
>> Jorge
>>
>> On 01/02/2016, at 12:29, diplomatic Guru <diplomaticguru@gmail.com>
>> wrote:
>>
>> Any suggestions please?
>>
>>
>> On 29 January 2016 at 22:31, diplomatic Guru <diplomaticguru@gmail.com>
>> wrote:
>>
>>> Hello guys,
>>>
>>> I'm trying to understand how I could predict the next month's page views
>>> based on the previous access pattern.
>>>
>>> For example, I've collected statistics on page views:
>>>
>>> e.g.
>>> Page,UniqueView
>>> -------------------------
>>> pageA, 10000
>>> pageB, 999
>>> ...
>>> pageZ,200
>>>
>>> I aggregate the statistics monthly.
>>>
>>> I've prepared a file containing the last 3 months, like this:
>>>
>>> e.g.
>>> Page,UV_NOV, UV_DEC, UV_JAN
>>> ---------------------------------------------------
>>> pageA, 10000,9989,11000
>>> pageB, 999,500,700
>>> ...
>>> pageZ,200,50,34
>>>
>>>
>>> Based on above information, I want to predict the next month (FEB).
>>>
>>> Which algorithm do you think would suit best? I think linear regression
>>> is the safe bet. However, I'm struggling to prepare this data for LR ML,
>>> especially how to set up the X,Y relationship.
>>>
>>> The Y is easy (unique visitors), but I'm not sure about the X (it should be
>>> Page, right?). Also, how do I plot those three months of data?
>>>
>>> Could you give me an example based on above example data?
>>>
>>>
>>>
>>> Page,UV_NOV, UV_DEC, UV_JAN
>>> ---------------------------------------------------
>>> 1, 10000,9989,11000
>>> 2, 999,500,700
>>> ...
>>> 26,200,50,34
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
