mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: need help on mahout
Date Fri, 09 Nov 2012 18:02:01 GMT
OK. I guess you can try factorization first (on the user vs. page matrix) and
then use the user factor vectors as predictors for SGD. However, it will
not work well if your user/page matrix is too sparse. IMO you need to
prototype this approach in R first, before moving to scale, to see whether
you can get an acceptable result at all.
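
A minimal sketch of that prototype, assuming a tiny dense 0/1 matrix like the one in the example further down the thread. It is written in plain Python only for brevity (not R, and not the Mahout API); the function names, rank, learning rates and epoch counts are all illustrative assumptions, and the logistic regression trained by SGD stands in for whatever SGD classifier you end up using.

```python
import math
import random

# Toy visited/not-visited matrix from the thread: 5 users x 4 pages.
R = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Group labels for the labeled users (0 = group 1, 1 = group 2).
labels = [0, 0, 1, 1, 1]


def factorize(R, rank=2, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Approximate R by U * V^T with SGD on the squared error of every cell."""
    rng = random.Random(seed)
    n_users, n_pages = len(R), len(R[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_pages)]
    cells = [(u, p) for u in range(n_users) for p in range(n_pages)]
    for _ in range(epochs):
        rng.shuffle(cells)
        for u, p in cells:
            err = R[u][p] - sum(U[u][k] * V[p][k] for k in range(rank))
            for k in range(rank):
                uk, vk = U[u][k], V[p][k]
                U[u][k] += lr * (err * vk - reg * uk)
                V[p][k] += lr * (err * uk - reg * vk)
    return U, V


def train_logistic_sgd(X, y, epochs=500, lr=0.5, seed=0):
    """Binary logistic regression (weights plus bias) trained by plain SGD."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y[i]  # gradient of the log loss with respect to z
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


U, V = factorize(R)                    # user factors U, page factors V
w, b = train_logistic_sgd(U, labels)   # user factor vectors as predictors
predicted = [predict(w, b, u) for u in U]
print(predicted)
```

This only scores the users that were part of the factorization; a new user would first need a factor vector (for example by solving against the fixed page factors), which the sketch leaves out.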


On Fri, Nov 9, 2012 at 9:06 AM, qiaoresearcher <qiaoresearcher@gmail.com> wrote:

> You are absolutely right, but here I have simplified the problem. Content
> similarity can be regarded as one way to enrich the features. Features can be
> defined in many ways; here I would like to start with the simplest feature,
> visited or not. Later on I will add more features if the results do not
> meet expectations.
>
> On Fri, Nov 9, 2012 at 10:57 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > Sorry, you probably meant that anyway. Your training input should be labeled
> > by groups, and your prediction request input is not labeled.
> >
> > Looks like a job for a classifier like SGD, except that visited pages make a
> > poor categorical source without looking into their content similarities.
> > On Nov 9, 2012 8:49 AM, "Dmitriy Lyubimov" <dlieu.7@gmail.com> wrote:
> >
> > > If it is supervised classification, your input should contain the groups.
> > > The idea is that you extend the knowledge hidden in a smaller, perhaps
> > > expert-labeled, dataset to the rest of the universe.
> > > On Nov 9, 2012 8:43 AM, "qiaoresearcher" <qiaoresearcher@gmail.com> wrote:
> > >
> > >> It is a supervised classification problem.
> > >>
> > >> For example, a very simple case:
> > >> say, overall we collect 4 pages from the data set:
> > >> { web_page 1  web_page 2  web_page 3  web_page 4 }
> > >> then users may have input vectors like:
> > >> user1 [1 1  0  0]
> > >> user2 [1 1  0  0]
> > >> user3 [0 0  1  1]
> > >> user4 [0 0  1  1]
> > >> user5 [0 0  1  1]
> > >>   ...       ....
> > >>
> > >> then whatever classification algorithm mahout has should return
> > >> classification results as
> > >> group 1 { user1, user2}
> > >> group 2 { user3, user4, user5 }
> > >>
> > >>
> > >>
> > >> On Fri, Nov 9, 2012 at 10:29 AM, Sean Owen <srowen@gmail.com> wrote:
> > >>
> > >> > First: what question are you trying to answer from this data? You are
> > >> > trying to classify users into what, for what purpose?
> > >> >
> > >> >
> > >> > On Fri, Nov 9, 2012 at 4:20 PM, qiaoresearcher <qiaoresearcher@gmail.com> wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > Assume the data is stored in a gzip file which includes many text files.
> > >> > > Within each text file, each line represents an activity of a user, for
> > >> > > example, a click on a web page.
> > >> > > the text file will look like:
> > >> > >
> > >> > > ----------------------------------------------------------------------------------
> > >> > > user 1   time11  visiting_web_page11
> > >> > > user 2   time21  visiting_web_page21
> > >> > > user 1   time12  visiting_web_page12
> > >> > > user 1   time13  visiting_web_page13
> > >> > > user 2   time22  visiting_web_page22
> > >> > > user 3   time31  visiting_web_page31
> > >> > > user 1   time14  visiting_web_page14
> > >> > >  ...           ....                ..........
> > >> > >
> > >> > > I am thinking to first construct a web page set like
> > >> > > { visiting_web_page11, visiting_web_page12, visiting_web_page31, ....... }
> > >> > >
> > >> > > then for each user, we form a vector [ 1  0 0  1 0  0  ..... ] where
> > >> > > '1' means the user visited that page and 0 means he did not
> > >> > > then use mahout to classify the users based on the vectors
> > >> > >
> > >> > > Does Mahout have an example like this? If not, what kind of Java code
> > >> > > do we need to write to process this task?
> > >> > >
> > >> > > thanks for any suggestions in advance !
> > >> > >
> > >> >
> > >>
> > >
> >
>
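
The preprocessing described in the original question (one visited/not-visited vector per user, built from the click log) could be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code: `build_visit_vectors` is a made-up helper name, the sample log lines and page names are invented, and it assumes each line ends with a time field and a page field separated by whitespace, with everything before them being the user id.

```python
def build_visit_vectors(lines):
    """Return (pages, vectors): the sorted page list and a 0/1 vector per user."""
    visited = {}  # user id -> set of pages that user visited
    for line in lines:
        # Assumed layout: "<user id>  <time>  <page>"; the user id may contain spaces.
        parts = line.strip().rsplit(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, _time, page = parts
        visited.setdefault(user, set()).add(page)
    # Fixed page order, so every user's vector has the same columns.
    pages = sorted({p for user_pages in visited.values() for p in user_pages})
    vectors = {
        user: [1 if p in user_pages else 0 for p in pages]
        for user, user_pages in visited.items()
    }
    return pages, vectors


# Invented sample log in the layout shown in the question.
log = [
    "user 1   t11  page_a",
    "user 2   t21  page_a",
    "user 1   t12  page_b",
    "user 3   t31  page_c",
    "user 2   t22  page_b",
    "user 3   t32  page_d",
]
pages, vectors = build_visit_vectors(log)
for user in sorted(vectors):
    print(user, vectors[user])
```

Reading the lines out of the gzip archive is left out here; for real data the page set will be large and the vectors very sparse, so a sparse representation would be the more realistic choice.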
