Thanks very much for the clarification and advice! I'm working with the
Wikipedia dataset, so I'm using a somewhat 'static' space, and the intent
of the queries is to use the context of a spotted surface form to select
the most similar resource (Wikipedia page) from a set of possible
disambiguations. The surface form will then be linked to the Wikipedia page
that represents it. Therefore, my space doesn't need to contain information
learned from the queries.
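
For reference, a minimal sketch of that selection step in plain Python. It
assumes documents are rows, so a folded-in row is d^T x V x E^(-1), and that
U, V and sigma are hypothetical in-memory copies of the ssvd output (lists of
rows, plus the list of singular values). The function names are made up for
illustration and are not part of Mahout:

```python
import math


def fold_in(doc, V, sigma):
    """Project a sparse term vector into the learned space: d^T V Sigma^-1.

    doc   -- dict mapping term index -> weight (only terms that occur)
    V     -- list of rows, one k-dimensional row per term (hypothetical layout)
    sigma -- list of the k singular values
    """
    k = len(sigma)
    folded = [0.0] * k
    # Only the rows of V for terms present in the document are touched,
    # which is where the sparsity of the document vector pays off.
    for term, weight in doc.items():
        row = V[term]
        for j in range(k):
            folded[j] += weight * row[j]
    # The inverse of a diagonal matrix holds the reciprocals of its entries,
    # so multiplying by it is an element-wise division.
    return [folded[j] / sigma[j] for j in range(k)]


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_candidate(context, candidates, U, V, sigma):
    """Return the row index of U (a candidate page) closest to the context."""
    query = fold_in(context, V, sigma)
    return max(candidates, key=lambda i: cosine(query, U[i]))
```

Here `context` would be the sparse term vector built from the text around the
surface form, and `candidates` the row indices in U of the possible
disambiguation pages.
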
Cheers,
CH
On Fri, Jun 29, 2012 at 5:42 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> PS: of course, folding in a considerable amount of new data is not
> recommended, since when you fold in, you are not learning any new
> semantic space. You are only able to project new documents into the
> previously learned semantic space and keep measuring similarities to
> them in that space.
>
> (which sometimes is good if you want learning to happen in a quite
> strict semantic space and consider all new data just from the point of
> view of that space).
>
> On Fri, Jun 29, 2012 at 3:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
> > Yes. The fold-in formula is given in the link you mentioned, formulas
> > (2) and (3), of which you probably need only one, depending on which
> > way you are going. Usually you are folding in new documents (rows of
> > U), so you need formula (2) to add new folded-in rows.
> >
> > Also, as the comment implies, your new observation vector for a
> > document is very sparse (a document is unlikely to contain every token
> > you observed in the corpus), so the actual computation of (2) can be
> > optimized quite a bit if V is indexed row-wise and specific rows of V
> > (which are essentially dictionary vectors) can be yanked out very quickly.
> >
> > d
> >
> > On Fri, Jun 29, 2012 at 3:13 PM, Chris Hokamp <chris.hokamp@gmail.com>
> > wrote:
> >> Thanks for the quick response. So I will create a new diagonal matrix
> >> with the reciprocals of the singular values, and multiply by that. I
> >> took a look at the slides (very nice presentation!), but it seems that
> >> I won't even need to go this far, as I should be able to take
> >> E^(-1) x U^(T) x docvector, and U is available from the output of
> >> ssvd. I'm basing this assumption on pages 2/3 of [1].
> >>
> >> Thanks again for the help,
> >> Chris
> >>
> >> [1]
> >> https://cwiki.apache.org/MAHOUT/stochasticsingularvaluedecomposition.data/SSVDCLI.pdf
> >>
> >> On Fri, Jun 29, 2012 at 4:31 PM, Sean Owen <srowen@gmail.com> wrote:
> >>
> >>> Well the inverse of a diagonal matrix like that is just going to be a
> >>> diagonal matrix holding the reciprocals (1/x) of the values. That much
> >>> is easy. But you need to invert more than that to fold in.
> >>>
> >>> I admit even I don't know the details of the Mahout implementation
> >>> you're using, but I imagine the overall principle is the same as the
> >>> fold-in described in ... oh wait, look at that, in a preso I posted a
> >>> while ago: http://www.slideshare.net/srowen/matrixfactorization Look
> >>> at the last few slides; I think it's kind of a useful / simple way to
> >>> think of it.
> >>>
> >>> Sean
> >>>
> >>> On Fri, Jun 29, 2012 at 10:27 PM, Chris Hokamp <chris.hokamp@gmail.com>
> >>> wrote:
> >>> > Hi all,
> >>> >
> >>> > I'm trying to implement Latent Semantic Indexing using the Mahout
> >>> > ssvd tool, and I'm having trouble understanding how I can use the
> >>> > output of Mahout's ssvd to 'fold' new queries (documents) into the
> >>> > LSI space. Specifically, I can't find a way to multiply a vector
> >>> > representing a query by the inverse of the matrix of singular
> >>> > values; that is, I can't find a way to solve for the inverse of the
> >>> > diagonal matrix of singular values.
> >>> >
> >>> > I can generate the output matrices using ssvd, and compare
> >>> > document/term vectors using cosine similarity, but I'm stumped when
> >>> > it comes to folding a new document into the space.
> >>> >
> >>> > Any thoughts or guidance would be appreciated.
> >>> >
> >>> > Cheers,
> >>> > Chris
> >>>
>
