PS: Of course, folding in a considerable amount of new data is not
recommended, since folding in does not learn a new semantic space. You
are only able to project new documents into the previously learned
semantic space and keep measuring similarities to them in that space.
(That is sometimes a good thing, if you want learning to happen in one
fixed semantic space and want to consider all new data only from the
point of view of that space.)
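
For what it's worth, here is a minimal sketch of the fold-in in plain
Python. It assumes documents are rows of the input matrix (so rows of U
are document vectors and rows of V are term vectors) and that a new row
of U is computed as doc * V * Sigma^-1. The names fold_in, V_rows and
sigma are made up for illustration and are not Mahout APIs, and the
numbers are toy values, not real ssvd output.

```python
# Sketch only: fold one new document into an already-learned LSI space.
# V_rows and sigma are hypothetical stand-ins for what ssvd produces.

def fold_in(doc, V_rows, sigma):
    """Return the k-dimensional folded-in row of U for a new document.

    doc    : dict {term_index: weight}, the sparse term vector
    V_rows : dict {term_index: list of k floats}, V indexed row-wise
    sigma  : list of k singular values (the diagonal of Sigma)
    """
    k = len(sigma)
    acc = [0.0] * k
    # doc * V: only the rows of V for terms present in the document
    # contribute, so we never touch the rest of V.
    for term, weight in doc.items():
        row = V_rows.get(term)
        if row is None:
            continue  # term not seen at training time: no row in V
        for j in range(k):
            acc[j] += weight * row[j]
    # Multiplying by Sigma^-1 is just dividing by each singular value,
    # because the inverse of a diagonal matrix holds the reciprocals.
    return [acc[j] / sigma[j] for j in range(k)]


# Toy rank-2 example with made-up numbers.
V_rows = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.6, 0.8]}
sigma = [2.0, 4.0]
new_doc = {0: 2.0, 2: 5.0, 7: 1.0}  # term 7 is not in the dictionary
u_new = fold_in(new_doc, V_rows, sigma)
print(u_new)
```

The folded-in vector can then be compared to existing rows of U with
cosine similarity, the same way as for documents that were in the
original decomposition.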
On Fri, Jun 29, 2012 at 3:39 PM, Dmitriy Lyubimov wrote:
> Yes. The fold-in formulas are given in the link you mentioned, formulas
> (2) and (3), of which you probably need only one, depending on which
> way you are going. Usually you are folding in new documents (rows of
> U), so you need formula (2) to add new folded-in rows.
>
> Also, as the comment implies, your new observation vector for a document
> is very sparse (a document is unlikely to contain all the tokens you
> observed in the corpus), so the actual computation of (2) can be
> optimized quite a bit if V is indexed row-wise and the specific rows of
> V (which are essentially dictionary vectors) can be yanked out very
> quickly.
>
> -d
>
> On Fri, Jun 29, 2012 at 3:13 PM, Chris Hokamp wrote:
>> Thanks for the quick response. So I will create a new diagonal matrix with
>> the reciprocals of the eigenvalues, and multiply by that. I took a look at
>> the slides (very nice presentation!), but it seems that I won't even need
>> to go this far, as I should be able to take E^(-1) x U^(T) x docvector, and
>> U is available from the output of ssvd. I'm basing this assumption on pages
>> 2/3 of [1].
>>
>> Thanks again for the help,
>> Chris
>>
>> [1]
>> https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
>>
>> On Fri, Jun 29, 2012 at 4:31 PM, Sean Owen wrote:
>>
>>> Well the inverse of a diagonal matrix like that is just going to be a
>>> diagonal matrix holding the reciprocals (1/x) of the values. That much
>>> is easy. But you need to invert more than that to fold in.
>>>
>>> I admit even I don't know the details of the Mahout implementation
>>> you're using, but I imagine the overall principle is the same as the
>>> fold-in described in ... oh wait, look at that, in a preso I posted a
>>> while ago: http://www.slideshare.net/srowen/matrix-factorization Look
>>> at the last few slides; I think it's kind of a useful / simple way to
>>> think of it.
>>>
>>> Sean
>>>
>>> On Fri, Jun 29, 2012 at 10:27 PM, Chris Hokamp wrote:
>>> > Hi all,
>>> >
>>> > I'm trying to implement Latent Semantic Indexing using the mahout ssvd
>>> > tool, and I'm having trouble understanding how I can use the output of
>>> > ssvd Mahout to 'fold' new queries (documents) into the LSI space.
>>> > Specifically, I can't find a way to multiply a vector representing a
>>> > query by the inverse of the matrix of singular values - I can't find a
>>> > way to solve for the inverse of the diagonal matrix of singular values.
>>> >
>>> > I can generate the output matrices using ssvd, and compare document/term
>>> > vectors using cosine similarity, but I'm stumped when it comes to
>>> > folding a new document into the space.
>>> >
>>> > Any thoughts or guidance would be appreciated.
>>> >
>>> > Cheers,
>>> > Chris
>>>