another piece of information is that according to investigation we
did, using q=0 seems to underestimate singular values somewhat and
produces a little better decay than optimal values would have. which
means variance retained would err on the optimistic side a bit
(showing a little bit more variance retained than actual).
also i think retaining 99% is reasonable with big corpuses. this
desirable threshold could probably be reduced a little bit to
accomodate pragmatic resource capacity.
On Fri, Aug 10, 2012 at 2:42 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> note that to estimate variance retained approximately only, you
> probably don't need to run it with q=1 so you can run with q=0 and not
> request either V or U but singular values only. That will reduce
> running time dramatically (perhaps up to 25 times faster compared to
> with q=1 and U and V).
>
> On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> With PCA, there's a metric, something called "variance retained".
>>
>> One idea of mine to estimate k is described in footnote discussion on
>> page 5. While it is not possible to compute "PCA variance retained"
>> metric exactly with an application of a thin SVD (the metric assumes
>> use of a full rank SVD) it is possible to infer upper estimate for k
>> given target variance retained (say, 99%) or try some sort of
>> polynomialy approximated value for the sum of all singular values
>> given visible decay. Probably requires some simple code in R or matlab
>> to get reasonable estimate.
>>
>> This technique requires running PCA one time and then estimate
>> sufficient k given singular values produced on your corpus. If the
>> action is repetetive and corpus is not changing drastically, you may
>> infer if you spending too much (or too little) on k for future uses.
>>
>> But pragmatically i just use the best k my cluster can compute in the
>> time i need. my corpus is relatively small and i don't run full corpus
>> run too often so i can afford some time spent.
>>
>> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>> The builtin PCA option is one reason I wanted to try SSVD. Building the test
was to make sure I understood the external matrix operations before diving in. I expect one
primary decision is how to choose k for reduction. I'm hoping to get some noise rejection
so not using it for reduced matrix size so much. We are starting with m = 500,000 and a million
or so docs. We get many dups and low value docs in a small web crawl, so lots of noise.
>>>
>>> You mention in your paper:
>>>
>>> "The valueof k + p directly impacts running time and memory requirements.
>>> k+p=500 is probably more than reasonable. Typically k + p
>>> is taken within range 20…200"
>>>
>>> So I guess we might start with
>>> p 15 (default)
>>> q 1
>>> k 200
>>>
>>> Is there any use in hand inspecting the eigen vectors before choosing the final
k? If so do you get those by choosing k nearly = m or is something like k = 1000 (or ?) good
enough to for inspection?
>>>
>>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>
>>> BTW if you really are trying to reduce dimensionality, you may want to
>>> consider pca option with SSVD, that [i think] will provide with much
>>> better preserved data variance then just clean SVD (i.e. essentially
>>> run a PCA space transformation on your data rather than just SVD)
>>>
>>> d
>>>
>>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>> Got it. Well on to some real and much larger data sets then…
>>>>
>>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:
>>>>
>>>> i think actually Mahout's Lanczos requires external knowledge of input
>>>> size too, in part for similar reasons. SSVD doesn't because it doesn't
>>>> have "other" reasons to know input size but fundamental assumption
>>>> rank(input)>=rank(thin SVD) still stands about the input but the
>>>> method doesn't have a goal of verifying it explicitly (which would be
>>>> kind of hard), and instead either produces 0 eigenvectors or runs into
>>>> block deficiency.
>>>>
>>>> It is however hard to assert whether block deficiency stemmed from
>>>> input size deficiency vs. split size deficiency, and neither of
>>>> situations is typical for a reallife SSVD applications, hence error
>>>> message is somewhat vague.
>>>>
>>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:
>>>>> The easy answer is to ensure (k+p)<= m. It is mathematical constraint,
>>>>> not a method pecularity.
>>>>>
>>>>> The only reason the solution doesn't warn you explicitly is because
>>>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>>>> would not provide us with an easy way to verify what m actually is
>>>>> before it actually iterates over it and runs into block size
>>>>> deficiency. So if you now m as an external knowledge, it is easy to
>>>>> avoid being trapped by block height defiicency.
>>>>>
>>>>>
>>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pat.ferrel@gmail.com>
wrote:
>>>>>> This is only a test with some trivially simple data. I doubt there
are any splits and yes it could easily be done in memory but that is not the purpose. It is
based on testKmeansDSVD2, which is in
>>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>>>> I've attached the modified and running version with testKmeansDSSVD
>>>>>>
>>>>>> As I said I don't think this is a real world test. It tests that
the code runs, and it does. Getting the best results is not part of the scope. I just thought
if there was an easy answer I could clean up the parameters for SSVDSolver.
>>>>>>
>>>>>> Since it is working I don't know that it's worth the effort unless
people are likely to run into this with larger data sets.
>>>>>>
>>>>>> Thanks anyway.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:
>>>>>>
>>>>>> It happens because of internal constraints stemming from blocking.
it
>>>>>> happens when a split of A (input) has less than (k+p) rows at which
>>>>>> point blocks are too small (or rather, to short) to successfully
>>>>>> perform a QR on .
>>>>>>
>>>>>> This also means, among other things, k+p cannot be more than your
>>>>>> total number of rows in the input.
>>>>>>
>>>>>> It is also possible that input A is way too wide or k+p is way too
big
>>>>>> so that an arbitrary split does not fetch at least k+p rows of A,
but
>>>>>> in practice i haven't seen such cases in practice yet. If that
>>>>>> happens, there's an option to increase minSplitSize (which would
>>>>>> undermine MR mappers efficiency somewhat). But i am pretty sure
it is
>>>>>> not your case.
>>>>>>
>>>>>> But if your input is shorter than k+p, then it is a case too small
for
>>>>>> SSVD. in fact, it probably means you can solve test directly in memory
>>>>>> with any solver. You can still use SSVD with k=m and p=0 (I think)
in
>>>>>> this case and get exact (nonreduced rank) decomposition equivalent
>>>>>> with no stochastic effects, but that is not what it is for really.
>>>>>>
>>>>>> Assuming your input is m x n, can you tell me please what your m,
n, k
>>>>>> and p are?
>>>>>>
>>>>>> thanks.
>>>>>> D
>>>>>>
>>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pat.ferrel@gmail.com>
wrote:
>>>>>>> There seems to be some internal constraint on k and/or p, which
is making a test difficult. The test has a very small input doc set and choosing the wrong
k it is very easy to get a failure with this message:
>>>>>>>
>>>>>>> java.lang.IllegalArgumentException: new m can't be less than
n
>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>
>>>>>>> I have a working test but I had to add some docs to the test
data and have tried to reverse engineer the value for k (desiredRank). I came up with the
following but I think it is only an accident that it works.
>>>>>>>
>>>>>>> int p = 15; //default value for CLI
>>>>>>> int desiredRank = sampleData.size()  p  1;//number of docs
 p  1, ?????? not sure why this works
>>>>>>>
>>>>>>> This seems likely to be an issue only because of the very small
data set and the relationship of rows to columns to p to k. But for the purposes of creating
a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the
dimensions of the tiny data set it would help.
>>>>>>>
>>>>>>> This test is derived from a nonactive SVD test but I'd be up
for cleaning it up and including it as an example in the working but nonactive tests. I also
fixed a couple trivial bugs in the nonactive Lanczos tests for what it's worth.
>>>>>>>
>>>>>>>
>>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:
>>>>>>>
>>>>>>> Reading "overview and usage" doc linked on that page
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>>>> should help to clarify outputs and usage.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:
>>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pat.ferrel@gmail.com>
wrote:
>>>>>>>>> Quoth Grant Ingersoll:
>>>>>>>>>> To put this in bin/mahout speak, this would look
like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>>>>>
>>>>>>>>>> bin/mahout svd (original > svdOut)
>>>>>>>>>> bin/mahout cleansvd ...
>>>>>>>>>> bin/mahout transpose svdOut > svdT
>>>>>>>>>> bin/mahout transpose original > originalT
>>>>>>>>>> bin/mahout matrixmult originalT svdT > newMatrix
>>>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>>>
>>>>>>>>> I'm trying to create a test case from testKmeansDSVD2
to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>>>>
>>>>>>>> No
>>>>>>>>
>>>>>>>>> if so where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS?
Seems like they should be in V* but SSVD creates V so should I transpose V* then run it through
the EigenVerificationJob?
>>>>>>>> no
>>>>>>>>
>>>>>>>> SSVD is SVD, meaning it produces U and V with no further
need to clean that
>>>>>>>>
>>>>>>>>> I get errors when I do so trying to figure out if I'm
on the wrong track.
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
