mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Beginner questions on clustering & M/R
Date Sat, 17 Jul 2010 23:36:44 GMT
Just speaking heuristically, time series data is very high dimensional.  For
the equities market, you have (at least) daily samples on nearly 10,000
publicly traded stocks.  With only 3 years of data, that gives you 10
million dimensions.  With 30 years of data, things are obviously 10x worse.
If you include options, futures and commodities, things get vastly worse.
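The dimensionality arithmetic above can be checked directly (assuming roughly 252 trading days per year, a common convention not stated in the post):

```python
# Rough dimensionality of raw daily equity time series.
# 252 trading days per year is an assumed convention.
stocks = 10_000
trading_days_per_year = 252

dims_3y = stocks * 3 * trading_days_per_year    # ~7.6 million, order 10 million
dims_30y = stocks * 30 * trading_days_per_year  # ~76 million

print(dims_3y, dims_30y)
```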

Even more problematic, the direct time series data is not translation
invariant.  This means that learning something about the past only teaches
you about the past.  The direct prices are not even magnitude invariant,
which is the motivation for studying first-order differences or for using
the log of the prices.
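As a concrete illustration (with made-up prices), first differences of log prices are log returns, which are invariant to the overall price scale:

```python
import numpy as np

# Hypothetical daily closing prices for one stock.
prices = np.array([100.0, 102.0, 101.0, 105.0])

log_prices = np.log(prices)
log_returns = np.diff(log_prices)   # first differences of log prices

# Rescaling all prices by a constant factor (a change of units)
# leaves the log returns unchanged:
scaled_returns = np.diff(np.log(10.0 * prices))
print(np.allclose(log_returns, scaled_returns))  # True
```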

These properties make any kind of learning approach pretty difficult.

So... the requirement is to decrease the dimensionality somehow.
Essentially, that means taking those thousands of samples of thousands of
equities and describing them in a much more compact form of some kind.
Hopefully, this compact representation has important components that are
slowly varying, so that predictions made using these components have a
reasonable range into the future.

There are lots of kinds of dimensionality reduction that you can try.  The
three general categories that I would think of off-the-cuff are: combined
frequency/time representations like wavelets (the Gabor transforms that I
mentioned are in this category); SVD techniques, which might be able to
decode industry sectors that move together; or more general probabilistic
latent variable techniques.  Combinations of these are also plausible.

If you take the SVD stuff in particular, you would start with, say, your
equity data in a matrix.  Each row would represent a different equity and
each column would represent a single time value.  Since equities appear and
disappear, you would have significant numbers of missing observations.  To
deal with the exponential growth phenomena associated with economic entities
in general, I would recommend starting with the log of the price.  If you
take the partial SVD decomposition of this matrix in a fashion suitably
adjusted for the missing values, you will have a left singular matrix that
transforms stocks into the internal representation and a right singular
matrix that encodes time-based patterns of price movement.  The SVD
expresses the price movements of individual stocks in terms of linear
combinations of these time-based patterns.
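A minimal sketch of that setup with NumPy (synthetic data; the row-mean filling of missing entries here is a crude stand-in for a proper missing-value-aware decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_days, k = 50, 200, 5

# Synthetic log prices: a few shared time-based patterns plus noise.
patterns = rng.normal(size=(k, n_days))
loadings = rng.normal(size=(n_stocks, k))
log_prices = loadings @ patterns + 0.1 * rng.normal(size=(n_stocks, n_days))

# Equities appear and disappear: knock out ~10% of the observations,
# then fill each missing entry with its stock's mean (a crude adjustment).
mask = rng.random((n_stocks, n_days)) < 0.1
filled = log_prices.copy()
row_means = np.nanmean(np.where(mask, np.nan, log_prices), axis=1)
filled[mask] = np.take(row_means, np.nonzero(mask)[0])

# Partial SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k]

# U_k maps stocks into the internal representation; the rows of Vt_k
# are the time-based patterns of price movement.
approx = U_k @ np.diag(s_k) @ Vt_k
print(approx.shape)  # (50, 200)
```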

At this level, you can use the system as a method for detecting when a stock
starts to deviate from its cohort.  This might be an interesting signal, for
example, to alert you to examine something more carefully.
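One simple way to detect that kind of deviation (a sketch on synthetic data, not from the post) is to look at the per-stock residual after a low-rank SVD reconstruction: a stock whose prices stop being explained by the shared patterns stands out.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stocks, n_days, k = 40, 150, 3

# A synthetic cohort: stocks driven by k shared time-based patterns...
patterns = rng.normal(size=(k, n_days))
X = rng.normal(size=(n_stocks, k)) @ patterns \
    + 0.05 * rng.normal(size=(n_stocks, n_days))

# ...except stock 0, which we perturb with a drift the cohort lacks.
X[0] += np.linspace(0.0, 3.0, n_days)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Per-stock residual: how badly the shared patterns explain each stock.
residual = np.linalg.norm(X - approx, axis=1)
print(int(np.argmax(residual)))  # the deviating stock
```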

If you include various leading economic indicators in your data as well as
simple equity prices, then you begin to get some predictive power.  This is
especially true if you include the leading indicators in a delayed form so
that their predictive effect can be recognized and encoded by the SVD.
Another trick is to build the SVD initially using just a moderate number of
indicators in lagged form, combined with a few strong indicators of current
conditions.  That will give you right singular vectors that are associated
with general patterns of economic activity.  You can then use those right
singular vectors to derive a matrix of approximate left singular vectors for
the equities of interest.  What you have done at this point is to shoe-horn
the equity prices into a shoe made out of general economic indicators that
are suitably lagged so as to force the model induced by this approximate
SVD to be as predictive as possible.
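The projection step can be sketched as follows (synthetic data; `Vk` comes from an SVD built on indicator series only, and the equities are then expressed in that economic basis):

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, k = 120, 4

# Indicator matrix: rows are (lagged) economic indicators over time.
indicators = rng.normal(size=(10, n_days))
Ub, sb, Vbt = np.linalg.svd(indicators, full_matrices=False)
Vk, sk = Vbt[:k].T, sb[:k]          # time-based economic patterns

# Equity log prices (synthetic), projected into the indicator basis
# to derive approximate left singular vectors for the equities:
equities = rng.normal(size=(30, n_days))
approx_left = equities @ Vk / sk    # shape (30, k)

# Reconstructing the equities from economic patterns alone:
recon = (approx_left * sk) @ Vk.T
print(recon.shape)  # (30, 120)
```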

This is just an outline of how these techniques can be used.  To make
successful models along these lines will take a LOT of detail work.  For
instance, the details of how you express the prices in the beginning are a
big deal.  Another issue is how you express the lagged indicators.  Just
time shifting them is unlikely to be successful.  Convolving with a delay
filter (or several such) that is structured based on expert opinions is
probably much better.  A huge over-arching issue is how to deal with the
fact that if you pick over your data hundreds of times, you may well no
longer be predicting anything but the idiosyncrasies of the past, due to
overfitting.
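Convolving an indicator with a delay filter can be sketched like this (the filter shape below is made up for illustration; in practice it would be structured from expert opinion about the lag):

```python
import numpy as np

rng = np.random.default_rng(3)
indicator = rng.normal(size=200)

# A hypothetical delay filter: weight spread over lags of 5..15 days,
# peaking at a 10-day lag (the shape is an assumption, not from the post).
lags = np.arange(16)
weights = np.where(lags >= 5, np.exp(-0.5 * ((lags - 10) / 3.0) ** 2), 0.0)
weights /= weights.sum()            # normalize to unit gain

# delayed[t] is a weighted average of indicator[t-5] .. indicator[t-15];
# truncating the 'full' convolution keeps the series causally aligned.
delayed = np.convolve(indicator, weights, mode='full')[:len(indicator)]
print(delayed.shape)  # (200,)
```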

I hope this helps.

On Sat, Jul 17, 2010 at 2:11 PM, Florent Empis <> wrote:

> On the SVD part... why would that help?
> Thanks  for your input:)
