mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Clustering a large crawl
Date Tue, 05 Jun 2012 22:45:54 GMT
Saying that information is lost is another way of saying that the metric is
approximate.

The loss may be good if it is likely to improve generalization to new data
(and if new data are important to you) or may be bad if we are throwing
away real structure.

On Tue, Jun 5, 2012 at 6:35 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> As one paper puts it "For many datasets, the first several PCs explain
> most of the variance, so that the rest can be disregarded with minimal loss
> of information." Often much of the information lost is equivalent to noise,
> no?
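
To make both points concrete, here is a minimal sketch in Python with NumPy. It is not taken from the thread or from Mahout. The synthetic data (3 latent dimensions embedded in 20, plus small noise), the helper name pca_explained_variance, and the choice k = 3 are all assumptions made for illustration. It computes the fraction of variance explained by each PC and compares a pairwise distance before and after truncation, which is the sense in which the reduced metric is approximate.

```python
import numpy as np

def pca_explained_variance(X):
    """Hypothetical helper: PCA via SVD of the centered data matrix.

    Returns the principal directions (rows of Vt) and the fraction of
    total variance explained by each one, in decreasing order.
    """
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (X.shape[0] - 1)
    return Vt, var / var.sum()

# Assumed synthetic data: 3 latent dimensions of real structure,
# mixed into 20 observed dimensions, plus low-amplitude noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

Vt, ratio = pca_explained_variance(X)

k = 3  # number of PCs kept (an assumption; chosen to match the latent rank)
Z = (X - X.mean(axis=0)) @ Vt[:k].T  # coordinates in the reduced space

# The reduced-space distance approximates the full-space distance.
# Because the projection is orthogonal, it can only shrink distances.
d_full = np.linalg.norm(X[0] - X[1])
d_reduced = np.linalg.norm(Z[0] - Z[1])

print("variance explained by first", k, "PCs:", ratio[:k].sum())
print("distance, full space:   ", d_full)
print("distance, reduced space:", d_reduced)
```

In this constructed case the discarded components are noise by design, so truncation loses little. On real data the discarded components may also carry structure, and whether the loss helps or hurts has to be checked against the task, for example by comparing clustering quality on held-out data at several values of k.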
