tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject % of different content types out there on the web
Date Sat, 28 Jan 2012 02:01:26 GMT
(sorry for the cross post)

Hey Guys,

I'm trying to find a good citation or estimate (if anyone has done one) that estimates
the breakout (by % or some other metric) of content types out there out the web
(with a whole web crawl or a meaningful representative dataset) that are non HTML.

Anyone have any ideas about this?

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Mime
View raw message