tika-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: % of different content types out there on the web
Date Tue, 31 Jan 2012 12:39:12 GMT
We only crawl HTML and PDF files for a lot of ccTLDs, so we only have data on 
those two. We also explicitly filter out most, if not all, unwanted suffixes. 
We do have a lot of suffixes that we have encountered so far, though.

On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
> (sorry for the cross post)
> 
> Hey Guys,
> 
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates the breakout (by % or some other metric) of content types
> out there on the web (with a whole web crawl or a meaningful
> representative dataset) that are non-HTML.
> 
> Anyone have any ideas about this?
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex
