tika-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: % of different content types out there on the web
Date Sun, 29 Jan 2012 16:29:25 GMT
That could be an interesting experiment to run on the Common Crawl dataset
with Tika on Behemoth, assuming of course that Tika detects the content
types correctly. Does anyone have a spare cluster on EC2 ;-) ?

Julien

On 28 January 2012 02:01, Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov> wrote:

> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> of the breakdown (by % or some other metric) of non-HTML content types
> out there on the web, based on a whole web crawl or a meaningful
> representative dataset.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
