nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation
Date Wed, 18 Jul 2012 21:18:15 GMT
Hi Ian,

Thanks for sharing your work and experience. Do you use a fixed set of sites and data formats
or extensions for data extraction, or can you also discover new data casts on the web?

Cheers,

-----Original message-----
> From: Ian Truslove <ian.truslove@nsidc.org>
> Sent: Wed 18-Jul-2012 17:03
> To: Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov>; <dev@nutch.apache.org>
> Cc: Ruth Duerr <rduerr@nsidc.org>
> Subject: Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation
> 
> Chris: message received - I signed up :)
> 
> As part of Ruth's Libre project (http://nsidc.org/libre/) we are using
> Nutch to find various types of XML data.  We're targeting our search at
> geospatial data, and more specifically cryospheric data, but the tools
> will remain more broadly applicable.  Specifically we are looking for ESIP
> data casts, collection casts, service casts, and ESIP Discovery OpenSearch
> services (all the specs are in
> http://wiki.esipfed.org/index.php/Discovery_Cluster).  These XML documents
> and services are characterizable through fairly simple means such as XML
> namespaces.
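
A minimal sketch of how the namespace-based characterization described above might look. It is an illustration in Python, not the NSIDC plugin code (Nutch plugins are written in Java). The Atom and OpenSearch 1.1 namespace URIs are the published ones; the labels and function names are made up for this example, and any ESIP-specific namespace would have to be taken from the Discovery Cluster specs rather than from this sketch.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs; the labels on the right are hypothetical.
KNOWN_NAMESPACES = {
    "http://www.w3.org/2005/Atom": "atom-feed",
    "http://a9.com/-/spec/opensearch/1.1/": "opensearch-description",
}


def namespaces_in(xml_text):
    """Return the set of namespace URIs used by element tags in the document."""
    root = ET.fromstring(xml_text)
    found = set()
    for elem in root.iter():
        tag = elem.tag
        # ElementTree writes qualified tags as "{namespace-uri}localname".
        if isinstance(tag, str) and tag.startswith("{"):
            found.add(tag[1:].split("}", 1)[0])
    return found


def classify(xml_text):
    """Return the labels of all known namespaces the document uses.

    Documents that are not well-formed XML yield an empty set.
    """
    try:
        found = namespaces_in(xml_text)
    except ET.ParseError:
        return set()
    return {label for ns, label in KNOWN_NAMESPACES.items() if ns in found}
```

A crawler could call `classify()` on each fetched document and keep only those that return a non-empty set; more document types are added by extending the table.
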
> 
> We are currently developing against the Nutch 1.4 tarball distribution
> (SVN HEAD was moving quicker than our configuration could keep up with)
> and plugging into a standalone Solr instance.
> 
> To date we have done some basic configuration work, set the code up to
> play nice(-ish) with Eclipse, our internal SVN, and our CI/deployment
> system, and written some plugins to help us find our various XML docs.
> We wrote a pair to extract and index the full raw XML content
> of the source document, extending the HtmlParseFilter and IndexingFilter
> respectively.  XML (and of course HTML too) is just wrapped within a
> CDATA section (and CDATA sections within the document are just removed),
> and indexed as a big text blob in Solr.  We can do naive text matching and
> are having success extracting the URLs of the data feeds we're after.
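
The wrap-and-match step described above can be sketched roughly as follows. This is an illustration in Python, not the actual plugin pair. It assumes that "removed" means each nested CDATA section is dropped together with its contents, and the URL pattern and function names are hypothetical.

```python
import re

# Matches one CDATA section, including its contents (non-greedy, multi-line).
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

# Deliberately naive URL matcher: scheme, then anything up to whitespace,
# a quote, an angle bracket, or a closing square bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>\]]+")


def wrap_raw(content):
    """Drop nested CDATA sections, then wrap the whole document in one."""
    stripped = CDATA_SECTION.sub("", content)
    return "<![CDATA[" + stripped + "]]>"


def find_urls(blob):
    """Return every URL-looking substring of the text blob, in order."""
    return URL_PATTERN.findall(blob)
```

Removing the nested sections first matters because CDATA sections cannot nest: an inner `]]>` would otherwise end the outer wrapper early. A stray `]]>` without a matching opener would still break the wrapper, and this sketch does not handle that case.
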
> 
> We also wrote a pair of plugins to keep track of the original index date
> of a document (the overarching use case is to determine which documents
> are newly found).  We used the ScoringFilter and IndexingFilter for those.
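
The idea of keeping the original index date could be sketched like this. It only illustrates the logic (stamp a document once, never overwrite the stamp on a recrawl) and does not use the Nutch ScoringFilter or IndexingFilter APIs. The metadata key and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FIRST_INDEXED = "first_indexed"  # hypothetical per-document metadata key


def stamp(metadata, now):
    """Return a copy of the metadata with the first-indexed date set.

    An existing date is preserved, so a recrawl does not reset it.
    """
    stamped = dict(metadata)
    stamped.setdefault(FIRST_INDEXED, now.isoformat())
    return stamped


def newly_found(records, since):
    """Return the sorted URLs whose first-indexed date is at or after `since`."""
    return sorted(
        url
        for url, meta in records.items()
        if datetime.fromisoformat(meta[FIRST_INDEXED]) >= since
    )


if __name__ == "__main__":
    first = datetime(2012, 7, 1, tzinfo=timezone.utc)
    later = datetime(2012, 7, 18, tzinfo=timezone.utc)
    records = {
        "http://example.org/a": stamp(stamp({}, first), later),  # recrawled
        "http://example.org/b": stamp({}, later),                # new
    }
    print(newly_found(records, later - timedelta(days=1)))
```

Storing the date with the crawl metadata rather than only in the index means it survives a full reindex, which is one plausible reason to split the work across two plugins.
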
> 
> Planned work includes extracting data from the XML before indexing and
> using Solr fields more effectively, indexing GCMD keywords, simple spatial
> subsetting, and tweaking the ranking algorithms to do a broad search to
> identify good sites for deep data searches.
> 
> Thanks for the interest - it's been a fun project to work on so far, and
> I'm sure we'd be happy to talk more or provide more details.
> 
> -Ian.
> 
> 
> 
> --
> Ian Truslove
> Senior Software Engineer
> National Snow and Ice Data Center
> University of Colorado
> 449 UCB,  Boulder, CO 80309
> 
> 
> 
> 
> 
> 
> On 7/17/12 9:38 PM, "Mattmann, Chris A (388J)"
> <chris.a.mattmann@jpl.nasa.gov> wrote:
> 
> >Hi Markus,
> >
> >Great question. I am CC'ing Ruth Duerr and Ian Truslove at NSIDC -- maybe
> >they can provide more information?
> >
> >Ruth, Ian, please consider subscribing to dev@nutch.apache.org and/or
> >user@nutch.apache.org
> >by sending blank emails to:
> >
> >dev-subscribe@nutch.apache.org
> >user-subscribe@nutch.apache.org
> >
> >to follow along in the conversation.
> >
> >Thanks all!
> >
> >Cheers,
> >Chris
> >
> >On Jul 17, 2012, at 5:27 PM, Markus Jelsma wrote:
> >
> >> Cool!
> >> 
> >> What exactly are they doing with Apache Nutch? And, more interestingly,
> >>what non-standard stuff do they use?
> >> 
> >> Cheers
> >> 
> >> -----Original message-----
> >>> From:Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov>
> >>> Sent: Tue 17-Jul-2012 21:29
> >>> To: dev@nutch.apache.org
> >>> Subject: Apache Nutch being used at National Snow and Ice Data Center:
> >>>ESIP Federation
> >>> 
> >>> Hey Folks,
> >>> 
> >>> Ruth Duerr is presenting at today's ESIP Federation and Discovery
> >>>Hackathon:
> >>> 
> >>> http://commons.esipfed.org/node/424
> >>> 
> >>> The U.S. National Snow and Ice Data Center (NSIDC) is deploying Apache
> >>>Nutch and 
> >>> Solr to support discovery of datasets (called "casting").
> >>> 
> >>> Really interesting stuff, and worth contacting Ruth and NSIDC if
> >>>you're interested.
> >>> I'm strongly suggesting to the NSIDC folks that they try to contribute
> >>>any updates or plugins
> >>> they are making to the software upstream here at the ASF.
> >>> 
> >>> Thanks!
> >>> 
> >>> Cheers,
> >>> Chris
> >>> 
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Chris Mattmann, Ph.D.
> >>> Senior Computer Scientist
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 171-266B, Mailstop: 171-246
> >>> Email: chris.a.mattmann@nasa.gov
> >>> WWW:   http://sunset.usc.edu/~mattmann/
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Adjunct Assistant Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> 
> >>> 
> >
> >
> 
> 
> 
