nutch-dev mailing list archives

From: Ian Truslove
Subject: Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation
Date: Wed, 18 Jul 2012 15:01:39 GMT

Chris: message received - I signed up :)

As part of Ruth's Libre project we are using
Nutch to find various types of XML data.  We're targeting our search at
geospatial data, and more specifically cryospheric data, but the tools
will remain more broadly applicable.  Specifically we are looking for ESIP
data casts, collection casts, service casts, and ESIP Discovery OpenSearch
services (all the specs are in  These XML documents
and services are characterizable through fairly simple means such as XML

We are currently developing against the Nutch 1.4 tarball distribution
(SVN HEAD was moving quicker than our configuration could keep up with)
and plugging into a standalone Solr instance.

So far we have done some basic configuration work, set the code up to
play nice(-ish) with Eclipse, our internal SVN, and our CI/deployment
system, and written some plugins to help us find our various XML docs.
We wrote a pair to extract and index the full raw XML content of the
source document, extending the HtmlParseFilter and IndexingFilter
respectively.  The XML (and of course HTML too) is just wrapped within a
CDATA section (CDATA sections already within the document are just
removed) and indexed as a big text blob in Solr.  We can do naive text
matching and are having success extracting the URLs of the data feeds
we're after.
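
To make the wrapping step concrete, here is a minimal sketch in plain
Python.  It is not the plugin code and uses no Nutch or Solr API; the
function names, the URL pattern, and the sample document are all made up
for illustration.  It also assumes that "removed" means the whole inner
CDATA section is dropped, content included.  Inner sections have to go
one way or another, because CDATA sections cannot nest: an inner "]]>"
would end the outer wrapper early.

```python
import re

# Matches a whole CDATA section, markers and content included.
# Assumption: inner sections are dropped entirely, not just unwrapped.
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

# Deliberately naive URL matcher (hypothetical): a URL runs until the
# next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def wrap_raw_xml(raw):
    """Drop inner CDATA sections, then wrap the rest in one CDATA section."""
    without_cdata = CDATA_SECTION.sub("", raw)
    return "<![CDATA[" + without_cdata + "]]>"


def find_urls(blob):
    """Return every URL-looking substring of the blob, in document order."""
    return URL_PATTERN.findall(blob)


if __name__ == "__main__":
    doc = ('<feed><link href="http://example.org/cast.xml"/>'
           "<note><![CDATA[internal]]></note></feed>")
    blob = wrap_raw_xml(doc)
    print(blob)
    print(find_urls(blob))
```

Matching on the raw text like this is crude, but it needs no knowledge
of the individual cast schemas, which is the appeal of indexing one big
blob.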

We also wrote a pair of plugins to keep track of the original index date
of a document (the overarching use case is to determine which documents
are newly found).  We used the ScoringFilter and IndexingFilter for those.
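
The idea behind the date tracking can be sketched independently of
Nutch.  The snippet below is only an illustration of "remember the first
date, never overwrite it"; it keeps the dates in a plain dictionary,
whereas the real plugins work through Nutch's ScoringFilter and
IndexingFilter extension points.  Both function names and the example
URLs are hypothetical.

```python
from datetime import datetime, timezone


def record_first_indexed(first_seen, url, now=None):
    """Return the date the URL was first indexed, storing it on first sight.

    A later call for the same URL leaves the stored date untouched.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return first_seen.setdefault(url, now)


def newly_found(first_seen, since):
    """Return the URLs first indexed at or after `since`, sorted by URL."""
    return sorted(url for url, seen in first_seen.items() if seen >= since)


if __name__ == "__main__":
    earlier = datetime(2012, 7, 1, tzinfo=timezone.utc)
    later = datetime(2012, 7, 18, tzinfo=timezone.utc)
    seen = {}
    record_first_indexed(seen, "http://example.org/a.xml", earlier)
    record_first_indexed(seen, "http://example.org/a.xml", later)  # ignored
    record_first_indexed(seen, "http://example.org/b.xml", later)
    print(newly_found(seen, later))
```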

Planned work includes extracting data from the XML before indexing and
using Solr fields more effectively, indexing GCMD keywords, simple spatial
subsetting, and tweaking the ranking algorithms to do a broad search to
identify good sites for deep data searches.

Thanks for the interest - it's been a fun project to work on so far, and
I'm sure we'd be happy to talk more or provide more details.


Ian Truslove
Senior Software Engineer
National Snow and Ice Data Center
University of Colorado
449 UCB,  Boulder, CO 80309

On 7/17/12 9:38 PM, "Mattmann, Chris A (388J)" wrote:

>Hi Markus,
>Great question. I am CC'ing Ruth Duerr and Ian Truslove at
>NSIDC -- maybe they
>can provide more information?
>Ruth, Ian, please consider subscribing to and/or
>by sending blank emails to:
>To follow along in the conversation.
>Thanks all!
>On Jul 17, 2012, at 5:27 PM, Markus Jelsma wrote:
>> Cool!
>> What are they exactly doing with Apache Nutch? And, more interesting,
>>what non-standard stuff do they use?
>> Cheers
>> -----Original message-----
>>> From:Mattmann, Chris A (388J) <>
>>> Sent: Tue 17-Jul-2012 21:29
>>> To:
>>> Subject: Apache Nutch being used at National Snow and Ice Data Center:
>>>ESIP Federation
>>> Hey Folks,
>>> Ruth Duerr is presenting at today's ESIP Federation and Discovery
>>> The U.S. National Snow and Ice Data Center (NSIDC) is deploying Apache
>>>Nutch and 
>>> Solr to support discovery of datasets (called "casting").
>>> Really interesting stuff, and worth contacting Ruth and NSIDC if
>>>you're interested.
>>> I'm strongly suggesting to the NSIDC folks to try and contribute any
>>>updates or plugins
>>> they are making to the software upstream here to the ASF.
>>> Thanks!
>>> Cheers,
>>> Chris
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email:
>>> WWW:
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
