nutch-dev mailing list archives

From Ruth Duerr <rdu...@nsidc.org>
Subject Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation
Date Wed, 18 Jul 2012 21:24:44 GMT
Hi Markus,

We are just starting this project.  The real goal is to be able to find new or updated data
casts wherever they are on the web.  We haven't gotten there yet.  Our concept is a
broad but shallow crawl of the web to find interesting sites, followed by a deep crawl of the
interesting sites found.

Ruth


On Jul 18, 2012, at 4:18 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:

> Hi Ian,
> 
> Thanks for sharing your work and experience. Do you use a fixed set of sites and data
> formats or extensions for data extraction or can you also discover new data casts on the web?
> 
> Cheers,
> 
> 
> 
> -----Original message-----
>> From:Ian Truslove <ian.truslove@nsidc.org>
>> Sent: Wed 18-Jul-2012 17:03
>> To: Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov>; <dev@nutch.apache.org>
>> Cc: Ruth Duerr <rduerr@nsidc.org>
>> Subject: Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation
>> 
>> Chris: message received - I signed up :)
>> 
>> As part of Ruth's Libre project (http://nsidc.org/libre/) we are using
>> Nutch to find various types of XML data.  We're targeting our search at
>> geospatial data, and more specifically cryospheric data, but the tools
>> will remain more broadly applicable.  Specifically we are looking for ESIP
>> data casts, collection casts, service casts, and ESIP Discovery OpenSearch
>> services (all the specs are in
>> http://wiki.esipfed.org/index.php/Discovery_Cluster).  These XML documents
>> and services are characterizable through fairly simple means such as XML
>> namespaces.
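
To illustrate the namespace-based characterization described above, here is a minimal sketch in Python (not the Libre project's actual plugin code, which is not shown in this thread). It collects the namespace URIs declared in a document and maps them to labels. The Atom and OpenSearch URIs are the published ones; the ESIP entry is a hypothetical placeholder, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Namespace URI -> label. The Atom and OpenSearch URIs are the published ones;
# the ESIP entry is a HYPOTHETICAL placeholder, not the real ESIP namespace.
KNOWN_NAMESPACES = {
    "http://www.w3.org/2005/Atom": "atom",
    "http://a9.com/-/spec/opensearch/1.1/": "opensearch",
    "http://example.org/hypothetical-esip": "esip",
}


def namespaces_in(xml_bytes):
    """Return the set of namespace URIs declared anywhere in the document."""
    found = set()
    try:
        # "start-ns" events yield (prefix, uri) pairs for each declaration.
        for _event, (_prefix, uri) in ET.iterparse(
            BytesIO(xml_bytes), events=("start-ns",)
        ):
            found.add(uri)
    except ET.ParseError:
        # Not well-formed XML (for example, ordinary HTML): report nothing.
        return set()
    return found


def classify(xml_bytes):
    """Return the sorted labels of the known namespaces a document declares."""
    return sorted(
        KNOWN_NAMESPACES[uri]
        for uri in namespaces_in(xml_bytes)
        if uri in KNOWN_NAMESPACES
    )
```

A crawler could call `classify` on each fetched document and keep only those that return a non-empty list; documents that are not well-formed XML simply return no labels.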
>> 
>> We are currently developing against the Nutch 1.4 tarball distribution
>> (SVN HEAD was moving quicker than our configuration could keep up with)
>> and plugging into a standalone Solr instance.
>> 
>> To date we have done some basic configuration work, set the
>> code up to play nice(-ish) with Eclipse, our internal SVN, and our
>> CI/deployment system, and written some plugins to help us find our various
>> XML docs.  We wrote a pair to extract and index the full raw XML content
>> of the source document, extending the HtmlParseFilter and IndexingFilter
>> respectively.  The XML (and of course HTML too) is just wrapped within a
>> CDATA section (and CDATA sections within the document are just removed),
>> and indexed as a big text blob in Solr.  We can do naive text matching and
>> are having success extracting the URLs of the data feeds we're after.
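
The CDATA wrapping step can be sketched roughly as follows, in Python rather than as a Nutch plugin. Two assumptions are made here: "removed" is read as dropping each nested CDATA section entirely, and a guard is added (not mentioned in the message) that splits any stray `]]>` so it cannot close the wrapper early. The function name is illustrative.

```python
import re

# Matches one complete nested CDATA section, including its content.
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)


def wrap_raw_markup(raw):
    """Wrap raw XML/HTML text in a single CDATA section for indexing."""
    # Assumption: nested CDATA sections are dropped whole, content included.
    stripped = CDATA_SECTION.sub("", raw)
    # Guard (an addition, not from the message): a leftover "]]>" would end
    # the wrapper early, so split it across two adjacent CDATA sections.
    safe = stripped.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"
```

For example, `wrap_raw_markup('<a><![CDATA[x < y]]><b/></a>')` produces `<![CDATA[<a><b/></a>]]>`, which can then be placed inside an XML field element and sent to the index as plain text.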
>> 
>> We also wrote a pair of plugins to keep track of the original index date
>> of a document (the overarching use case is to determine documents that are
>> newly found).  We used the ScoringFilter and IndexingFilter for those.
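
The idea of keeping the original index date can be sketched without any Nutch APIs: store a timestamp the first time a URL is indexed and never overwrite it on later crawls. The dictionary store and function names below are hypothetical stand-ins for whatever the real plugins persist in the crawl metadata and the index.

```python
from datetime import datetime, timezone


def record_first_seen(first_seen, url, now):
    """Store `now` for `url` only if the URL is new; return the stored date."""
    return first_seen.setdefault(url, now)


def newly_found(first_seen, since):
    """Return the sorted URLs whose original index date is at or after `since`."""
    return sorted(url for url, seen in first_seen.items() if seen >= since)


# Usage: a recrawl on a later day does not change the original date.
store = {}
day1 = datetime(2012, 7, 17, tzinfo=timezone.utc)
day2 = datetime(2012, 7, 18, tzinfo=timezone.utc)
record_first_seen(store, "http://example.org/a.xml", day1)
record_first_seen(store, "http://example.org/a.xml", day2)  # keeps day1
record_first_seen(store, "http://example.org/b.xml", day2)
```

With this store, `newly_found(store, day2)` lists only `http://example.org/b.xml`, since the first URL keeps its earlier date.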
>> 
>> Planned work includes extracting data from the XML before indexing and
>> using Solr fields more effectively, indexing GCMD keywords, simple spatial
>> subsetting, and tweaking the ranking algorithms to do a broad search to
>> identify good sites for deep data searches.
>> 
>> Thanks for the interest - it's been a fun project to work on so far, and
>> I'm sure we'd be happy to talk more or provide more details.
>> 
>> -Ian.
>> 
>> 
>> 
>> --
>> Ian Truslove
>> Senior Software Engineer
>> National Snow and Ice Data Center
>> University of Colorado
>> 449 UCB,  Boulder, CO 80309
>> 
>> 
>> 
>> 
>> 
>> 
>> On 7/17/12 9:38 PM, "Mattmann, Chris A (388J)"
>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>> 
>>> Hi Markus,
>>> 
>>> Great question. I am CC'ing Ruth Duerr and Ian Truslove at NSIDC --
>>> maybe they can provide more information?
>>> 
>>> Ruth, Ian, please consider subscribing to dev@nutch.apache.org and/or
>>> user@nutch.apache.org
>>> by sending blank emails to:
>>> 
>>> dev-subscribe@nutch.apache.org
>>> user-subscribe@nutch.apache.org
>>> 
>>> to follow along with the conversation.
>>> 
>>> Thanks all!
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> On Jul 17, 2012, at 5:27 PM, Markus Jelsma wrote:
>>> 
>>>> Cool!
>>>> 
>>>> What exactly are they doing with Apache Nutch? And, more interestingly,
>>>> what non-standard stuff do they use?
>>>> 
>>>> Cheers
>>>> 
>>>> -----Original message-----
>>>>> From:Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov>
>>>>> Sent: Tue 17-Jul-2012 21:29
>>>>> To: dev@nutch.apache.org
>>>>> Subject: Apache Nutch being used at National Snow and Ice Data Center:
>>>>> ESIP Federation
>>>>> 
>>>>> Hey Folks,
>>>>> 
>>>>> Ruth Duerr is presenting at today's ESIP Federation and Discovery
>>>>> Hackathon:
>>>>> 
>>>>> http://commons.esipfed.org/node/424
>>>>> 
>>>>> The U.S. National Snow and Ice Data Center (NSIDC) is deploying Apache
>>>>> Nutch and 
>>>>> Solr to support discovery of datasets (called "casting").
>>>>> 
>>>>> Really interesting stuff, and worth contacting Ruth and NSIDC if
>>>>> you're interested.
>>>>> I'm strongly encouraging the NSIDC folks to try to contribute any
>>>>> updates or plugins
>>>>> they are making to the software upstream here to the ASF.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Cheers,
>>>>> Chris
>>>>> 
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Senior Computer Scientist
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 171-266B, Mailstop: 171-246
>>>>> Email: chris.a.mattmann@nasa.gov
>>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 
>>>>> 
>>> 
>>> 
>> 
>> 
>> 
