www-announce mailing list archives

From Sally Khudairi ...@apache.org>
Subject [ANNOUNCE] The Apache Software Foundation Announces Apache Nutch™ v2.0
Date Tue, 10 Jul 2012 12:00:35 GMT
[this announcement is available online at http://s.apache.org/wpS]

Enterprise-scale Open Source search framework used for crawling intranets to global Web indexing.

Forest Hill, MD –10 July 2012– The Apache Software Foundation (ASF), the all-volunteer
developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today
announced Apache Nutch v2.0.

Apache Nutch is a highly scalable search framework written in Java. It is built on several
Apache projects, including Solr™, Tika™, Hadoop™, and Gora™, among others, for crawling,
a link-graph database, and parsing support for HTML and an array of other document formats.

"Having been at the origin of Open Source superstars such as Apache Hadoop or Apache Tika,
Nutch now catches up with the NoSQL trends and adopts a table-like representation," said Apache
Nutch Vice President Julien Nioche.

Apache Nutch is lauded for its flexible scalability and extensibility, and is the go-to choice
for companies of all sizes, from start-ups and medium-sized businesses to large-scale organizations.

Under development for nearly two years, Nutch v2.0 covers many use cases, from small crawls
on a single machine to large-scale deployments on Hadoop clusters. "Importantly, Nutch
remains easy to customize thanks to its plugin architecture," explained Nioche. Its highly
modular architecture allows developers to create plug-ins for document parsing, ranking and

"We use Nutch 2.0 for crawling at web scale because it is flexible, well maintained and scales
with Hadoop. Crawling the Web in a robust, scalable and polite way may seem easy in theory.
But in practice, it's not that simple," said Mathijs Homminga, CTO of Kalooga. "The Web is
a wilderness and taming it requires knowledge and expertise on different levels. That's why
we initially chose Nutch: it runs out of the box and contains the results of many, many, many
lessons learned. It gave us a head start with crawling. But Nutch is not just a tool; Nutch
is a flexible crawling framework which we can extend and modify to our needs."

Nutch v2.0 offers users an edition focused on large-scale crawling that builds on storage
abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™,
Apache Cassandra™, Apache HBase™, Apache HDFS™ (Hadoop Distributed File System), an
in-memory data store, and various high-profile SQL stores.

"Our work on Nutch 2.0 gave birth to Apache Gora in the process, which it uses as an abstraction
over the storage backends," added Nioche. "This enhanced architecture makes Nutch not only
more efficient but also easier to integrate with external tools while still solving a large
range of use cases ranging from single-server setups to large-scale Internet crawlers hosted
in the cloud."

"2.0 has long been a community effort and something we've been eagerly anticipating," said
Chris A. Mattmann, Vice President of Apache Tika and Apache OODT. "Nutch 2.0's close integration
with Tika, and in turn, Tika's integration downstream into Apache OODT will undoubtedly bring
all of our communities closer together, and will assist in the big data challenges that those
in our projects regularly see. Nutch 2.0 makes full use of the latest features from Apache
Tika, including its parsing and content detection capabilities."

"The fact that Nutch is implemented on top of Hadoop is essential for us since it allows us
to be scalable in storage and processing -- have you ever tried to reparse a billion web pages
in a day?" stated Homminga. "Kalooga currently uses Nutch 2.0 in production, with the HBase
backend, on a 34-node Hadoop cluster. Our current collection holds around a billion web pages,
growing a few hundred million per month. We run indexes on Solr and elasticsearch. Kalooga
offers a visual relevance service for online publishers and Nutch is an essential part of
our technology stack."

"Nutch v2.0 is particularly exciting as it catches up with Apache projects like HBase, Cassandra,
and Accumulo," added Nioche. "The community's response to the earlier versions of v2.0 has
been very encouraging and we hope to see more and more people getting involved."

Availability and Oversight
Apache Nutch software is released under the Apache License v2.0, and is overseen by a self-selected
team of active contributors to the project. A Project Management Committee (PMC) guides the
Project's day-to-day operations, including community development and product releases. Apache
Nutch source code, documentation, mailing lists, and related resources are available at http://nutch.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading
Open Source projects, including Apache HTTP Server — the world's most popular Web server
software. Through the ASF's meritocratic process known as "The Apache Way," more than 400
individual Members and 3,500 Committers successfully collaborate to develop freely available
enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions
are distributed under the Apache License; and the community actively participates in ASF mailing
lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings,
and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations
and corporate sponsors including AMD, Basis Technology, Citrix, Cloudera, Facebook, GoDaddy,
Google, IBM, HP, Hortonworks, Huawei, Matt Mullenweg, Microsoft, PSW Group, SpringSource,
and Yahoo!. For more information, visit http://www.apache.org/.

"Apache", "Nutch", "Apache Nutch", "Accumulo", "Apache Accumulo", "Avro", "Apache Avro", "Cassandra",
"Apache Cassandra", "Gora", "Apache Gora", "Hadoop", "Apache Hadoop", "HBase", "Apache HBase",
"HDFS", "Apache HDFS", "Solr", "Apache Solr", "Tika", "Apache Tika", and "ApacheCon" are trademarks
of The Apache Software Foundation. All other brands and trademarks are the property of their
respective owners.

#  #  #
