mahout-user mailing list archives

From Grant Ingersoll <>
Subject Fwd: [SIG-IRList] CFP: INEX 2009 - Clustering Task for collection Selection
Date Wed, 01 Jul 2009 13:06:33 GMT

Begin forwarded message:

> From: Richi Nayak <>
> Date: June 25, 2009 9:47:42 PM EDT
> To: "" <>
> Cc: Richi Nayak <>
> Subject: [SIG-IRList] CFP: INEX 2009 - Clustering Task for  
> collection Selection
> This is a call for participation in the XML Clustering Task at INEX
> 2009. The INEX 2009 clustering task is an evaluation forum that
> provides a platform for measuring the performance of clustering
> methods for collection selection on a large-scale test collection
> (consisting of a set of documents, their labels, a set of
> information needs (queries), and the answers to those information
> needs).
> In the last decade, we have observed a proliferation of approaches  
> for clustering XML documents based on their structure and content.  
> There have been many approaches developed for diverse application  
> domains. Many applications require data objects to be grouped by  
> similarity of content, tags, paths, structure and semantics.
> The clustering task in INEX 2009 evaluates unsupervised machine  
> learning in the context of XML information retrieval. This year we  
> are running a novel evaluation task using manual query assessments  
> from the INEX Ad Hoc track.  The clustering track will explicitly  
> test the Jardine and van Rijsbergen cluster hypothesis (1971), which  
> states that documents that cluster together have a similar relevance  
> to a given query. The task is to split the English Wikipedia
> collection, which is 60 gigabytes in size and holds around 2.7
> million documents in XML format, into disjoint clusters for
> collection selection.  If
> the cluster hypothesis holds true, and if suitable clustering can be  
> achieved, then a clustering solution will minimise the number of  
> clusters that need to be searched to satisfy any given query. There  
> are important practical reasons for performing collection selection  
> on a very large corpus. If only a small fraction of clusters (hence  
> documents) need to be searched, then the throughput of an  
> information retrieval system will be greatly improved.
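
A small sketch of the collection-selection idea described above may help: for one query, count the clusters that hold at least one relevant document and work out how much of the collection those clusters cover. The function names and the toy data are hypothetical, and this is not the official INEX measure, only an illustration of why fewer clusters to search means less work.

```python
from collections import Counter


def clusters_to_search(assignment, relevant_docs):
    """Return the set of clusters holding at least one relevant document."""
    return {assignment[doc] for doc in relevant_docs}


def fraction_of_collection_searched(assignment, relevant_docs):
    """Fraction of all documents that sit in the clusters we must search."""
    needed = clusters_to_search(assignment, relevant_docs)
    sizes = Counter(assignment.values())  # cluster id -> number of documents
    searched = sum(sizes[cluster] for cluster in needed)
    return searched / len(assignment)


# Toy data (hypothetical): six documents split into three disjoint clusters.
assignment = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 2, "d6": 2}
relevant = {"d1", "d2"}  # both relevant documents fall in cluster 0

print(clusters_to_search(assignment, relevant))
print(fraction_of_collection_searched(assignment, relevant))
```

If the relevant documents were instead spread over clusters 0 and 1, two thirds of the toy collection would have to be searched, which is the situation a good clustering is meant to avoid.
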
> The INEX XML Wikipedia collection is a marked-up version of the  
> Wikipedia documents.  The mark-up includes, for instance, explicit  
> tagging of named entities.  To enable participation with minimal
> data-preparation overhead, the collection has been pre-processed to
> provide various representations of the documents.  These include,
> for instance, a bag-of-words representation of terms and frequent
> phrases in a document, and frequencies of various XML structures in
> the form of trees, links, named entities, etc.  These collection
> representations will be released by the end of this month. In
> addition, the entire document collection is available in XML format
> and in text-only format if you wish to try different representation
> approaches. A subset of the INEX 2009 corpus containing about 50,000
> documents will also be provided for teams that are unable to cluster
> such a large data collection.
> The clustering solutions will be evaluated in two ways. Firstly,
> they will be evaluated using standard criteria such as purity,
> entropy and F-score to determine the quality of the clusters. These
> evaluation results will be provided online on an ongoing basis,
> along the same lines as Netflix, starting from mid-September.
> Secondly, the clustering solutions will be evaluated to determine
> the quality of the clusters relative to the optimal collection
> selection goal, given a set of queries.  Better
> clustering solutions in this context will tend to (on average) group  
> together relevant results for (previously unseen) ad-hoc queries.   
> Real Ad-hoc retrieval queries and their manual assessment results  
> will be utilised in this evaluation.  This novel approach evaluates  
> the clustering solutions relative to a very specific objective -  
> clustering a large document collection in an optimal manner in order  
> to satisfy queries while minimising the search space. Results of the
> second evaluation will be released at the INEX workshop in December.
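
For readers new to the standard criteria named above, here is a minimal sketch of purity and entropy computed from cluster assignments and document labels. It uses the common textbook definitions, which are an assumption here; the exact formulas used by the INEX organisers are not given in the announcement, and F-score is left out for brevity. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _label_counts(clusters, labels):
    """Map each cluster id to a Counter of the labels found in it."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return counts


def purity(clusters, labels):
    """Share of documents that carry their cluster's majority label."""
    counts = _label_counts(clusters, labels)
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted mean of per-cluster label entropy in bits (lower is better)."""
    counts = _label_counts(clusters, labels)
    n = len(labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * math.log2(k / size) for k in c.values())
        total += (size / n) * h
    return total


# Toy data (hypothetical): two clusters of three documents each.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]

print(purity(clusters, labels))
print(entropy(clusters, labels))
```

A clustering that puts every label in its own cluster gives purity 1 and entropy 0 under these definitions; mixed clusters pull purity down and entropy up.
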
> The clustering task in INEX 2009 brings together researchers from
> the Information Retrieval, Data Mining, Machine Learning and XML
> fields. It allows participants to evaluate clustering methods
> against a real use case and with significant volumes of data.  The
> task is designed to facilitate participation with minimal effort by
> providing not only raw data, but also pre-processed data which can  
> be easily used by existing clustering software.
> Dr Richi Nayak, School of Information Technology,
> Queensland University of Technology, Brisbane, QLD 4001
> Office: GP S537  Phone: 3138 1976
> Email:
> ************************************************
> This SIGIR-IRList message and the SIG-IRList Digest (a moderated IR  
> newsletter), are brought to you by SIGIR, distributed from the  
> University of Sheffield and edited by Raman Chandrasekar ( 
> ).
> o	To submit an article, e-mail
> o	To subscribe, send mail to , with the  
> subject: SUBSCRIBE irlist firstname lastname
> o	To unsubscribe, send mail to, with the  
> subject: UNSUBSCRIBE irlist email
> [The email address is required only if you want to unsubscribe with  
> an address other than the address with which you send the message]
> o	For more info, visit:
> o	Subscribe to a feed of these messages at
> These files are not to be sold or used for commercial purposes.

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
