lucene-openrelevance-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Corpus suggestion
Date Thu, 09 Jun 2011 12:10:10 GMT

On Jun 8, 2011, at 5:09 AM, Patrick Durusau wrote:

> Greetings!
> 
> When I stumbled across this project I read the background material and the notes about
> the difficulties with obtaining the materials from NIST (licensing issues).
> 
> I tracked down the NIST (Tipster disks) materials which were described as follows:
> 
>> The documents in the test collection are varied in style, size and subject domain.
>> The first disk contains material from the Wall Street Journal, <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_WSJsample>
>> (1986, 1987, 1988, 1989), the AP Newswire <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_APsample>
>> (1989), the Federal Register <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_FRsample>
>> (1989), information from Computer Select <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_CSsample>
>> disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_DOEsample>.
>> The second disk contains information from the same sources, but from different years. The
>> third disk contains more information from the Computer Select disks, plus material from the
>> San Jose Mercury News <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3D_SJMercurysample>
>> (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3D_USPatent>.
>> The format of all the documents is relatively clean and easy to use, with SGML-like tags separating
>> documents and document fields. There is no part-of-speech tagging or breakdown into individual
>> sentences or paragraphs as the purpose of this collection is to test retrieval against real-world
>> data.
> 
> But are those really representative of all the documents that are encountered in a modern
> searching context?

No.  Well, the newswire collections probably still resemble current newswire, and the Federal
Register material probably still resembles today's Federal Register, albeit w/ updated language.

> 
> Considering the prevalence of email, for example, as compared to the Wall Street Journal
> (1986-1989), I suspect email archives should be a major part of any such corpus.
> 
> Thinking along those lines made me realize that the Apache Foundation already has:
> 
> 1) Email list archives

Yep, I have these posted up on S3.  http://asf-mail-archives.s3-website-us-east-1.amazonaws.com/

> 2) Source code
> 3) Program documentation
> 4) Wikis
> 5) Webpages
> 
> all of which fall within the expertise of Apache participants to judge relevance. (Unlike
> some of the TREC collections, such as the tobacco settlement documents which would require
> legal expertise.)
> 
> There are other text collections that could be used but it occurred to me that starting
> close to home might avoid some of the licensing issues that were troublesome in the past.

Definitely.  What we need is a way of gathering judgments as well as collecting queries, etc.

I think we should also take the public NIST ones and host all of them here as well, along
w/ judgments and queries so that it all just works seamlessly.

> 
> Apologies if this has been discussed before but I was unable to find email archives for
> this project.

http://www.lucidimagination.com/search/?q=#/p:openrelevance

> 
> Hope everyone is having a great day!
> 
> Patrick
> 
> -- 
> Patrick Durusau
> patrick@durusau.net
> Chair, V1 - US TAG to JTC 1/SC 34
> Convener, JTC 1/SC 34/WG 3 (Topic Maps)
> Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
> Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
> 
> Another Word For It (blog): http://tm.durusau.net
> Homepage: http://www.durusau.net
> Twitter: patrickDurusau
> 

--------------------------
Grant Ingersoll



