lucene-openrelevance-dev mailing list archives

From Patrick Durusau <patr...@durusau.net>
Subject Corpus suggestion
Date Wed, 08 Jun 2011 09:09:12 GMT
Greetings!

When I stumbled across this project I read the background material and 
the notes about the difficulties with obtaining the materials from NIST 
(licensing issues).

I tracked down the NIST materials (the Tipster disks), which are 
described as follows:

> The documents in the test collection are varied in style, size and 
> subject domain. The first disk contains material from the Wall Street 
> Journal, 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_WSJsample> 
> (1986, 1987, 1988, 1989), the AP Newswire 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_APsample> 
> (1989), the Federal Register 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_FRsample> 
> (1989), information from Computer Select 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_CSsample> 
> disks (Ziff-Davis Publishing) and short abstracts from the Department 
> of Energy 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_DOEsample>. 
> The second disk contains information from the same sources, but from 
> different years. The third disk contains more information from the 
> Computer Select disks, plus material from the San Jose Mercury News 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3D_SJMercurysample> 
> (1991), more AP newswire (1990) and about 250 megabytes of formatted 
> U.S. Patents 
> <http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3D_USPatent>. The 
> format of all the documents is relatively clean and easy to use, with 
> SGML-like tags separating documents and document fields. There is no 
> part-of-speech tagging or breakdown into individual sentences or 
> paragraphs as the purpose of this collection is to test retrieval 
> against real-world data. 

But are those really representative of the documents encountered in a 
modern search context?

Considering the prevalence of email, for example, as compared to the 
Wall Street Journal (1986-1989), I suspect email archives should be a 
major part of any such corpus.

Thinking along those lines made me realize that the Apache Foundation 
already has:

1) Email list archives
2) Source code
3) Program documentation
4) Wikis
5) Webpages

all of which Apache participants have the expertise to judge for 
relevance. (Unlike some of the TREC collections, such as the tobacco 
settlement documents, which would require legal expertise.)

There are other text collections that could be used, but it occurred to 
me that starting close to home might avoid some of the licensing issues 
that were troublesome in the past.

Apologies if this has been discussed before but I was unable to find 
email archives for this project.

Hope everyone is having a great day!

Patrick

-- 
Patrick Durusau
patrick@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau

