lucene-java-user mailing list archives

From chris chisolm <>
Subject data extraction architecture
Date Thu, 23 Feb 2012 08:55:22 GMT
I'm relatively new to this field and I have a problem that seems to be
solvable in lots of different ways, and I'm looking for some recommendations
on how to approach a data refining pipeline.

I'm not sure where to look for this type of architecture description.  The best
I've found so far have been some of the talks on the site.

I have a project where I was tentatively planning to use Nutch to crawl a
sizable set of sites (20,000 to 50,000), extracting maybe 1 million documents
in total.

My big problem is: where should I do most of my document processing?  There
are hooks in Nutch, Lucene and Solr to parse or modify documents.  A large
number of these documents will have semantically similar information, but it's
in 100s if not 1000s of different formats.  I want to get as much of the data
as I can into fields so I will be able to do faceted searches.  The product
pages I'm parsing will mostly have 10-20 or so common pieces of data that I'd
like to gather.
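
To make the field-extraction goal concrete, here is a minimal, purely
illustrative sketch (plain Python, independent of Nutch/Solr) of mapping
semi-structured page text onto a fixed schema of facet fields.  The field
names and regex patterns are hypothetical examples, not part of any real
product schema:

```python
import re

# Hypothetical facet schema: a few of the 10-20 common fields.
FIELD_PATTERNS = {
    "price":  re.compile(r"(?:price|cost)\s*[:=]?\s*\$?(\d+(?:\.\d{2})?)", re.I),
    "weight": re.compile(r"(\d+(?:\.\d+)?)\s*(?:kg|lbs?)", re.I),
    "color":  re.compile(r"colou?r\s*[:=]?\s*(\w+)", re.I),
}

def extract_fields(text):
    """Map semi-structured page text onto the fixed field schema."""
    doc = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            doc[field] = m.group(1)
    return doc

page = "Widget X. Color: red. Price: $19.99. Ships at 2.5 kg."
print(extract_fields(page))
# → {'price': '19.99', 'weight': '2.5', 'color': 'red'}
```

Whatever this extraction step produces would then be written into the
per-field index (e.g. Solr fields) so the facets line up across formats.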

I will have various processing goals:
  - language detection
  - parsing specific data fields that may be represented in many different
    formats
  - product description parsing, which I can likely recognize by vocabulary
    with a naive Bayes filter
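
For the last goal, a vocabulary-based naive Bayes filter can be quite small.
This is a toy sketch in plain Python (a real pipeline would more likely use an
existing library such as Mahout or Weka on the Java side); the training
sentences and labels are invented for illustration:

```python
import math
from collections import Counter

class NaiveBayes:
    """Tiny multinomial naive Bayes classifier over word counts."""
    def __init__(self):
        self.word_counts = {}        # label -> Counter of words
        self.doc_counts = Counter()  # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + per-word log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("soft cotton shirt available in three colors", "description")
nb.train("click here to subscribe to our newsletter", "other")
print(nb.classify("a durable shirt in many colors"))  # → description
```

Trained on a sample of hand-labeled snippets, this kind of filter can flag
which blocks of a page look like product descriptions before field parsing.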

Given the many different formats, I expect that I'll take an iterative approach
to the parsing problem.  So I'll likely crawl the sites once and successively
refine my approach.

For the data fields I plan to run each document through multiple parsers and
score the results based on the number of fields parsed and the consistency of
the data in those fields.
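
That "run several parsers, keep the best-scoring result" step might look
something like this sketch.  The two parsers and the consistency check are
stand-ins invented for illustration; only the scoring idea (fields parsed plus
a consistency bonus) comes from the plan above:

```python
# Hypothetical parsers: each returns a dict of extracted fields (or {}).
def parser_colon(text):
    """Parse 'key: value' lines."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().lower()] = value.strip()
    return fields

def parser_equals(text):
    """Parse comma-separated 'key=value' pairs."""
    fields = {}
    for token in text.replace("\n", " ").split(","):
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key.strip().lower()] = value.strip()
    return fields

def consistent(fields):
    """Toy consistency check: every extracted value is non-empty."""
    return all(v for v in fields.values())

def best_parse(text, parsers):
    # Score = number of fields parsed, plus a bonus for consistent values.
    def score(fields):
        return len(fields) + (1 if fields and consistent(fields) else 0)
    results = [(score(p(text)), p(text)) for p in parsers]
    return max(results, key=lambda r: r[0])

text = "price: 19.99\ncolor: red\nweight: 2.5 kg"
score, fields = best_parse(text, [parser_colon, parser_equals])
print(score, fields)
# → 4 {'price': '19.99', 'color': 'red', 'weight': '2.5 kg'}
```

The winning parse per document could then be handed to the indexing step,
with low-scoring documents queued for the next parsing iteration.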

So where should I consider hanging my code in this document processing
pipeline?

thanks for any suggestions!
