lucene-solr-user mailing list archives

From Savannah Beckett <savannah_becket...@yahoo.com>
Subject Re: faceted search with job title
Date Wed, 21 Jul 2010 22:08:57 GMT
And I will have to recompile the DOM or SAX code each time I add a job board for 
crawling.  A regex pattern is only a string, which can be stored in a text file or 
database and retrieved based on the job board.  What do you think?
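
A minimal sketch of the regex-per-board idea in Python, offered as an illustration rather than a tested crawler component. The host names, the patterns, and the `extract_title` helper are all hypothetical; in practice the patterns would be read from the text file or database mentioned above instead of a literal dict.

```python
import re
from typing import Optional

# Hypothetical per-board patterns (assumption: one capture group holds the
# job title). These would normally be loaded from a file or a db table.
TITLE_PATTERNS = {
    "jobs.example.com": r'<h1 class="job-title">\s*(.*?)\s*</h1>',
    "careers.example.org": r"<title>\s*(.*?)\s*\|",
}


def extract_title(board: str, html: str) -> Optional[str]:
    """Return the job title for a page from `board`, or None if unknown."""
    pattern = TITLE_PATTERNS.get(board)
    if pattern is None:
        # No rule has been configured for this board yet.
        return None
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None
```

Adding a job board then means adding one entry to the pattern store, with no code change. The trade-off is that a regex is brittle against markup changes, so a page that stops matching should probably be logged for review.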




________________________________
From: "Nagelberg, Kallin" <KNagelberg@globeandmail.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah, you should definitely just set up a custom parser for each site. It should be 
easy to extract the title using Groovy's XML parsing along with TagSoup for sloppy 
HTML. If you can't find the pattern for each site leading to the job title, how 
can you expect Solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com] 
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.searle@magicalia.com
Subject: Re: faceted search with job title

Hmm... there must be a better way. Each job board has a different format.  If there 
are constantly new job boards being crawled, I don't think I can manually look 
for the specific sequence of tags that leads to the job title.  Most of them don't even 
have a class or id.  There is no guarantee that the job title will be in the title 
tag or a header tag.  Something else can be in the title.  Should I do this in a 
class that extends IndexFilter in Nutch?
Thanks. 




________________________________
From: Dave Searle <dave.searle@magicalia.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules for 
each website to grab that specific bit of data. You could load the html into an 
xml parser, then use xpath to grab content from a particular tag with a class or 
id, based on the particular website
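
The per-website XPath rule described above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a drop-in solution: the site names, the XPath expressions, and the `extract_title_by_xpath` helper are hypothetical, and the standard-library parser used here only accepts well-formed, namespace-free markup, so real job pages would first need a tolerant HTML parser to clean them up.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical per-site rules: each site maps to the XPath of the element
# that holds the job title. ElementTree supports only a subset of XPath.
SITE_XPATHS = {
    "jobs.example.com": ".//h1[@class='job-title']",
    "careers.example.org": ".//div[@id='posting']/h2",
}


def extract_title_by_xpath(site: str, xhtml: str) -> Optional[str]:
    """Return the text of the title element for `site`, or None."""
    xpath = SITE_XPATHS.get(site)
    if xpath is None:
        return None  # no rule configured for this site
    root = ET.fromstring(xhtml)  # raises ParseError on malformed markup
    node = root.find(xpath)
    if node is None:
        return None
    # Join nested text (e.g. <h2>Data <b>Analyst</b></h2>) into one string.
    return "".join(node.itertext()).strip()
```

Because the rules are plain strings keyed by site, they can live in a config file and be extended without touching the parsing code.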



-----Original Message-----
From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com] 
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using Nutch to crawl some job pages from job boards.  They are 
in my Solr index now.  I want to do faceted search on the job titles.  How?  
The job titles can be in any location on the page, e.g. title, header, 
content...   If I use an IndexFilter in Nutch to search the content for the job title, 
there are hundreds of thousands of job titles, and I can't hard code them all.  Do 
you have a better idea?  I think I need the job title in a separate field in the 
index to make it work with Solr faceted search, am I right?
Thanks.
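
For reference, once the title is in its own field, a facet request over it might look like the sketch below. The field name `job_title`, the base URL, and the `facet_query_url` helper are assumptions for illustration; `facet`, `facet.field`, `facet.limit`, and `facet.mincount` are standard Solr request parameters. The field would typically need to be an untokenized string type so that each whole title counts as one facet value rather than being split into words.

```python
from urllib.parse import urlencode


def facet_query_url(solr_base: str, field: str, limit: int = 20) -> str:
    """Build a Solr select URL that returns only facet counts for `field`."""
    params = {
        "q": "*:*",            # match every document
        "rows": 0,             # no documents needed, only facet counts
        "facet": "true",
        "facet.field": field,  # hypothetical dedicated job title field
        "facet.limit": limit,
        "facet.mincount": 1,   # hide titles with zero matching documents
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"


# Assumed local Solr instance; adjust the base URL to the real deployment.
url = facet_query_url("http://localhost:8983/solr", "job_title")
```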


      