lucene-solr-user mailing list archives

From Dave Searle <dave.sea...@magicalia.com>
Subject Re: faceted search with job title
Date Wed, 21 Jul 2010 23:32:27 GMT
You could grab your XPath rules from a db too. This is what I did for a price-scraping app
I wrote a while ago: new sites were added with a set of rules through a web UI. You could
certainly use regex, of course, but IMO that's more complex than writing a simple XPath.
Using JavaScript or some DOM traversal code, you could quite easily create a point-and-click
tool to generate rules simply and quickly.
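
To make the rules-in-a-db idea concrete, here is a minimal Python sketch. It is not the
original app's code: the table layout, the site names, the XPath rules and the helper names
(make_rule_store, extract_title) are all made up for illustration. It also assumes the page
has already been cleaned into well-formed markup (for example by a tolerant parser such as
TagSoup), and it relies on the limited XPath subset in Python's standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_rule_store():
    """Build an in-memory rule table: one XPath rule per site (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules (site TEXT PRIMARY KEY, title_xpath TEXT NOT NULL)"
    )
    # Illustrative rules only; in practice these rows would be added through a web UI.
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?)",
        [
            ("jobs.example.com", ".//h2[@class='job-title']"),
            ("careers.example.org", ".//div[@id='posting']/span"),
        ],
    )
    return conn


def extract_title(conn, site, page):
    """Look up the rule for `site` and apply it to `page` (well-formed markup).

    Returns the matched text, or None if the site has no rule or nothing matches.
    """
    row = conn.execute(
        "SELECT title_xpath FROM rules WHERE site = ?", (site,)
    ).fetchone()
    if row is None:
        return None
    root = ET.fromstring(page)
    node = root.find(row[0])
    if node is None:
        return None
    # Join nested text too, in case the title contains inline tags.
    return "".join(node.itertext()).strip()


conn = make_rule_store()
page = (
    "<html><body><h1>Acme</h1>"
    "<h2 class='job-title'> Senior Java Developer </h2></body></html>"
)
title = extract_title(conn, "jobs.example.com", page)
```

Adding a new site then means inserting one row into the rules table, with no code change
or recompile.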

On 21 Jul 2010, at 23:10, Savannah Beckett <savannah_beckett30@yahoo.com> wrote:

> And I will have to recompile the dom or sax code each time I add a job board for 
> crawling.  A regex pattern is only a string, which can be stored in a text file or 
> db and retrieved based on the job board.  What do you think?
> 
> 
> 
> 
> ________________________________
> From: "Nagelberg, Kallin" <KNagelberg@globeandmail.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, July 21, 2010 10:39:32 AM
> Subject: RE: faceted search with job title
> 
> Yeah, you should definitely just set up a custom parser for each site. It should be 
> easy to extract the title using groovy's xml parsing along with tagsoup for sloppy 
> html. If you can't find the pattern for each site leading to the job title, how 
> can you expect solr to? Humans have the advantage here :P
> 
> -Kallin Nagelberg
> 
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com] 
> Sent: Wednesday, July 21, 2010 12:20 PM
> To: solr-user@lucene.apache.org
> Cc: dave.searle@magicalia.com
> Subject: Re: faceted search with job title
> 
> mmm...there must be a better way...each job board has a different format.  If there 
> are constantly new job boards being crawled, I don't think I can manually look 
> for the specific sequence of tags that leads to the job title.  Most of them don't even 
> have a class or id.  There is no guarantee that the job title will be in the title 
> tag or a header tag.  Something else can be in the title.  Should I do this in a 
> class that extends IndexFilter in Nutch?
> Thanks. 
> 
> 
> 
> 
> ________________________________
> From: Dave Searle <dave.searle@magicalia.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Wed, July 21, 2010 8:42:55 AM
> Subject: RE: faceted search with job title
> 
> You'd probably need to do some post processing on the pages and set up rules for 
> each website to grab that specific bit of data. You could load the html into an 
> xml parser, then use xpath to grab content from a particular tag with a class or 
> id, based on the particular website
> 
> 
> 
> -----Original Message-----
> From: Savannah Beckett [mailto:savannah_beckett30@yahoo.com] 
> Sent: 21 July 2010 16:38
> To: solr-user@lucene.apache.org
> Subject: faceted search with job title
> 
> Hi,
>   I am currently using nutch to crawl some job pages from job boards.  They are 
> in my solr index now.  I want to do faceted search with the job titles.  How?  
> The job titles can be anywhere on the page, e.g. title, header, 
> content...   If I use indexfilter in Nutch to search the content for job titles, 
> there are hundreds of thousands of job titles; I can't hard code them all.  Do 
> you have a better idea?  I think I need the job title in a separate field in the 
> index to make it work with solr faceted search, am I right?
> Thanks.
> 
> 
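
On the faceting question in the original post: once the extracted title is stored in its
own field, a facet request on that field returns the counts. A minimal Python sketch of
building such a request follows. The field name job_title, the base URL and the helper
name facet_query_url are assumptions for illustration; the field would normally be an
untokenized (string) type so that whole titles are counted rather than individual words.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, field, limit=20):
    """Build a Solr select URL that returns only facet counts for `field`."""
    params = {
        "q": "*:*",            # match every document
        "rows": 0,             # no documents needed, only the counts
        "facet": "true",
        "facet.field": field,  # hypothetical field holding the extracted title
        "facet.limit": limit,
        "facet.mincount": 1,   # skip titles with no matching documents
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)


# Assumed local Solr address; adjust to the actual host and core path.
url = facet_query_url("http://localhost:8983/solr", "job_title")
```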
