lucene-solr-user mailing list archives

From "Tim Archambault" <tarchamba...@bangordailynews.net>
Subject Re: Simple Faceted Searching out of the box
Date Sat, 23 Sep 2006 00:47:40 GMT
Amen Hoss. I appreciated you explaining in terms of what I can understand,
"jobs." Makes it easier for me to learn.

What you are saying is right-on with what I'm trying to understand. Right
now I have simple Lucene indexes that are basically re-created once daily,
and that simply isn't doing the job for about 30% of my content.

I'm learning a framework called Model-Glue Unity that uses Reactor which is
an ORM. I'll have to think of how I might be able to make that work.  But as
you say, not all relationships are equal.

For indexing news articles for instance, I want the article, all reader
comments, photos, links, multimedia files associated with the article to be
indexed together as one entity so that if Chris Hostetter commented on the
"high cost of heating oil in Maine" article, I can find the article by
searching on your name, etc....

Have a great weekend and thanks for all the help.

Tim



On 9/22/06, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
>
> : I've been talking with other papers about Solr and I think what bothers
> : many is that there is a deposit of information in a structured database
> : here [named A], then we have another set of basically the same data over
> : here [named B] and they don't understand why they have to manage two
> : different sets of data [A & B] that are virtually the same thing.  Many
> : foresee a
>
> The big issue is that while "SQL Schemas" may be fairly consistent, uses
> of those schemas can be very different ... there is no clear cut way to
> look at an arbitrary schema and know how far down a chain of foreign key
> relationships you should go and still consider the data you find relevant
> to the item you started with (from a search perspective) ... ORM tools
> tend to get around this by Lazy-Loading .. if your front end application
> starts with a single jobPostId and then asks for the name of the city it's
> mapped to, or the name of the company it's mapped to it will dynamically
> fetch the "Company" object from the company table, or maybe it will only
> fetch the single companyName field ... but when building a search index
> you can't get that lazy evaluation -- you have to proactively fetch that
> data in advance, which means you have to know in advance how far down the
> rabbit hole you want to go.
>
> not all relationships are equal either: you might have a "Skills" table
> and a many-to-many relationship between JobPosting and skills, with a
> "mappingType" on the mapping indicating which skills are required and
> which are just desirable -- those should probably go in separate fields of
> your index, but some code somewhere needs to know that.
>
> once you've solved that problem, once you've got a function that you can
> point at your DB, give it a primary key and get back a "flattened" view of
> the data that can represent your "Solr/Lucene Document" you're 80% done
> ... the problem is that 80% isn't a generically solvable problem ... there
> aren't simple rules you can apply to any DB schema to drive that function.
>
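
A minimal sketch of the "flattening" function described above, assuming a
hypothetical job-posting schema. The table names, column names, and output
field names are all invented for illustration and come from neither the
thread nor Solr. It follows exactly one hop to the company table and splits
skills into two fields based on the mapping type; how far to follow foreign
keys is the schema-specific decision the message says cannot be made
generically.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database using a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE job_posting (id INTEGER PRIMARY KEY, title TEXT,
                                  company_id INTEGER);
        CREATE TABLE skill (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE job_skill (job_id INTEGER, skill_id INTEGER,
                                mapping_type TEXT);
        INSERT INTO company VALUES (1, 'Acme');
        INSERT INTO job_posting VALUES (1, 'Search Engineer', 1);
        INSERT INTO skill VALUES (1, 'Java'), (2, 'Lucene');
        INSERT INTO job_skill VALUES (1, 1, 'required'), (1, 2, 'desired');
    """)
    return conn


def flatten_job_posting(conn, job_id):
    """Return one flat dict for a job posting, or None if it is missing."""
    row = conn.execute(
        "SELECT j.id, j.title, c.name FROM job_posting j "
        "JOIN company c ON c.id = j.company_id WHERE j.id = ?",
        (job_id,),
    ).fetchone()
    if row is None:
        return None

    doc = {
        "id": row[0],
        "title": row[1],
        "company_name": row[2],   # one hop down the foreign-key chain
        "required_skills": [],    # separate fields per mapping type
        "desired_skills": [],
    }
    skills = conn.execute(
        "SELECT s.name, m.mapping_type FROM job_skill m "
        "JOIN skill s ON s.id = m.skill_id "
        "WHERE m.job_id = ? ORDER BY s.name",
        (job_id,),
    )
    for name, mapping_type in skills:
        if mapping_type == "required":
            doc["required_skills"].append(name)
        else:
            doc["desired_skills"].append(name)
    return doc
```

Calling flatten_job_posting(make_demo_db(), 1) yields a single dict that
could then be serialized into whatever document format the indexer expects.
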
> Even the last 20% isn't really generic; knowing when to re-index a
> particular "document" ... the needs of a system where individual people
> update JobPostings one at a time is very different from a system where
> JobPostings are bulk imported thousands at a time ... it's hard to write a
> useful "indexer" that can function efficiently in both cases.  Even in
> the first case, dealing with individual document updates where the primary
> JobPosting data changes is only the "common" problem, there are still the
> "less-common" situations where a Company name changes and *all* of the
> associated Job Postings need to be reindexed ... for small indexes it might be
> worthwhile to just rebuild the index from scratch, for bigger indexes you
> might need a more complex solution for dealing with this situation.
>
> The advice i give people at CNET when they need to build a Solr index is:
>
> 1) start by deciding what the minimum "freshness" is for your data ... ie:
> what is the absolute longest you can live with needing to wait for data to
> be added/deleted/updated in your Solr index once it's been
> added/deleted/modified in your DB.
>
> 2) write a function that can generate a Solr Document from an instance of
> your data (be it a bean, a DB row, whatever you've got)
>
> 3) write a simple wrapper program that iterates over all of your data, and
> calls the function from #2
>
>
> If #3 takes less time to run than #1 - cron it to rebuild the index from
> scratch over and over again and use snapshooter and snappuller to expose
> it to the world ... if #3 takes longer than #1, then look at ways to more
> systematically decide which docs should be updated, and how.
>
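
The three steps and the final comparison can be sketched roughly as
follows. This is an illustration only: choose_strategy, its parameters, and
the strategy labels are hypothetical names, and the code neither talks to
Solr nor runs snapshooter or snappuller. It times one full pass of the
document-building function over all records and compares that against the
freshness limit from step 1.

```python
import time


def choose_strategy(records, to_doc, max_staleness_seconds,
                    clock=time.monotonic):
    """Time a full rebuild pass and compare it to the freshness limit.

    records: iterable of source rows or beans (step 3 iterates over these)
    to_doc: function building one index document from a record (step 2)
    max_staleness_seconds: longest acceptable delay for changes (step 1)
    clock: injectable time source, so the timing can be faked in tests
    """
    start = clock()
    docs = [to_doc(record) for record in records]
    elapsed = clock() - start

    if elapsed < max_staleness_seconds:
        # The whole rebuild fits inside the freshness window, so simply
        # rebuilding from scratch on a schedule is enough.
        strategy = "full_rebuild"
    else:
        # The rebuild is too slow; some form of targeted update is needed.
        strategy = "incremental"
    return strategy, docs, elapsed
```

In practice the timing should come from a realistic run against the full
data set, including the time spent actually writing the index, which this
sketch leaves out.
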
>
>
> -Hoss
>
>
