lucene-java-user mailing list archives

From Cam Bazz <>
Subject Re: lucene link database
Date Sun, 08 Oct 2006 17:14:51 GMT
Dear Erick;

Thank you for your detailed insight. I have been trying to code a graph 
object database for some time.
I have prototyped on relational as well as object-oriented databases, 
including open source and commercial implementations
(so far, I have tried Hibernate, Objectivity/DB, and db4o). While object 
databases excel at traversing links, they are poor at searching.

Lucene so far solves the problem of searching. I think of a document 
as a list of tuples (a sequence of fields), and I can run searches with 
Lucene; it is really nice.

Now I have to solve the problem of linking. If I keep the nodes in a 
Lucene index, can fetch documents by a doc_id or some sort of 
surrogate identifier, and 
use those identifiers as node_id in an object graph, that will be what I 
want. But in order to do that I need to be able to query the Lucene 
index by document_id.
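
A minimal sketch of the lookup described above, written in plain Java rather than against the Lucene API: each document is stored under a surrogate identifier assigned by the application, and the object graph would hold only those identifiers. The names NodeStore, put, and get are hypothetical. With Lucene itself, the usual equivalent is to index the identifier as an untokenized field and look it up with an exact-term query, because Lucene's internal document numbers can change when segments are merged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: documents keyed by an application-assigned node id. */
public class NodeStore {
    // node_id -> document, where a document is an ordered map of field name -> value
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    /** Stores (or replaces) the document for the given surrogate id. */
    public void put(String nodeId, Map<String, String> fields) {
        docs.put(nodeId, new LinkedHashMap<>(fields));
    }

    /** Returns the document for the id, or null if no such node exists. */
    public Map<String, String> get(String nodeId) {
        return docs.get(nodeId);
    }
}
```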

I was referring to the link db of Nutch. They do have some sort of 
link db implementation that runs with Hadoop, but I have not understood 
the full code.
I am trying to understand the structure of this link database. I was 
thinking of using documents with src and dst fields that have document 
ids as values (one idea; I will try it tomorrow).
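
The src/dst idea can be sketched without Lucene as two inverted lists, one per field. This is only an illustration of the data layout, not Nutch's link db and not Lucene code; the class LinkDb and its methods are hypothetical, with method names borrowed from the incominglinksofdocument(id) and outgoinglinksofdocument(id) queries in the original question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one "link document" per edge, with src and dst fields. */
public class LinkDb {
    // Inverted lists: a field value -> the ids found at the other end of each edge.
    private final Map<String, List<String>> bySrc = new HashMap<>();
    private final Map<String, List<String>> byDst = new HashMap<>();

    /** Records a link from src to dst, as if indexing a document {src, dst}. */
    public void addLink(String src, String dst) {
        bySrc.computeIfAbsent(src, k -> new ArrayList<>()).add(dst);
        byDst.computeIfAbsent(dst, k -> new ArrayList<>()).add(src);
    }

    /** Comparable to searching src:id and reading the dst field of every hit. */
    public List<String> outgoingLinksOfDocument(String id) {
        return bySrc.getOrDefault(id, Collections.emptyList());
    }

    /** Comparable to searching dst:id and reading the src field of every hit. */
    public List<String> incomingLinksOfDocument(String id) {
        return byDst.getOrDefault(id, Collections.emptyList());
    }
}
```

Keeping one record per edge means a node can gain or lose links without rewriting the node's own document.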

Again thanks a bunch.

Best Regards,

Erick Erickson wrote:
> Approach it in whatever way you want as long as it solves your problem <G>.
> My first question is why use Lucene? Would a database suit your needs
> better? Of course, I can't say. Lucene shines at full-text searching, so
> it's a closer call if you aren't searching on parts of text. By that I mean
> that if you're not searching on *parts* of your links, you may want to
> consider a DB solution.
> That said, and if I understand your requirement, you have a pretty simple
> design. Each document has two fields, incominglinks and outgoinglinks. But
> see the note below. Lucene indexes what you give it, so the fact that some
> of the links aren't hypertext links is immaterial to Lucene. Since you
> control both the indexer and searcher, these conform to whatever your
> requirements are. It's up to you to map semantics onto these entities.
> One common trap for DB-savvy people is that they think of documents as
> entries in a table, all with the same fields. There is nothing requiring you
> to have the *same* fields in each document in an index. You could have an
> index in which no two documents share *any* common field if you choose.
> So, if you want to find out, say, which documents have link X as an
> incoming link, just search on incominglinks:X. If you wanted to find the
> documents that had any incoming links X, Y, Z that matched an outgoing link
> in another document, just search the OR of these in outgoinglinks.
> If you want some kind of map of the whole web of links, you'll have to write
> some iterative loop and keep track. There's nothing built in that I know of
> that lets you answer "Given link X, show me all the documents no more than 3
> hops away". Lucene is an *engine*, designed to have apps built on top of it.
> Lucene doesn't deal with relations between documents, just searching what
> you've indexed.
> It's easy enough to store a variable number of links in your incominglinks
> or outgoinglinks field. Just be sure they're tokenized appropriately. You
> can add them any way you choose, either concatenate them all into a big
> string and index that, or index them into the same field, e.g.
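> Document doc = new Document();
> // Lucene 2.0-style Field constructor: name, value, store, index
> doc.add(new Field("incoming", "link1", Field.Store.YES, Field.Index.TOKENIZED));
> doc.add(new Field("incoming", "link2", Field.Store.YES, Field.Index.TOKENIZED));
> .
> .
> .
> writer.addDocument(doc);
> According to a discussion from a while ago, this is the same as
> doc.add(new Field("incoming", "link1 link2", Field.Store.YES, Field.Index.TOKENIZED));
> in terms of how it all gets handled internally.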
> NOTE: I'm skipping most of the question of which Analyzer you use. This will
> almost surely trip you up sometime. I'd suggest starting with
> WhitespaceAnalyzer as that's more intuitive. Some of the other analyzers
> will break your links up in ways you don't expect. Really, really, really
> get a copy of Luke to see what's actually *in* your index and how searches
> work, and how the analyzer you choose changes what's searched for as well
> as what's indexed. Google "lucene luke" and you'll find it.
> Anyway, hope this all helps.
> Erick
> On 10/8/06, Cam Bazz <> wrote:
>> Hello,
>> I would like to make a link database using lucene. Similar to one that
>> nutch uses. I have read the basic documentation and understood how
>> document indexing, search, and scoring work. But what I would like is for
>> different documents to have different kinds of links (semantic links) to
>> each other. I would like to be able to search the database with queries like
>> incominglinksofdocument(id) and outgoinglinksofdocument(id). The links I am
>> talking about might not necessarily be hypertext links.
>> How would I approach a problem like this?
>> Best Regards,
>> -C.B.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
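
Erick's point that a question like "all the documents no more than 3 hops away" needs an iterative loop with bookkeeping can be sketched as a breadth-first expansion. This is a plain-Java illustration under stated assumptions: the outgoing-link lookup is modeled as an in-memory map, whereas an application built on Lucene would presumably run one search per node at that step. HopSearch and withinHops are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: breadth-first expansion over outgoing links. */
public class HopSearch {
    /** Returns every node reachable from start in at most maxHops link hops. */
    public static Set<String> withinHops(Map<String, List<String>> outgoing,
                                         String start, int maxHops) {
        Set<String> seen = new LinkedHashSet<>();   // the "keep track" part
        seen.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                // Assumed lookup: with Lucene this would be a search on the link field.
                for (String target : outgoing.getOrDefault(node, Collections.emptyList())) {
                    if (seen.add(target)) {
                        next.add(target);           // first visit: expand it next round
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start);                          // report neighbors only
        return seen;
    }
}
```

The seen set is what keeps cycles in the link graph from causing repeated lookups.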
