lucene-solr-user mailing list archives

From Diego Pino <dp...@krayon.cl>
Subject Re: How to index data from multiple data source
Date Wed, 21 Jan 2015 16:06:48 GMT
Hi Yusniel,

Solr manages documents as a whole, so updating an existing document means replacing it.
You could therefore index the metadata and the full text in one step, as one Solr document under one
unique ID. That would be the simplest case. You could also use nested child documents with
block joins (depending on which version of Solr you are using; more info here: http://blog.griddynamics.com/2013/09/solr-block-join-support.html),
but in my opinion this would be overkill. We also mimic a kind of "semantic / linked data"
model using additional fields (named after real ontology predicate/property names) to join documents
that are related; see https://wiki.apache.org/solr/Join. So you could add the full text as
an additional document with its own ID and fill a field of that document with the ID of the
parent metadata document. Then at query time you can join them. Joins in Solr always return
only the joined (TO) documents, not both sides (it's not like a SQL join, more like an inner query),
so we experimented with self joins (the field holding the parent document's ID also holds the
document's own ID), but as you can understand this is in no way optimal.
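
To make the two options above a bit more concrete, here is a rough sketch in Python. It only builds the JSON body you would POST to Solr's /update handler and the parameters for a join query; it does not talk to a Solr server. The field names (fulltext, parent_id), the sample ID and the core name are assumptions made up for illustration, so adapt them to whatever your schema.xml declares.

```python
import json
from urllib.parse import urlencode


def build_combined_doc(doc_id, metadata, fulltext):
    """Option 1: merge DB metadata and extracted PDF text into ONE Solr document.

    The field names used here are hypothetical and must exist in schema.xml.
    """
    doc = dict(metadata)
    doc["fulltext"] = fulltext
    doc["id"] = doc_id  # the shared unique ID always wins over any metadata key
    return doc


def build_update_payload(docs):
    """JSON body (an array of documents) for a POST to Solr's /update handler."""
    return json.dumps(docs)


def build_join_params(term, from_field="parent_id", to_field="id",
                      text_field="fulltext"):
    """Option 2: full text stored as a separate document pointing at its parent.

    Uses Solr's {!join} query parser. The result contains only the documents
    on the TO side (the parent metadata documents), not the full-text ones.
    """
    query = "{!join from=%s to=%s}%s:%s" % (from_field, to_field, text_field, term)
    return {"q": query}


if __name__ == "__main__":
    doc = build_combined_doc(
        "paper-42",  # hypothetical ID, same as the primary key in the database
        {"title": "An example title", "authors": ["A. Author"]},
        "Full text extracted from the PDF ...",
    )
    print(build_update_payload([doc]))

    # "mycore" is a placeholder core name.
    print("/solr/mycore/select?" + urlencode(build_join_params("semantic")))
```

With the first option a single request carries everything, so there is nothing to keep in sync. With the second, every full-text document needs its parent_id filled in at index time, and a query only ever gives you back the parent side.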

Related: We are using a digital object repository (Fedora Commons + Islandora) to achieve
exactly what you want to do. Our PDF files, along with many other types of data and metadata,
are ingested as objects inside the repository, including technical metadata, MODS, DC, the binary
stream and the full text. Then the whole object (as FOXML) goes through an XSLT transformation
and into Solr. If you are interested you can browse Islandora's Google group: https://groups.google.com/forum/#!forum/islandora
and visit Islandora's wiki: https://wiki.duraspace.org/display/ISLANDORA714/Islandora. There
is plenty of documentation under the fedoragsearch module, which does the actual indexing. You can
see our schemas and Solr config there.

Feel free to write to me if you need or want more information.

Cheers

Diego Pino Navarro
Krayon Media
Pedro de Valdivia 575
Pucón - Chile
F:+56-45-2442469




On Jan 21, 2015, at 2:43 AM, Yusniel Hidalgo Delgado <yhdelgado@uci.cu> wrote:

> 
> 
> Dear Solr community, 
> 
> 
> 
> 
> I have recently started diving into Solr and I need help with the following usage scenario. I am working
> on a project to extract and search bibliographic metadata from PDF files. First, my PDF
> files are processed to extract bibliographic metadata such as title, authors, affiliations,
> keywords and abstract. This metadata is stored in a relational database and then indexed
> in Solr via DIH. However, I also need to index the full text of each PDF and keep the same ID
> between the indexed metadata and the indexed full text in the Solr index. How can I do that? How should I
> configure solrconfig.xml and schema.xml to do it?
> 
> 
> 
> 
> Thanks in advance. 
> 
> 
> 
> 
> Best regards 
> 
> Yusniel Hidalgo Delgado 
> Semantic Web Research Group 
> University of Informatics Sciences 
> http://gws-uci.blogspot.com/ 
> Havana, Cuba 
> 
> 
> 
> 
> ---------------------------------------------------
> 12th anniversary of the founding of the University of Informatics Sciences. 12 years
> of history alongside Fidel. December 12, 2014.

