lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: MySQL data import
Date Mon, 12 Dec 2011 22:31:05 GMT
Here's a quick demo I wrote at one point. I haven't run it in a while,
but you should be able to get the idea.


package jdbc;


import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;


public class Indexer {
  public static void main(String[] args) {
    startIndex("http://localhost:8983/solr");
  }

  private static void startIndex(String url) {
    Connection con = DataSource.getConnection();
    try {

      long start = System.currentTimeMillis();
      // Create a multi-threaded communications channel to the Solr
server. Full interface (3.3) at:
      // http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
      StreamingUpdateSolrServer server = new
StreamingUpdateSolrServer(url, 10, 4);

      // You may want to set these timeouts higer, Solr occasionally
will have long pauses while
      // segments merge.
      server.setSoTimeout(1000);  // socket read timeout
      server.setConnectionTimeout(100);
      //server.setDefaultMaxConnectionsPerHost(100);
      //server.setMaxTotalConnections(100);
      //server.setFollowRedirects(false);  // defaults to false
      // allowCompression defaults to false.
      // Server side must support gzip or deflate for this to have any effect.
      //server.setAllowCompression(true);
      server.setMaxRetries(1); // defaults to 0.  > 1 not recommended.
      server.setParser(new XMLResponseParser()); // binary parser is
used by default

      doDocuments(server, con);
      server.commit(); // Only needs to be done at the end, autocommit
or commitWithin should
      // do the rest.
      long endTime = System.currentTimeMillis();
      System.out.println("Total Time Taken->" + (endTime - start) + " mils");

    } catch (Exception e) {
      e.printStackTrace();
      String msg = e.getMessage();
      System.out.println(msg);
    }
  }

  private static void doDocuments(StreamingUpdateSolrServer server,
Connection con) throws SQLException, IOException, SolrServerException
{

    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("select id,title,text from test");

    // SolrInputDocument interface (3.3) at
    // http://lucene.apache.org/solr/api/org/apache/solr/common/SolrInputDocument.html
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    int total = 0;
    int counter = 0;

    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument(); // DO NOT move
this outside the while loop
      // or be sure to call doc.clear()

      String id = rs.getString("id");
      String title = rs.getString("title");
      String text = rs.getString("text");

      doc.addField("id", id);
      doc.addField("title", title);
      doc.addField("text", text);

      docs.add(doc);
      ++counter;
      ++total;
      if (counter > 1000) { // Completely arbitrary, just batch up
more than one document for throughput!
        server.add(docs);
        docs.clear();
        counter = 0;
      }
    }
    System.out.println("Total " + total + " Docs added succesfully");

  }
}

// Trivial class showing connecting to a MySql database server via jdbc...
class DataSource {
  public static Connection getConnection() {
    Connection conn = null;
    try {

      Class.forName("com.mysql.jdbc.Driver").newInstance();
      System.out.println("Driver Loaded......");
      conn = DriverManager.getConnection("jdbc:mysql://172.16.0.169:3306/test?"
+ "user=testuser&password=test123");
      System.out.println("Connection build......");
    } catch (Exception ex) {
      System.out.println(ex);
    }
    return conn;
  }

  public static void closeConnection(Connection con) {
    try {
      if (con != null)
        con.close();
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
}

On Mon, Dec 12, 2011 at 2:57 PM, Brian Lamb
<brian.lamb@journalexperts.com> wrote:
> Thanks all. Erick, is there documentation on doing things with SolrJ and a
> JDBC connection?
>
> On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson <erickerickson@gmail.com>wrote:
>
>> You might want to consider just doing the whole
>> thing in SolrJ with a JDBC connection. When things
>> get complex, it's sometimes more straightforward.
>>
>> Best
>> Erick...
>>
>> P.S. Yes, it's pretty standard to have a single
>> field be the destination for several copyField
>> directives.
>>
>> On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty <gora@mimirtech.com> wrote:
>> > On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
>> > <brian.lamb@journalexperts.com> wrote:
>> >> Hi all,
>> >>
>> >> I have a few questions about how the MySQL data import works. It seems
>> it
>> >> creates a separate connection for each entity I create. Is there any
>> way to
>> >> avoid this?
>> >
>> > Not sure, but I do not think that it is possible. However, from your
>> description
>> > below, I think that you are unnecessarily multiplying entities.
>> >
>> >> By nature of my schema, I have several multivalued fields. Each one I
>> >> populate with a separate entity. Is there a better way to do it? For
>> >> example, could I pull in all the singular data in one sitting and then
>> come
>> >> back in later and populate with the multivalued items.
>> >
>> > Not quite sure as to what you mean. Would it be possible for you
>> > to post your schema.xml, and the DIH configuration file? Preferably,
>> > put these on pastebin.com, and send us links. Also, you should
>> > obfuscate details like access passwords.
>> >
>> >> An alternate approach in some cases would be to do a GROUP_CONCAT and
>> then
>> >> populate the multivalued column with some transformation. Is that
>> possible?
>> > [...]
>> >
>> > This is how we have been handling it. A complete description would
>> > be long, but here is the gist of it:
>> > * A transformer will be needed. In this case, we found it easiest
>> >  to use a Java-based transformer. Thus, your entity should include
>> >  something like
>> >  <entity name="myname" dataSource="mysource"
>> > transformer="com.mycompany.search.solr.handler.JobsNumericTransformer...>
>> >  ...
>> >  </entity>
>> >  Here, the class name to be used for the transformer attribute follows
>> >  the usual Java rules, and the .jar needs to be made available to Solr.
>> > * The SELECT statement for the entity looks something like
>> >  select group_concat( myfield SEPARATOR '@||@')...
>> >  The separator should be something that does not occur in your
>> >  normal data stream.
>> > * Within the entity, define
>> >   <field column="myfield"/>
>> > * There are complications involved if NULL values are allowed
>> >   for the field, in which case you would need to use COALESCE,
>> >   maybe along with CAST
>> > * The transformer would look up "myfield", split along the separator,
>> >   and populate the multi-valued field.
>> >
>> > This *is* a little complicated, so I would also like to hear about
>> > possible alternatives.
>> >
>> > Regards,
>> > Gora
>>

Mime
View raw message