lucene-solr-user mailing list archives
From Aaron Greenspan <aar...@thinkcomputer.com>
Subject Setting up Solr
Date Wed, 05 Jun 2013 05:48:14 GMT
Hi everyone,

I thought I'd document my process of getting set up with Solr 4.3.0 on a Linux server in case
it's of use to anyone. I'm a moderately experienced Linux system administrator, so without
passing judgment (at least for now), let me just say that I found getting Solr to work to
be extremely difficult--more difficult than just about any other package I've ever dealt with,
including ones I've built from source.

I downloaded the .tgz file from the Apache site without a problem and decompressed it into
its own directory. I was surprised to find the unconventional (at least in my experience)
directory structure, where all of the important files are contained in the "example" directory
(and really a second "solr" directory under that). Following the directions I'd read on-line
and in the README, I got it running pretty quickly with "java -jar start.jar" and went to
the web interface on port 8983 of my server. Here's where the problems began.

First, a note: the install wiki contains an error, or at least a very misleading piece of
text, on the installation page (http://wiki.apache.org/solr/SolrInstall), one of many in the
wiki. Port 8983 is indeed "a port other than 8080." (And since I'm talking about errors, "containers"
should have an apostrophe.)

The server status dashboard showed up fine, and I poked around to figure out what was what.
In short order, I noticed that Solr had already thrown a warning on the Logging section about
"/non/existent/dir/yields/warning", which didn't make much sense to me since I hadn't really
done anything yet. I looked into that some more and wrote up a bug here: https://issues.apache.org/jira/browse/SOLR-4890.
I don't think I've ever seen another piece of software that deliberately warned users that
mistakes cause warnings, but I suppose there's a first time for everything.

Aside from that, I tried posting documents to the example collection1, which amazingly worked,
so, satisfied, I decided to delete it and make my own new collection.

This was a mistake, apparently. The Solr web console can't function without at least one core
at all times--but it doesn't tell you that until after you've deleted it and it's totally
non-functional. To a novice, this is scary. Hence bug number two: https://issues.apache.org/jira/browse/SOLR-3633.

I didn't have any idea how to get Solr working again--there are way too many XML configuration
files in way too many directories for a new user to figure them all out. So I just started
from scratch by decompressing the .tgz file again, and went back to my default state, which
again warned me about warnings.

Now I knew not to delete the collection1 core. So I left it alone, and tried to make a new
one of my own. This threw an error. The new core could not be created. Why? Because the user
is expected to create a directory ahead of time corresponding to that core via the shell,
at least according to Stefan Matheis in bug number three's discussion: https://issues.apache.org/jira/browse/SOLR-4461.

If you look at the comments for that bug you'll see what I wrote there: "So I created a new
folder with the name of the core I wanted in the same place that I found the collection1 folder.
That didn't work. I got the same error. Then I looked at the README.txt file for the collection1
folder and saw that you are actually supposed to duplicate the collection1 folder for your
new core. (In that case, the web UI, which really doesn't want you to delete collection1 anyway,
should just treat collection1 as some kind of default template that you are encouraged to
duplicate to create a new core.)
So with the folder duplicated, I tried adding my new core again. It kind of worked. I got
a new listing on the left-hand side, but I also got this new error:

SolrCore Initialization Failures

new_core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config for solrconfig.xml"
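
The folder duplication described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the host in the usage note is a placeholder, and copying just the conf/ subdirectory (rather than the whole collection1 folder) is a choice made here so that no existing index data comes along. The URL uses the CoreAdmin CREATE action with its "name" and "instanceDir" parameters.

```python
import shutil
from pathlib import Path
from urllib.parse import urlencode


def duplicate_core_conf(solr_home, template_core, new_core):
    """Copy the template core's conf/ directory into a new core directory.

    Hypothetical helper: only conf/ is copied, so index data is left behind.
    """
    source = Path(solr_home) / template_core / "conf"
    target = Path(solr_home) / new_core / "conf"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    # copytree creates the new core directory and conf/ inside it.
    shutil.copytree(source, target)
    return target.parent


def create_core_url(base_url, name, instance_dir):
    """Build a CoreAdmin CREATE request URL for the new core."""
    query = urlencode({"action": "CREATE", "name": name, "instanceDir": instance_dir})
    return f"{base_url}/admin/cores?{query}"
```

Calling duplicate_core_conf("example/solr", "collection1", "new_core") and then requesting create_core_url("http://localhost:8983/solr", "new_core", "new_core") would correspond to the two manual steps, assuming the default example layout.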

The suggested solution, involving file locking, did not help. solrconfig.xml wasn't loading
because my data import handler wasn't loading because MySQL's configuration, which I had put
into a data-config.xml file (which for some reason always has a hyphen in the docs when solrconfig.xml
does not), wasn't able to be used because the MySQL Java connector wasn't loading because
the connector was nowhere to be found. The wiki doesn't talk about this at all. The closest
it gets is "You might need to download and install the Oracle JDBC Driver in the /lib directory
of your Solr installation." under the heading of "Oracle Example." I don't use Oracle (even
though MySQL is an Oracle product now).

Only in some of the on-line documents I read did it even mention that there was something
you had to download separately from MySQL at http://www.mysql.com/downloads/connector/j/,
and the particular blog I found suggesting this told me to put the whole decompressed folder
in the lib directory, which I soon learned was not going to help at all. In fact, you just
needed the JAR file. But whether I put the JAR file in lib, or dist, or /var/share/java (with
or without symlinks), or any folder at all, Solr refused to find it.

Finally, I caught a clue from some of the lines referencing libraries in solrconfig.xml and
realized I might have to specifically tell Solr to look for the MySQL library by including
a line for it. First I tried using a basic regular expression of "mysql*\.jar", which didn't
work. Then in a fit of desperation (by this point I had consumed many, many hours on getting
seemingly nowhere with Solr) I tried "mysql-connector-java-\d.*\.jar" and it finally worked.
I'm not sure why, as I think both expressions are valid.
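
One likely explanation for the difference, assuming the value is treated as a regular expression rather than a shell-style glob: in a regular expression, "l*" means "zero or more of the letter l", not "anything after mysql". The sketch below uses Python's re module purely for illustration (these two patterns should behave the same way in Java's regex engine), and the JAR file name is a hypothetical example.

```python
import re

# Hypothetical file name; the actual Connector/J version will differ.
jar_name = "mysql-connector-java-5.1.25.jar"

# "l*" matches zero or more "l" characters, so this only matches names
# such as "mysq.jar", "mysql.jar", or "mysqlll.jar".
glob_like = re.compile(r"mysql*\.jar")

# "\d.*" matches a digit followed by any run of characters.
working = re.compile(r"mysql-connector-java-\d.*\.jar")


def matches_whole_name(pattern, name):
    """Return True when the pattern matches the entire file name."""
    return pattern.fullmatch(name) is not None


print(matches_whole_name(glob_like, jar_name))       # False
print(matches_whole_name(working, jar_name))         # True
print(matches_whole_name(glob_like, "mysqlll.jar"))  # True
```

A glob like "mysql*.jar" would translate to the regular expression "mysql.*\.jar", which also matches the connector's file name.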

But it didn't really work. Solr could connect to MySQL, but that didn't mean it could import
anything. For that I had to set up a proper Solr schema, which took me a while to understand
because the schema actually spans two XML configuration files, and because the "collection1"
example schema seemed to handle both information from a hypothetical relational database table
about products, and generic fields from Word documents. I planned to handle neither, but I
did need to import multiple tables. So I deleted all of those fields from my schema and learned
that doing so was a great way to make Solr crash, as in, fail to load your core at all and
throw lots of Java exceptions. In particular, it didn't like that I had removed the id and
_version_ fields. So, I put the fields back and then very carefully changed them until I got
my queries to work (after many more hours).

Solr crashed some more because I had date values in MySQL of 0000-00-00, which is a pretty
common occurrence. I needed to append "?zeroDateTimeBehavior=convertToNull" to my JDBC connection
string in data-config.xml for that to start working. This was not obvious to me.
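
As a minimal sketch of that change, the helper below appends the parameter to a JDBC URL string. The function name, host, port, and database name are hypothetical placeholders, not values from my setup.

```python
def with_zero_date_as_null(jdbc_url):
    """Append zeroDateTimeBehavior=convertToNull to a MySQL JDBC URL."""
    if "zeroDateTimeBehavior=" in jdbc_url:
        return jdbc_url  # already set; leave the URL unchanged
    # Use "&" when the URL already carries a query string, "?" otherwise.
    separator = "&" if "?" in jdbc_url else "?"
    return jdbc_url + separator + "zeroDateTimeBehavior=convertToNull"


# Hypothetical connection string for illustration.
print(with_zero_date_as_null("jdbc:mysql://localhost:3306/mydb"))
```

When the resulting URL contains more than one parameter and is pasted into an XML attribute, each "&" has to be written as "&amp;".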

Solr also crashed whenever I made an SQL error, of course. Only it never said that there was
an SQL error per se. (Isn't there a call to the C or Java equivalent of PHP's mysql_error()
in the JDBC connector somewhere?)

At one point I had an inner entity referencing an outer entity's ID that just refused to fill
in the variable ${outer.keyid} with anything--because, I realized, the keyid field was missing
from my query, because I had had to concatenate it with the table name as a string, e.g. CONCAT('table-',`keyid`)
AS `id`, in order to make one global "id" that Solr would like. For some reason, this missing
key failed silently, whereas everything else I did caused massive numbers of errors, and so
MySQL spent a lot of time looking for records that had a NULL key.
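
Since this failure is silent, a small check run before an import can catch it. The sketch below is a hypothetical helper, not part of Solr: it looks for ${entity.column} references in an inner query and reports any column that the outer query does not actually return. The table names and the "title" and "tag" columns are made up for the example.

```python
import re


def unresolved_outer_refs(entity_name, outer_columns, inner_query):
    """List columns referenced as ${entity.column} that the outer query lacks."""
    pattern = re.compile(
        r"\$\{" + re.escape(entity_name) + r"\.([A-Za-z_][A-Za-z0-9_]*)\}"
    )
    available = set(outer_columns)
    return [name for name in pattern.findall(inner_query) if name not in available]


inner = "SELECT tag FROM tags WHERE item_id = '${outer.keyid}'"

# Outer query: SELECT CONCAT('table-', keyid) AS id, title FROM items
print(unresolved_outer_refs("outer", ["id", "title"], inner))

# Outer query: SELECT CONCAT('table-', keyid) AS id, keyid, title FROM items
print(unresolved_outer_refs("outer", ["id", "keyid", "title"], inner))
```

The first call reports keyid as missing; the second reports nothing, because the outer query selects keyid alongside the concatenated id.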

The web interface was confusing in a number of respects. In its default state, the core selector
on the bottom left panel looks like a disabled combo box, so it took me about an hour to realize
it was even there. Documentation about Solr tends to reference "the query tab," but in 4.3.0
there is no "query tab," just this disabled combo box that happens to be hiding a query UI
and section for each core, completely separate from the Core Admin section at the top. Sometimes
when the web interface had to display an error, such as a long warning or error in the Logging
section, the left and right panes would become disjointed and content would start to pile
up on top of itself.

There appears to be no obvious way to secure the web interface with something as basic as
a username and password, which has me worried, and wondering how many vulnerable servers there
are out there with port 8983 open for all the world to see. Nor do I recall seeing any obvious
way to change the active port through the web interface to something else unlikely to be guessed
(though a port scan would render that irrelevant). All of the blogs I've found about the security
issue reference sections of configuration files that don't seem to exist anymore in 4.3.0
so I have no idea what to do.

Before I got MySQL running, the web interface unhelpfully told me that no data import modules
were set up, and rather than indicating what modules were available or some way I might be
able to change that or configure them, I was left to figure it out for myself. After I got
it running, it became apparent that knowing the *latest* status was for some reason an option,
and if you didn't check the box, you'd only have stale and unhelpful information, unless you
also looked at the command prompt. Depending on one's server, that might be easy or hard.

I eventually did get my MySQL import queries to work, and then tried some example searches.
I got back no results no matter what I tried. First, I realized that I had to reload each
core through the Core Admin section at the top in order for it to realize that there were
now documents present in the database. If I searched for *:*, Solr showed that there were
documents in the database. Still, for any other query, there were no results. Then I did some
digging around the internet and realized that I had to use the unintuitively named dismax query
parser, which I'd never heard of. Since every field in the web interface query section is
labeled with its code value, and no hint of what those letters might mean, I had no idea what
I was looking at (and basically still don't). Finally, though, I was able to get some basic
queries to work.
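
For reference, the two kinds of query can be sketched as request URLs. This assumes the default example host and port and the collection1 core; the search terms and the field names passed to qf ("title" and "body") are hypothetical and would need to match the actual schema. "defType" selects the query parser and "qf" lists the fields dismax searches.

```python
from urllib.parse import urlencode


def select_url(base_url, core, params):
    """Build a URL for a core's /select handler with encoded parameters."""
    return f"{base_url}/{core}/select?{urlencode(params)}"


base = "http://localhost:8983/solr"  # assumed default example address

# Match-all query: useful for confirming that documents were indexed.
match_all = select_url(base, "collection1", {"q": "*:*", "rows": 0})

# dismax query over hypothetical fields named "title" and "body".
dismax = select_url(
    base,
    "collection1",
    {"q": "solr setup", "defType": "dismax", "qf": "title body"},
)

print(match_all)
print(dismax)
```

One plausible reason plain queries return nothing is that they search only the default field, which may not be one of the fields the import populated; naming the fields in qf avoids relying on that default.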

This process leaves me with some questions for the Solr community:
- Are XML configuration files the best way to do this, or are they merely convenient for Java
programmers?
- Are XML configuration files that are 90% [unhelpful] comments and *deliberate, punitive,
pre-emptive warnings built-in* the best way to do user documentation?
- Why bother with a web interface if it's just going to force you to use the command line
anyway?

So at this point let me conclude by summarizing all of this with a more judgmental, and I
think substantiated, statement. Solr features the worst-designed user experience I have ever
seen in an enterprise-grade program, and I've used some pretty awful software (SCO OpenServer,
Microsoft Exchange Server 5.5, etc.). The search engine, in contrast, works great, which is
why I'm bothering to write this at all. Nonetheless, I don't care if it's open-source or closed-source.
No program should work like this--and certainly not anything called "version 4."

I say this not because I enjoy starting flame wars or because I have the time to participate
in them--I don't. I realize that there's a long history to Solr and I am the new kid who doesn't
get it. Nonetheless, that doesn't change the way it works, and many users will be just like
me. So know that I'd just like to see Solr improve--frankly, I need it to--and if these
issues were not already glaringly obvious, they should be now.

Aaron

	
Aaron Greenspan
President & CEO
Think Computer Corporation

telephone +1 415 670 9350
fax +1 415 373 3959
e-mail aarong@thinkcomputer.com
web http://www.thinkcomputer.com
