Rails and SOLR: My RubyNation 2012 Talk

My talk, Rails and the Apache SOLR Search Engine, is now available online, in PDF form or via Slideshare. I just presented it on March 24, 2012 at RubyNation 2012.



Comments

David Keener By dkeener on Saturday, March 31, 2012 at 04:26 PM EST

I was pleased that my talk was well-received by the audience at RubyNation 2012. At the conclusion, there was a fairly extensive question and answer session. I thought it might be helpful to Rubyists to append some of the questions and their answers to this blog entry. Many thanks to my co-worker, Johnathan Quigg, for taking notes about the questions that were asked.

If you want to support highlighting with SOLR, is it absolutely necessary to store the full content for an attribute in the index?

There are trade-offs here. You don't need to store the attribute in order to highlight it if you're willing to use your own object#highlight method. But if you do store the field, you have the advantage of not having to instantiate the object to retrieve the highlight. When displaying a page with 100 results, you wouldn't need any of the objects to show highlights, which can provide a performance boost.
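As a rough sketch of the stored-field approach: when highlighting is enabled on a stored field, Solr's response carries a "highlighting" section keyed by document ID, so the snippets can be displayed straight from the response. The helper name highlights_by_id, the field name body_text, and the sample response below are all made up for illustration; this is not Sunspot's own API.

```ruby
# A parsed Solr response (sample data, made up for illustration). The
# "highlighting" section maps each document ID to its snippets per field.
sample_response = {
  'response' => {
    'numFound' => 2,
    'docs' => [{ 'id' => 'Post 1' }, { 'id' => 'Post 2' }]
  },
  'highlighting' => {
    'Post 1' => { 'body_text' => ['Rails and the <em>Solr</em> search engine'] },
    'Post 2' => {}
  }
}

# Returns a hash of document ID => array of snippets for the given field.
# No database records are loaded; everything comes from the Solr response.
def highlights_by_id(response, field)
  highlighting = response.fetch('highlighting', {})
  response.fetch('response').fetch('docs').each_with_object({}) do |doc, result|
    id = doc.fetch('id')
    result[id] = highlighting.fetch(id, {}).fetch(field, [])
  end
end

highlights_by_id(sample_response, 'body_text').each do |id, snippets|
  puts "#{id}: #{snippets.join(' ... ')}"
end
```

A document with no snippet for the field simply gets an empty list, so the results page can fall back to plain text for it.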




David Keener By dkeener on Saturday, March 31, 2012 at 04:41 PM EST

Another question from the audience...

How fast does SOLR index documents?

Indexing speed depends on a lot of different factors, including the size of the document, the power of the underlying hardware (including both CPU and I/O performance), and the complexity of the analysis that needs to be done. Accordingly, there are reports of indexing speeds varying from 10 documents per second to more than 150 documents per second.

I'm not aware of any detailed benchmarks that have been compiled, but somebody out there on the Internet has probably got something.




David Keener By jquigg on Saturday, March 31, 2012 at 04:45 PM EST

Indexing performance definitely depends on some things external to Solr, but Sunspot does actually publish some results that they have seen. I don't remember the exact URL, but it's out there on the Sunspot Wiki. The numbers they report are pretty high.


David Keener By dkeener on Saturday, March 31, 2012 at 05:57 PM EST

In my talk, I mentioned the need to sanitize user-provided search criteria if they were going to be re-displayed (so that, as an example, illicit JavaScript isn't executed). Another audience member asked about the security implications of not sanitizing the SOLR query itself.

It's true that a user may be able to modify the query, but this depends entirely on the query parser that is used. The default parser for SOLR with Sunspot is the Dismax parser, which supports a very limited query syntax. With the limited syntax, query modification isn't a problem.

However, we use the Extended Dismax parser on our project, which allows full Lucene query syntax and is quite powerful. We were actually dinged during a security code review for allowing "{}" and "[]" characters in our queries, which let the user dynamically modify the query and essentially search for anything they want. SOLR actually has a page that describes the sanitization of a full-powered Lucene query.
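One common way to sanitize user input for a full Lucene-syntax parser is to backslash-escape every character that has special meaning in the query syntax. The sketch below assumes that approach; the helper name escape_lucene_query is hypothetical, and the character list is taken from the Lucene query syntax documentation rather than from our project's code.

```ruby
# Hypothetical helper: prefix each Lucene special character with a backslash
# so that user input is treated as literal text rather than query syntax.
# "/" is special only in newer Lucene versions; escaping it elsewhere is harmless.
def escape_lucene_query(input)
  special = /[+\-!(){}\[\]^"~*?:\\\/&|]/
  input.to_s.gsub(special) { |char| "\\#{char}" }
end

puts escape_lucene_query('title:[a TO z] AND {secret}')
```

Note that this only handles punctuation. Operator keywords such as AND, OR, NOT, and TO are left alone, so an application that wants to neutralize those as well would need an extra step, such as downcasing the input.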




David Keener By jquigg on Saturday, March 31, 2012 at 06:01 PM EST

More on the security: Search engines are designed to be agnostic w.r.t. security. There's another Apache product called ManifoldCF that is specifically designed for search-based security and integration with Active Directory. Manifold supports two products: Solr and, of all things, EMC Documentum.

Security and search are at odds with each other. When indexing, the ideal situation is to denormalize your database into a document. You'll have many documents, and the documents will be complex, holding all of the related items rolled up into them. This is absolutely ideal for search because when determining relevance, you can easily do so within a document. Keep in mind that these virtual documents are what you hit on.

With a security model that has ACLs on sub-documents, you're going to need to balance your denormalization at the correct point. If I shouldn't be able to see a SubPost on a Post, then the Solr Post document should not contain the SubPost. This impacts relevance.
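To make the Post and SubPost case concrete, here is a minimal sketch of one way to stop the denormalization at the ACL boundary. The hash layout, the :restricted flag, and the field names are all assumptions for illustration; a real application would derive them from its own models and security rules.

```ruby
# Build the denormalized Solr document for a post, rolling up only the
# sub-posts that every reader of the post is allowed to see.
# The :restricted flag is a hypothetical stand-in for a real ACL check.
def post_document(post)
  visible = post[:sub_posts].reject { |sub| sub[:restricted] }
  {
    'id' => "Post #{post[:id]}",
    'title_text' => post[:title],
    'sub_post_text' => visible.map { |sub| sub[:body] }
  }
end

post = {
  id: 1,
  title: 'Search tips',
  sub_posts: [
    { id: 10, body: 'Use stored fields for highlighting', restricted: false },
    { id: 11, body: 'Internal pricing notes', restricted: true }
  ]
}

puts post_document(post).inspect
```

The restricted SubPost text never reaches the Post document, so a query that matches only that text can no longer find the Post or contribute to its score. That is the relevance cost of the security boundary.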

The way that ManifoldCF gets around this issue is that there are multiple indices, and users are allowed to search an index based on what they are allowed to see.


