So you want to highlight search terms and excerpt relevant phrases in your search results, and you want it done using the same stemming rules that your search engine uses to gather results?
Update 1/18/10: Newer versions of Thinking Sphinx include support for excerpting and don’t work with the patch I present below. The patch worked in 1.13.5, but doesn’t work in 1.13.14. If you want the functionality presented here but have a newer version, try installing the plugin:
script/plugin install git@github.com:dfurber/thinking_sphinx_excerpts.git
Thinking Sphinx has the advantage of delta indexing. Ultrasphinx has lots of cool features such as excerpt highlighting. Wouldn’t it be nice if you could have TS’s delta indexing and search highlighting?
As it turns out, both Thinking Sphinx and Ultrasphinx are wrappers not only around Sphinx but also around Riddle, another plugin that does the actual interfacing with Sphinx.
Riddle has an excerpts method (along with many other cool toys that I haven’t played with yet) that Ultrasphinx exposes but Thinking Sphinx does not.
It turned out to be straightforward to take the excerpts method out of US and stick it into TS. Using the patched TS, the way to have your search terms excerpted and highlighted is simply as follows.
For a single model:
User.search "david", :excerpts => true
For multiple models:
ThinkingSphinx::Search.search "david", :excerpts => true, :classes => [User, Post, Photo, Event]
The patch for Paginating Find:
Thinking Sphinx patch for paginating_find
I used to be a fan of paginating_find over will_paginate. This is the version I use in production. It performs the search, runs the results through Sphinx’s excerpt highlighter, and returns the results wrapped in a Paging Enumerator.
class Object
  def _metaclass
    class << self
      self
    end
  end
end

module ThinkingSphinx
  class Search
    class << self

      # Overwrite the configured content attributes with excerpted and highlighted versions of themselves.
      # Runs run if it hasn't already been done.
      def excerpts(results, client, parsed_query)
        return if results.empty? or client.nil?

        options = {
          :before_match => '',
          :after_match => '',
          :chunk_separator => "…",
          :limit => 256,
          :around => 5
        }

        content_methods = %w{title name description} # the attributes of any model you would like to have excerpted

        # See what fields in each result might respond to our excerptable methods
        results_with_content_methods = results.map do |result|
          [result,
            content_methods.map do |methods|
              methods.detect do |this|
                result.respond_to? this
              end
            end
          ]
        end

        # Fetch the actual field contents
        docs = results_with_content_methods.map do |result, methods|
          methods.map do |method|
            method and strip_bogus_characters(result.send(method)) or ""
          end
        end.flatten

        excerpting_options = {
          :docs => docs,
          :index => "user_core", #MAIN_INDEX, # http://www.sphinxsearch.com/forum/view.html?id=100
          :words => strip_query_commands(parsed_query.to_s)
        }.merge(options)

        responses = client.excerpts(excerpting_options)
        responses = responses.in_groups_of(content_methods.size)

        results_with_content_methods.each_with_index do |result_and_methods, i|
          # Override the individual model accessors with the excerpted data
          result, methods = result_and_methods
          methods.each_with_index do |method, j|
            data = responses[i][j]
            if method
              result._metaclass.send('define_method', method) { data }
              attributes = result.instance_variable_get('@attributes')
              attributes[method] = data if attributes[method]
            end
          end
        end

        results = results_with_content_methods.map do |result_and_content_method|
          result_and_content_method.first.freeze
        end

        results
      end

      def search_with_excerpts_and_pagination(*args)
        query = args.clone # an array
        options = query.extract_options!

        retry_search_on_stale_index(query, options) do
          results, client = search_results(*(query + [options]))

          ::ActiveRecord::Base.logger.error(
            "Sphinx Error: #{results[:error]}"
          ) if results[:error]

          klass = options[:class]
          page = options[:page] ? options[:page].to_i : 1
          total = results[:total]

          results = ThinkingSphinx::Collection.create_from_results(results, page, client.limit, options)
          if options[:excerpts] and !results.empty?
            results = excerpts(results, client, query)
          end

          if options[:page]
            PagingEnumerator.new(client.limit, total, false, page, 1) do |pg|
              results
            end
          else
            results
          end
        end
      end

      alias_method_chain :search, :excerpts_and_pagination

      def strip_bogus_characters(s)
        # Used to remove some garbage before highlighting
        s.gsub(/<.*?>|\.\.\.|\342\200\246|\n|\r/, " ").gsub(/http.*?( |$)/, ' ') if s
      end

      def strip_query_commands(s)
        # XXX Hack for query commands, since Sphinx doesn't intelligently parse the query in excerpt mode
        # Also removes apostrophes in the middle of words so that they don't get split in two.
        s.gsub(/(^|\s)(AND|OR|NOT|\@\w+)(\s|$)/i, "").gsub(/(\w)\'(\w)/, '\1\2')
      end

    end
  end
end
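The patch leans on two small tricks: the excerpted strings come back as one flat list that has to be regrouped per result, and each result then gets its accessor overridden on that one object only. Here is a minimal plain-Ruby sketch of just those two steps. Doc, fake_excerpts, and the <b> markers are stand-ins invented for this illustration and are not part of Thinking Sphinx or Riddle; the sketch also swaps in each_slice and define_singleton_method for ActiveSupport's in_groups_of and the _metaclass helper, so treat it as a rough model rather than the patch itself.

```ruby
# Stand-in for a search result; the real patch works on ActiveRecord objects.
Doc = Struct.new(:title, :description)

# Hypothetical excerpter: wraps each case-insensitive match of +word+ in <b> tags.
# The real patch sends the docs to Sphinx through client.excerpts instead.
def fake_excerpts(docs, word)
  docs.map { |text| text.gsub(/#{Regexp.escape(word)}/i) { |m| "<b>#{m}</b>" } }
end

content_methods = %w[title description]
results = [
  Doc.new("David's page", "All about David"),
  Doc.new("Photos", "Taken by david last year")
]

# Flatten every excerptable field into one list, as the patch does for :docs.
docs = results.map { |r| content_methods.map { |m| r.send(m).to_s } }.flatten

# One response per doc comes back in the same order; regroup them per result.
responses = fake_excerpts(docs, "david").each_slice(content_methods.size).to_a

results.each_with_index do |result, i|
  content_methods.each_with_index do |method, j|
    data = responses[i][j]
    # Override the accessor on this one object only; other Doc instances are untouched.
    result.define_singleton_method(method) { data }
  end
end

puts results.first.title
puts results.last.description
```

Because the override lives on the individual object, a freshly built Doc still returns its raw text, which is why the patch can highlight search results without touching the model class.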
The Will Paginate Version
People asked about a will_paginate version, so I put this together and tested it. It performs the query on Sphinx, then does the highlighting, and passes the results array through Thinking Sphinx’s own will_paginate collection class.
Thinking Sphinx patch for will_paginate
class Object
  def _metaclass
    class << self
      self
    end
  end
end

module ThinkingSphinx
  class Search
    class << self

      # Overwrite the configured content attributes with excerpted and highlighted versions of themselves.
      # Runs run if it hasn't already been done.
      def excerpts(results, client, parsed_query)
        return if results.empty? or client.nil?

        options = {
          :before_match => '',
          :after_match => '',
          :chunk_separator => "…",
          :limit => 256,
          :around => 5
        }

        content_methods = %w{title name description} # the attributes of any model you would like to have excerpted

        # See what fields in each result might respond to our excerptable methods
        results_with_content_methods = results.map do |result|
          [result,
            content_methods.map do |methods|
              methods.detect do |this|
                result.respond_to? this
              end
            end
          ]
        end

        # Fetch the actual field contents
        docs = results_with_content_methods.map do |result, methods|
          methods.map do |method|
            method and strip_bogus_characters(result.send(method)) or ""
          end
        end.flatten

        excerpting_options = {
          :docs => docs,
          :index => "user_core", #MAIN_INDEX, # http://www.sphinxsearch.com/forum/view.html?id=100
          :words => strip_query_commands(parsed_query.to_s)
        }.merge(options)

        responses = client.excerpts(excerpting_options)
        responses = responses.in_groups_of(content_methods.size)

        results_with_content_methods.each_with_index do |result_and_methods, i|
          # Override the individual model accessors with the excerpted data
          result, methods = result_and_methods
          methods.each_with_index do |method, j|
            data = responses[i][j]
            if method
              result._metaclass.send('define_method', method) { data }
              attributes = result.instance_variable_get('@attributes')
              attributes[method] = data if attributes[method]
            end
          end
        end

        results.results = results_with_content_methods.map do |result_and_content_method|
          result_and_content_method.first.freeze
        end

        results
      end

      def search_with_excerpts(*args)
        query = args.clone # an array
        options = query.extract_options!

        retry_search_on_stale_index(query, options) do
          results, client = search_results(*(query + [options]))

          ::ActiveRecord::Base.logger.error(
            "Sphinx Error: #{results[:error]}"
          ) if results[:error]

          klass = options[:class]
          page = options[:page] ? options[:page].to_i : 1
          total = results[:total]

          results = ThinkingSphinx::Collection.create_from_results(results, page, client.limit, options)
          if options[:excerpts] and !results.empty?
            results = excerpts(results, client, query)
          end

          results
        end
      end

      alias_method_chain :search, :excerpts

      def strip_bogus_characters(s)
        # Used to remove some garbage before highlighting
        s.gsub(/<.*?>|\.\.\.|\342\200\246|\n|\r/, " ").gsub(/http.*?( |$)/, ' ') if s
      end

      def strip_query_commands(s)
        # XXX Hack for query commands, since Sphinx doesn't intelligently parse the query in excerpt mode
        # Also removes apostrophes in the middle of words so that they don't get split in two.
        s.gsub(/(^|\s)(AND|OR|NOT|\@\w+)(\s|$)/i, "").gsub(/(\w)\'(\w)/, '\1\2')
      end

    end
  end
end
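If you want to see what the query clean-up actually hands to the excerpter, the strip_query_commands helper can be tried on its own in irb. The first method below is copied from the patch as-is. The second, strip_query_commands_spaced, is a hypothetical variant of my own naming that is not part of the patch; it only illustrates one possible tweak, under the assumption that you would rather keep the words on either side of an operator separate.

```ruby
# Copied from the patch: drops AND/OR/NOT and @field operators, then removes
# apostrophes that sit in the middle of a word.
def strip_query_commands(s)
  s.gsub(/(^|\s)(AND|OR|NOT|\@\w+)(\s|$)/i, "").gsub(/(\w)\'(\w)/, '\1\2')
end

# Hypothetical variant (not in the patch): substitute a space instead of an
# empty string so the words around an operator are not joined together.
def strip_query_commands_spaced(s)
  s.gsub(/(^|\s)(AND|OR|NOT|\@\w+)(\s|$)/i, " ").gsub(/(\w)\'(\w)/, '\1\2').strip
end

puts strip_query_commands("@title david")             # field operator at the start
puts strip_query_commands("o'brien")                  # apostrophe inside a word
puts strip_query_commands("david AND goliath")        # the two words end up joined
puts strip_query_commands_spaced("david AND goliath") # the two words stay apart
```

The third call shows a quirk worth knowing about: the pattern consumes the whitespace on both sides of the operator, so replacing the match with an empty string glues the neighbouring words together.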
Simply place the code in a file in the config/initializers folder of your application, and the magic will appear the way it almost always does in Rails.
Update 6/4/09 @ 9:30AM:
1. The excerpt method in Thinking Sphinx wants to examine one of your indices (any one) for character encoding and such. See the Sphinx forum thread linked in the patch’s comment next to :index. I have used “user_core”. If you don’t have a User model or haven’t defined an index on it, then you will get an “unknown index” error. Simply search the patch for “user_core” and replace it with “#{model_you_have_indexed}_core”.
2. There was an error in the will_paginate version in which the excerpt method was taking the paginated collection and returning a simple array. The patch has been updated so that the excerpt method replaces the results returned by TS with the excerpt highlighted results, without removing the pagination info. My apologies to those who were caught by this problem.
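The pagination problem in item 2 is easy to reproduce in plain Ruby. This is only a sketch: PagedResults is a hypothetical stand-in for a paginated collection, not Thinking Sphinx's real class, and it uses Array#replace where the patch assigns to results.results. The underlying point is the same, though: mapping over a collection hands back a bare Array, while updating the collection in place keeps its paging info.

```ruby
# Hypothetical stand-in for a paginated collection: an Array subclass that
# carries paging info alongside the records.
class PagedResults < Array
  attr_accessor :total_entries, :current_page
end

page = PagedResults.new(%w[david davina])
page.total_entries = 42
page.current_page  = 1

# Array#map builds a plain Array, so the paging info does not come along.
highlighted = page.map { |name| "<b>#{name}</b>" }
puts highlighted.class
puts highlighted.respond_to?(:total_entries)

# Replacing the contents in place keeps the same object and its paging info.
page.replace(highlighted)
puts page.class
puts page.total_entries
```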
Update 10/06/09
There has been a new version of Thinking Sphinx since I wrote this. I have not had the chance to update this code to match the new plugin. So, if it doesn’t work, that may be why. When I have a chance, I’ll fix it up.