Difference between revisions of "Info: SphinxSearch" - New World Encyclopedia

From New World Encyclopedia
m (Add link for ExtensionFunctions.php)
m (Set the date)
Line 2: Line 2:
  | name    = SphinxSearch
  | purpose  = Replaces default MediaWiki search with [http://www.sphinxsearch.com/ Sphinx]
- | version  = 0.2
+ | version  = 0.3 (October 7, 2007)
  | author  = Svemir Brkic
  | download = [http://www.newworldencyclopedia.org/src/SphinxSearch/sphinx.conf sphinx.conf]<br/>
Line 22: Line 22:

  * Tweaked handling of namespaces, so initial search uses namespaces from user's preferences.
+
+ ===Additional change in version 0.3===
+
+ * Changed the queries to use page_id from page table as the primary key (instead of old_id from text table.) This makes it possible to avoid duplicates when using the incremental index.
+
+ * Changed suggested folder and config names in [http://www.newworldencyclopedia.org/src/SphinxSearch/sphinx.conf sphinx.config] file. You do not need to use our suggested names, we just like them better this way.

  ==Install Sphinx==

- To install our version of this extension, first [http://www.sphinxsearch.com/downloads.html download] and  [http://www.sphinxsearch.com/doc.html#installation install] Sphinx.
+ Before you install this extension, you need to [http://www.sphinxsearch.com/downloads.html download] and  [http://www.sphinxsearch.com/doc.html#installation install] Sphinx.

  Now create a sphinx.conf file. Sphinx comes with well-commented sample file, but if you want to use our code, you need to start with [http://www.newworldencyclopedia.org/src/SphinxSearch/sphinx.conf our sphinx.conf] and modify it further if necessary:
Line 41: Line 47:
   search --config /path/to/conf/sphinx.conf "test search"

- You will see the result stats immediatelly (Sphinx is FAST) but the actual article content will take a while to start displaying. Do not worry - that is only because of a sub-optimal sql_query_info in sphinx.conf file. In the extension we can get to the article content much faster because we will have both page_id and old_id available.
+ You will see the result stats immediately (Sphinx is FAST.) Note that the article data you see at this point comes from the sql_query_info in sphinx.conf file. In the extension we can get to the actual article content because we have text old_id available as an extra attribute. It would be slow to fetch article content on the command line (we would have to join page, revision, and text tables,) so we just fetch page_title and page_namespace at this point.

  ==Install the extension==

- Create a SphinxSearch sub-folder in your extensions folder and copy [http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch.php SphinxSearch.php] and [http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch_body.php SphinxSearch_body.php] there. Also, copy the sphinxapi.php file from your Sphinx installation api folder into SphinxSearch. Finally, if you do not have it already, download [http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExtensionFunctions.php ExtensionFunctions.php] library to your extensions folder.
+ * Create a SphinxSearch sub-folder in your extensions folder and copy [http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch.php SphinxSearch.php] and [http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch_body.php SphinxSearch_body.php] there.
+
+ * Copy the sphinxapi.php file from your Sphinx installation api folder into SphinxSearch.

- Add this line to your LocalSettings.php
+ * If you do not have it already, download [http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExtensionFunctions.php ExtensionFunctions.php] library to your extensions folder.
+
+ * Add this line to your LocalSettings.php

   require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
Line 64: Line 74:

  * We use SPH_MATCH_EXTENDED for better relevance weights, but we process the search term to make it assume an OR instead of an AND on multiple. This will be replaced with an option on the search form.
- * Indexing query currently uses the old_id from the text table as the primary key. That creates duplicate entries in the search results when incremental index indexes a newer revision. Maybe using page_id would solve that, but in that case search result excerpt would in some case come from a yet unindexed revision...

  * Due to the way the weights are calculated, it is hard to get title matches to always appear first. That can be solved by internally running the search twice, first time with @page_title attribute, second time with @old_text.

Revision as of 01:16, 8 October 2007

SphinxSearch
Purpose: Replaces default MediaWiki search with Sphinx
Author: Svemir Brkic
Version: 0.3 (October 7, 2007)
Extension files: sphinx.conf, SphinxSearch.php, SphinxSearch_body.php

Introduction

This site uses the Sphinx full-text search engine as a replacement for the standard MediaWiki search. Our code is based on the SphinxSearch extension by Paul Grinberg. The changes we made include:

  • Changed the overall model, based on the LuceneSearch extension, to make Sphinx the main (and only) search engine on this wiki.
  • Added support for the Go button, so the user is redirected to an exact title match automatically.
  • Changed the main data fetch query for the indexer to make it somewhat faster. Added a page_id attribute to the index to make fetching of search results faster as well.
  • Added an incremental index. We index the entire wiki every night and update the incremental index with recently changed articles several times a day.
  • Tweaked handling of namespaces, so the initial search uses the namespaces from the user's preferences.

Additional changes in version 0.3

  • Changed the queries to use page_id from the page table as the primary key (instead of old_id from the text table). This makes it possible to avoid duplicates when using the incremental index.
  • Changed the suggested folder and config names in the sphinx.conf file. You do not need to use our suggested names; we just like them better this way.
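
To see why the choice of key matters, here is a minimal shell sketch. It assumes a made-up listing with one line per indexed document in the form "<page_id> <old_id>", where the same page appears once in the nightly index and once, at a newer revision, in the incremental index. The function names and the sample IDs are ours and purely illustrative; they are not part of the extension.

```shell
# Count distinct documents when the given column is used as the document ID.
# Input lines: "<page_id> <old_id>" (hypothetical sample format).
# Column 1 = page_id, column 2 = old_id.
count_documents() {
    awk -v col="$1" '!seen[$col]++ { n++ } END { print n + 0 }'
}

# Hypothetical sample: page 12 was indexed with text row 100 in the full
# index and again with text row 140 in the incremental index.
sample_rows() {
    printf '%s\n' '12 100' '12 140' '15 101'
}
```

Running `sample_rows | count_documents 2` (keyed on old_id) counts page 12 twice, while `sample_rows | count_documents 1` (keyed on page_id) counts it once.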

Install Sphinx

Before you install this extension, you need to download and install Sphinx.

Now create a sphinx.conf file. Sphinx comes with a well-commented sample file, but if you want to use our code, you need to start with our sphinx.conf and modify it further if necessary:

  • You will need to update the file paths and the table names (if you use a table prefix).
  • If your wiki is very large, you may want to consider specifying a query range in the conf file.
  • If your wiki is not in English, you will need to change (or remove) the morphology attribute.
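
Before running the indexer, it can help to double-check which values your edited file actually declares. The helper below is a rough sketch, not part of Sphinx or of this extension: `conf_values` is a name we made up, and it only understands simple one-line "key = value" settings (it does not follow values continued over several lines with a trailing backslash).

```shell
# Print every value assigned to one setting in a sphinx.conf-style file.
# Usage: conf_values /path/to/conf/sphinx.conf morphology
conf_values() {
    conf_file="$1"
    key="$2"
    awk -v key="$key" '
        /^[ \t]*#/          { next }   # skip commented-out lines
        index($0, "=") == 0 { next }   # skip lines without an assignment
        {
            # Setting name: everything before the first "=", whitespace removed.
            name = substr($0, 1, index($0, "=") - 1)
            gsub(/[ \t]/, "", name)
            if (name == key) {
                # Value: everything after the first "=", trimmed.
                value = substr($0, index($0, "=") + 1)
                sub(/^[ \t]+/, "", value)
                sub(/[ \t]+$/, "", value)
                print value
            }
        }' "$conf_file"
}
```

For example, `conf_values /path/to/conf/sphinx.conf morphology` lists the morphology values in effect, and an empty result for `sql_query_range` tells you that no query range is configured.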

When done, run the indexer:

indexer --config /path/to/conf/sphinx.conf --all --rotate

You will see a report about documents being fetched etc. If everything seems fine, do a test search:

search --config /path/to/conf/sphinx.conf "test search"

You will see the result stats immediately (Sphinx is FAST). Note that the article data you see at this point comes from the sql_query_info setting in the sphinx.conf file. Fetching the full article content on the command line would be slow (we would have to join the page, revision, and text tables), so at this point we fetch only page_title and page_namespace. The extension can get to the actual article content because the old_id from the text table is available as an extra attribute.

Install the extension

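  • Create a SphinxSearch sub-folder in your extensions folder and copy SphinxSearch.php and SphinxSearch_body.php there.
  • Copy the sphinxapi.php file from the api folder of your Sphinx installation into SphinxSearch.
  • If you do not have it already, download the ExtensionFunctions.php library to your extensions folder.
  • Add this line to your LocalSettings.php:
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );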

Keep the index updated

Set up a cron job for the full index, for example once every night:

0 3 * * * /usr/local/bin/indexer --config /usr/local/etc/sphinx.conf wiki --rotate > /dev/null 2>&1

Set up a more frequent cron job to update the smaller index regularly:

0 9,15,22 * * * /usr/local/bin/indexer --config /usr/local/etc/sphinx.conf wikilatest --rotate > /dev/null 2>&1
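
The two cron lines differ only in the index name, so they can share one small wrapper script. The sketch below is our own suggestion rather than part of the original setup: the `run_index` function and the lock directory are assumptions, added so that a slow full run and an incremental run never rotate indexes at the same time. The default paths simply mirror the cron lines above.

```shell
#!/bin/sh
# Hypothetical cron wrapper; defaults mirror the paths used in the cron lines.
INDEXER="${INDEXER:-/usr/local/bin/indexer}"
SPHINX_CONF="${SPHINX_CONF:-/usr/local/etc/sphinx.conf}"
# Assumed lock location; any path writable by the cron user will do.
LOCK_DIR="${LOCK_DIR:-/tmp/sphinx-indexer.lock}"

run_index() {
    index_name="$1"
    # mkdir is atomic: it fails if another run already created the lock.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "indexer already running, skipping $index_name" >&2
        return 75   # arbitrary non-zero status meaning "try again later"
    fi
    status=0
    "$INDEXER" --config "$SPHINX_CONF" "$index_name" --rotate || status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}

# In a real script, finish with:  run_index "$1"
```

With that last line enabled and the script saved under a name of your choice, the cron entries would call the script with `wiki` or `wikilatest` as the only argument.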

Hacks and TODOs

  • We use SPH_MATCH_EXTENDED for better relevance weights, but we process the search term so that multiple words are combined with OR instead of AND. This will be replaced with an option on the search form.
  • Due to the way the weights are calculated, it is hard to get title matches to always appear first. That can be solved by internally running the search twice: the first time restricted to the @page_title field, the second time to @old_text.
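
Both items can be sketched in a few lines of shell. This is only an illustration of the idea, not the code used in the extension (which is PHP): the function names are made up, the OR rewrite relies on the "|" operator of the Sphinx extended query syntax, and the merge step assumes each search pass has already written its matching document IDs to a file, one per line.

```shell
# Join the words of a search term with the extended-syntax OR operator "|".
# Note: this sketch does not escape operator characters typed by the user.
or_query() {
    printf '%s\n' "$1" | awk '
        { for (i = 1; i <= NF; i++) out = out (out == "" ? "" : " | ") $i }
        END { print out }'
}

# Restrict an already-built query to one full-text field, e.g. page_title.
field_query() {
    printf '@%s (%s)\n' "$1" "$2"
}

# Merge two result lists (one document ID per line): title matches first,
# then body matches that were not already listed.
merge_title_first() {
    awk '!seen[$0]++' "$1" "$2"
}
```

For example, `field_query page_title "$(or_query 'new world')"` builds the title-only query for the first pass, and `merge_title_first title_ids.txt body_ids.txt` (hypothetical file names) puts the title hits ahead of the remaining body hits.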

Feedback

This wiki uses a separate system for public feedback. Click the feedback tab and register (it is an integrated WordPress blog). You may also send an email directly to svemir at thirdblessing dot net.