Difference between revisions of "Info: SphinxSearch" - New World Encyclopedia

From New World Encyclopedia
m (Break things down a little more)
m (Update version)
 
(4 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
| name    = SphinxSearch
 
| name    = SphinxSearch
 
| purpose  = Replaces default MediaWiki search with [http://www.sphinxsearch.com/ Sphinx]
 
| purpose  = Replaces default MediaWiki search with [http://www.sphinxsearch.com/ Sphinx]
| version  = 0.3 (October 7, 2007)
+
| version  = 0.7
| author  = Svemir Brkic
+
| author  = Svemir Brkic and Paul Grinberg
| download = [http://www.newworldencyclopedia.org/src/SphinxSearch/sphinx.conf sphinx.conf]<br/>
+
| download = [http://www.mediawiki.org/wiki/Extension:SphinxSearch Extension page at MediaWiki.org]<br/>
[http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch.php SphinxSearch.php]<br/>
+
[http://sourceforge.net/projects/sphinxsearch/ Files hosted at SourceForge]
[http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch_body.php SphinxSearch_body.php]
 
 
}}
 
}}
  
==Introduction==
+
===Introduction===
  
This site uses [http://www.sphinxsearch.com/ Sphinx full-text search engine] as a replacement for standard [http://www.mediawiki.org/ MediaWiki] search. Our code is based on [http://www.mediawiki.org/wiki/Extension:SphinxSearch SphinxSearch extension] by Paul Grinberg. The changes we made include:
+
This site uses the [http://www.sphinxsearch.com/ Sphinx full-text search engine] as a replacement for the standard [http://www.mediawiki.org/ MediaWiki] search. Our initial code was based on the [http://www.mediawiki.org/wiki/Extension:SphinxSearch SphinxSearch extension] by Paul Grinberg, and the changes we made were later merged into the official version.
  
* Changed the overall model based on [http://www.mediawiki.org/wiki/Extension:LuceneSearch LuceneSearch extension] to make Sphinx the main (and only) search engine on this wiki.
+
===Installation===
  
* Added support for the Go button so the user is redirected to an exact title match automatically.
+
See the [http://www.mediawiki.org/wiki/Extension:SphinxSearch official extension page] for detailed installation instructions for both [http://www.sphinxsearch.com/ Sphinx] and the extension itself. If you have already tried to install [http://www.mediawiki.org/wiki/Extension:LuceneSearch LuceneSearch] and found it too complicated, you will probably be pleasantly surprised by Sphinx.
  
* Changed the main data fetch query for the indexer to make it somewhat faster. Added page_id attribute to the index to make fetching of search results faster as well.
+
===Development===
  
* Added an incremental index. We index the entire wiki every night and update the incremental index with recently changed articles several times a day.
+
We are still working to improve this extension to make it use more of the Sphinx capabilities and to make it better in the wiki context. We will be adding more options, better title matching, and image thumbnails in search results.
  
* Tweaked handling of namespaces, so the initial search uses namespaces from the user's preferences.
+
===Feedback===
  
===Additional change in version 0.3===
+
This wiki uses a [[Info:Feedback|separate system]] for public feedback. Click the feedback tab and register (it is an integrated WordPress blog). You may also send an email directly to ''svemir at thirdblessing dot net''. The best way to communicate regarding this extension is the [http://www.mediawiki.org/wiki/Extension_talk:SphinxSearch extension talk page] at MediaWiki.org.
 
 
* Changed the queries to use page_id from page table as the primary key (instead of old_id from text table.) This makes it possible to avoid duplicates when using the incremental index.
 
 
 
* Changed the suggested folder and config names in the [http://www.newworldencyclopedia.org/src/SphinxSearch/sphinx.conf sphinx.conf] file. You do not need to use our suggested names; we just like them better this way.
 
 
 
==Install Sphinx==
 
 
 
Before you install this extension, you need to [http://www.sphinxsearch.com/downloads.html download] and  [http://www.sphinxsearch.com/doc.html#installation install] Sphinx.
 
 
 
Now create a sphinx.conf file. Sphinx comes with a well-commented sample file, but if you want to use our code, you need to start with [http://www.newworldencyclopedia.org/src/SphinxSearch/sphinx.conf our sphinx.conf] and modify it further if necessary.
 
 
 
* Set correct database, username, and password for your MediaWiki database
 
* Update table names in SQL queries if your MediaWiki installation uses a prefix
 
* Update the file paths (/var/data/sphinx/..., /var/log/sphinx/...) and create folders as necessary
 
* If your wiki is very large, you may want to consider specifying a [http://www.sphinxsearch.com/doc.html#ranged-queries query range] in the conf file.
 
* If your wiki is not in English, you will need to change (or remove) the morphology attribute.
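The folder-creation step in the list above can be sketched as a small shell helper. This is only an illustration, not part of the extension: the function name make_sphinx_dirs and the scratch-directory dry run are assumptions, while the /var/data/sphinx and /var/log/sphinx paths are the sample paths mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: create the data and log folders that sphinx.conf refers to.
# The first argument is a root prefix, so the sketch can be tried out safely;
# pass an empty string to create the folders under the real filesystem root.
make_sphinx_dirs() {
    root="$1"
    mkdir -p "$root/var/data/sphinx" "$root/var/log/sphinx"
}

# Dry run against a scratch directory instead of the real /var tree.
scratch="$(mktemp -d)"
make_sphinx_dirs "$scratch"
ls -d "$scratch/var/data/sphinx" "$scratch/var/log/sphinx"
```

On a real server you would probably run this as root and then make sure the account that runs indexer and searchd can write to both folders.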
 
 
 
==Index and test==
 
 
 
When done, run the indexer:
 
 
 
indexer --config /path/to/conf/sphinx.conf --all
 
 
 
You will see a report about documents being fetched etc. If everything seems fine, do a test search:
 
 
 
search --config /path/to/conf/sphinx.conf "test search"
 
 
 
You will see the result stats immediately (Sphinx is FAST). Note that the article data you see at this point comes from the sql_query_info setting in the sphinx.conf file. In the extension we can get to the actual article content because we have the text old_id available as an extra attribute. It would be slow to fetch article content on the command line (we would have to join the page, revision, and text tables), so we just fetch page_title and page_namespace at this point.
 
 
 
==Install the extension==
 
 
 
* Create a SphinxSearch sub-folder in your extensions folder and copy [http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch.php SphinxSearch.php] and [http://www.newworldencyclopedia.org/src/SphinxSearch/SphinxSearch_body.php SphinxSearch_body.php] there.
 
 
 
* Copy the sphinxapi.php file from your Sphinx installation api folder into SphinxSearch.
 
 
 
* If you do not have it already, download the [http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExtensionFunctions.php ExtensionFunctions.php] library to your extensions folder.
 
 
 
* Add this line to your LocalSettings.php
 
 
 
require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );
 
 
 
==Start the search daemon==
 
 
 
The daemon listens for search queries from clients, such as the extension you just installed. You can start it manually like this:
 
 
 
searchd --config /usr/local/etc/sphinx.conf &
 
 
 
You probably also want to make it start automatically when your server is rebooted. Ours has not needed a reboot since we got it almost two years ago, so I cannot guarantee this will work :-) The simplest way is to add a line like this to your /etc/rc.local file. Make sure to set the correct path to searchd.
 
 
 
/usr/local/bin/searchd/searchd --config /usr/local/etc/sphinx.conf
 
 
 
==Keep the index updated==
 
 
 
Set up a cron job for the full index - for example once every night:
 
 
 
0 3 * * * /usr/local/bin/indexer --config /usr/local/etc/sphinx.conf wiki_main --rotate > /dev/null 2>&1
 
 
 
Set up a more frequent cron job to update the smaller index regularly:
 
 
 
0 9,15,22 * * * /usr/local/bin/indexer --config /usr/local/etc/sphinx.conf wiki_incremental --rotate > /dev/null 2>&1
 
 
 
Note that the --rotate option is needed if the searchd daemon is already running, so that the indexer does not modify the index files while they are being used. It creates new files and copies them over the existing ones when it is done.
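As a rough sketch of that rule, a wrapper script could add --rotate only when searchd is actually running. The function name build_indexer_cmd and its "yes"/"no" argument are assumptions made for this example; it only prints the command line and does not run the indexer.

```shell
#!/bin/sh
# Build (but do not run) an indexer command line for one index.
# Arguments: config path, index name, and "yes" if searchd is currently running.
build_indexer_cmd() {
    conf="$1"
    index="$2"
    searchd_running="$3"
    cmd="indexer --config $conf $index"
    # --rotate is only needed while searchd is serving the index.
    if [ "$searchd_running" = "yes" ]; then
        cmd="$cmd --rotate"
    fi
    printf '%s\n' "$cmd"
}

build_indexer_cmd /usr/local/etc/sphinx.conf wiki_main yes
build_indexer_cmd /usr/local/etc/sphinx.conf wiki_incremental no
```

In a real cron wrapper, the third argument could be derived from a process check such as pgrep -x searchd, assuming pgrep is available on your system.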
 
 
 
==Hacks and TODOs==
 
 
 
* We use SPH_MATCH_EXTENDED for better relevance weights, but we process the search term to make it assume an OR instead of an AND on multiple words. This will be replaced with an option on the search form.
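To illustrate the OR processing described above, here is a minimal shell sketch that joins the words of a search term with the extended-syntax OR operator (|). It is not the extension's PHP code; the function name to_or_query is an assumption, and the sketch does not handle terms that already contain query operators or quotes.

```shell
#!/bin/sh
# Join the words of a search term with the extended-syntax OR operator "|".
to_or_query() {
    # $1 is left unquoted on purpose so it splits on whitespace;
    # set -f prevents wildcard characters from being expanded to file names.
    set -f
    out=""
    for word in $1; do
        if [ -z "$out" ]; then
            out="$word"
        else
            out="$out | $word"
        fi
    done
    set +f
    printf '%s\n' "$out"
}

to_or_query "new world encyclopedia"
```

With the sample input, the function prints new | world | encyclopedia, which an extended-mode search treats as a match on any of the three words.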
 
 
 
* Due to the way the weights are calculated, it is hard to get title matches to always appear first. That can be solved by internally running the search twice, first time with @page_title attribute, second time with @old_text.
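The two-pass idea above can be sketched independently of Sphinx: given the ranked results of a title search and of a text search, list the title matches first and then only those text matches that have not appeared yet. The function name merge_title_first, the file names, and the sample page ids below are all hypothetical.

```shell
#!/bin/sh
# Print title matches first, then text matches that were not already listed.
# Both arguments are files with one page_id per line, in ranked order.
merge_title_first() {
    # awk keeps the first occurrence of each line and drops later repeats.
    awk '!seen[$0]++' "$1" "$2"
}

tmp="$(mktemp -d)"
printf '12\n7\n' > "$tmp/title_hits"        # hypothetical title-search results
printf '3\n7\n12\n40\n' > "$tmp/text_hits"  # hypothetical text-search results
merge_title_first "$tmp/title_hits" "$tmp/text_hits"
```

Here pages 12 and 7 stay on top because they matched by title, and pages 3 and 40 follow in their original text-search order.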
 
 
 
==Feedback==
 
 
 
This wiki uses a [[Info:Feedback|separate system]] for public feedback. Click the feedback tab and register (it is an integrated WordPress blog). You may also send an email directly to ''svemir at thirdblessing dot net''.
 

Latest revision as of 18:32, 25 February 2010

SphinxSearch
Purpose Replaces default MediaWiki search with Sphinx
Author Svemir Brkic and Paul Grinberg
Version 0.7
Extension files
Extension page at MediaWiki.org

Files hosted at SourceForge

Introduction

This site uses the Sphinx full-text search engine as a replacement for the standard MediaWiki search. Our initial code was based on the SphinxSearch extension by Paul Grinberg, and the changes we made were later merged into the official version.

Installation

See the official extension page for detailed installation instructions for both Sphinx and the extension itself. If you have already tried to install LuceneSearch and found it too complicated, you will probably be pleasantly surprised by Sphinx.

Development

We are still working to improve this extension to make it use more of the Sphinx capabilities and to make it better in the wiki context. We will be adding more options, better title matching, and image thumbnails in search results.

Feedback

This wiki uses a separate system for public feedback. Click the feedback tab and register (it is an integrated WordPress blog). You may also send an email directly to svemir at thirdblessing dot net. The best way to communicate regarding this extension is the extension talk page at MediaWiki.org.