TSEP - The Search Engine Project

Author's Notes

TSEP (The Search Engine Project) has been in development since 2004. Girish started the project because a site of approximately 150 pages needed a search engine. He went through the Open Source search engine software available at the time, but was unable to find any that was easy for a webmaster to understand and set up. Since then he has built and improved this software and will continue to do so. Olaf joined with v0.9 beta and has been responsible for the (main) development and documentation since then.

The primary objective of this software is 'ease of use'. If you still think that this software is difficult to set up and / or use, please let us know: we will help you personally, and your feedback makes us aware of the complexities. By submitting this software to the Open Source community we strive for it to become the most powerful personal site search engine in the world.

We have made every effort to make the copyright notice nice and small, so please do not remove it from your site, so that others too can discover what a great tool TSEP is.

We are very interested in where TSEP is being used. We would therefore really appreciate it if you could contact us with the web address where we can take a look (and grab a screenshot) or, if it is an intranet, send us a screenshot.

A word about versioning: we publish a new version whenever we think one is ready. A version number does not indicate the quantity or complexity of the changes applied since its previous version. We might add a 0.001 to a version number but still have made huge changes to TSEP. In brief, we recommend downloading every new version of TSEP.

^ The Search Engine Project

^Installing TSEP

Before you start, back up your files and your database!

Also, when upgrading from a previous version, follow the installation procedure completely.

If - and only if - you have installed version 0.917 or above, you can run the 'update 0923.sql' in phpMyAdmin to insert the new variable into the config, which came into use in version 0.923.

  1. Unzip the downloaded TSEP files. Make sure that the folder structure is preserved.
  2. Upload all files to your website, again keeping the folder structure intact.
  3. If you know what this means: chmod 666 the DBConnectionData.php file.
  4. Open the install.php file in the admin directory and set the values which are appropriate for your database.
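
If you are unsure whether the chmod in step 3 took effect, a small check like the following can help. This is only a sketch: the helper name is hypothetical and not part of TSEP, and the path you pass in depends on where DBConnectionData.php sits in your installation.

```php
<?php
// Hypothetical helper, not part of TSEP: reports whether the PHP process
// can read and write the given file, which is what "chmod 666" aims for.
function tsep_can_read_and_write($path)
{
    return is_file($path) && is_readable($path) && is_writable($path);
}

// Example call; the path is an assumption, adjust it to your installation.
// var_dump(tsep_can_read_and_write('DBConnectionData.php'));
```

If the function returns false, the web server user most likely lacks write permission on the file.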

Continue with the next step: configuration.php

These files - when correctly executed - will create the database tables with some starting values. Below is a screenshot of the database model:

[Image: Database model for TSEP]

  1. Open the configuration.php in the admin directory and make all the changes to your liking. Pay special attention to the TSEP path and the TSEP language.
  2. Decide whether you want to:
    1. use the pre-made search page, or
    2. include TSEP in your own layout.

    1. If you use the pre-made search page 'tsepsearch.php', you are done with the installation and can continue with Security.
    2. If you include the search page in your own page / layout, follow these steps and make sure that the paths to the two files are set correctly!
      1. Make sure your page is processed as a PHP page. Usually this is the case when your page has the extension .php or .php3. Then add the following code between the <head> and </head> tags of your SEARCH page - otherwise the search will look ugly! This example assumes that search.php is located in the same directory as your search page and that tsep.css is in a subdirectory called "css".
        <link href="css/tsep.css" rel="stylesheet" type="text/css" />
      2. Add search.php to your page at the position where you want the search function to appear, like this:
        <?php require ('search.php'); ?>

      You have now integrated TSEP into your own page. Continue with Security.

^Security

For security you might want to protect the include and the admin directories using .htaccess!

^First-Time-Preparation

Make sure you have set all the values to suit your needs in the configuration.php file in the admin directory before you continue!

^Configuration

This was introduced in v0.912. Open configuration.php directly after installation and set the correct values, especially for the language and the TSEP path - and of course every other value.

^Indexing your Site

To populate the database with the values the search engine will search, you have to run the file 'indexer.php' (in the admin directory) with '?index' added to the end in the web browser address bar, e.g. http://www.sitename.com/admin/indexer.php?index . Now enter the details requested by the form. After you submit the form with the appropriate values, you will see the results of the indexing after a few seconds. The script provides detailed information on the number of pages, the title, the URL, the size and the indexed words found by the indexing script. The entries you have made in the form are also saved to the database for later re-use / re-indexing.

^Using TSEP

^Searching

To run a search, open your search page or the page we prepared for you called 'tsepsearch.php' in your browser and input the words to be searched. The search words are not case sensitive.

TSEP supports boolean search if you have MySQL version 4 or higher. Below are some of the boolean search features. Important: your tables must be MyISAM tables for the boolean search to work (they should already be MyISAM, because that is how the installation creates them).

Search for the words ... to find pages that contain

apple banana
...at least one of these words.
+apple +juice
...both words.
+apple macintosh
...the word "apple", ranked higher if the page also contains "macintosh".
+apple -macintosh
...the word "apple" but not "macintosh".
+apple +(>pie <strudel)
..."apple" and "pie", or "apple" and "strudel" (in any order), with "apple pie" ranked higher than "apple strudel".
apple*
..."apple", "apples", "applesauce", or "applet".

This will be familiar to you if you know MySQL full-text search. There is also help available directly on the search page, next to the search button.
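
For readers who have not used MySQL full-text search before, the following sketch shows roughly what a boolean-mode query looks like. It is not TSEP's actual query: the function name, the table name and the column names are assumptions made for illustration, and the identifiers are interpolated as-is, so they must come from trusted code and never from user input.

```php
<?php
// Sketch only: builds a MySQL boolean full-text query with "?" placeholders
// for the search string. Table and column names are supplied by the caller
// and are NOT taken from the TSEP schema.
function build_boolean_search_sql($table, array $columns)
{
    $cols = implode(', ', $columns);

    return "SELECT *, MATCH ($cols) AGAINST (? IN BOOLEAN MODE) AS score"
         . " FROM $table"
         . " WHERE MATCH ($cols) AGAINST (? IN BOOLEAN MODE)"
         . " ORDER BY score DESC";
}

// Example: the user's input such as "+apple -macintosh" would be bound
// to both placeholders when the statement is executed.
// echo build_boolean_search_sql('pages', array('title', 'content'));
```

The columns named in MATCH() need a FULLTEXT index, which is why the table type matters.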

The minimum length of a search term is 4 characters; see MySQL restrictions below for details. (User-defined) stopwords are not marked in the results and are not used in the database query.
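
To illustrate the length restriction, here is a minimal sketch that separates search terms into those long enough to be searched and those that would be ignored. The function name and the handling of boolean operators are assumptions; this is not how TSEP itself processes the query. Note that strlen() counts bytes, so multi-byte characters would need mb_strlen() instead.

```php
<?php
// Sketch only: splits a query on whitespace and sorts the terms into
// "kept" (bare word has at least $minLength characters) and "ignored".
function split_search_terms($query, $minLength = 4)
{
    $kept = array();
    $ignored = array();

    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $term) {
        // Strip boolean operators before measuring the bare word.
        $word = trim($term, '+-<>()~*"');
        if (strlen($word) >= $minLength) {
            $kept[] = $term;
        } else {
            $ignored[] = $term;
        }
    }

    return array('kept' => $kept, 'ignored' => $ignored);
}
```

For the input "+apple pie -macintosh" the term "pie" ends up in the ignored list because it has only three characters.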

^Stopwords

You can add, update and delete your own stopwords.

Stopwords are words which will not be searched for on your pages. This means that when a stopword is used as a search term, it will not be marked as a search term in the results.

Stopwords are not case sensitive. This means that if you enter "Apple" in the stopwords section and a user searches for "apple", this word will be treated as a stopword.
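
The case-insensitive matching described above can be sketched as follows. The function name is hypothetical and the stopword list is passed in as a plain array; how TSEP stores and compares stopwords internally is not shown here. strtolower() only handles ASCII reliably, so stopwords with umlauts or accents would need mb_strtolower().

```php
<?php
// Sketch only: case-insensitive stopword lookup.
function is_stopword($term, array $stopwords)
{
    $normalized = array_map('strtolower', $stopwords);

    return in_array(strtolower($term), $normalized, true);
}
```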

Please note that there are MySQL restrictions as well!

^Logging

This was introduced in version 0.911. The administrator can define in the setup file whether and what search activity should be logged. All log entries are accompanied by a timestamp. The admin can decide to log the following: IP address, search term and clicks on the results.

The administrator may want to analyse what users are searching for on his site and make navigation to those points easier.

The administrator can also log the IP address of the person searching. Be aware that people might not like the idea of you "spying" on them. But we thought this might be a useful feature - maybe especially for Intranets. In those, if someone is totally lost the administrator can take him by his hand and help directly.

The administrator may want to notify the users if their actions are being logged, especially when logging their IP address.

For sorting the log entries by IP address, MySQL v3.23 or higher is required.
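
As an illustration of the logging options described in this section, the sketch below assembles a log entry that always carries a timestamp and only includes the IP address or the search term when the corresponding option is switched on. The function name and the setting keys 'log_ip' and 'log_term' are assumptions; they are not TSEP's real configuration names.

```php
<?php
// Sketch only: builds one log record according to hypothetical settings.
// $now is a Unix timestamp; gmdate() is used so the result does not
// depend on the server's time zone.
function build_log_entry(array $settings, $ip, $searchTerm, $now)
{
    $entry = array('timestamp' => gmdate('Y-m-d H:i:s', $now));

    if (!empty($settings['log_ip'])) {
        $entry['ip'] = $ip;
    }
    if (!empty($settings['log_term'])) {
        $entry['term'] = $searchTerm;
    }

    return $entry;
}
```

Leaving the IP address out of the record entirely, rather than storing an empty value, keeps the log free of data the administrator chose not to collect.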

^Frequently Asked Questions

^Restrictions

MySQL restrictions

  1. When you want to order the results in your logview.php by IP address, MySQL v3.23 or higher is needed.
  2. There are certain MySQL restrictions to a full text search:

Quote:

Any word that is too short is ignored. The default minimum length of words that will be found by full-text searches is four characters.

Quote:

Words in the stopword list are ignored. A stopword is a word such as ``the'' or ``some'' that is so common that it is considered to have zero semantic value. There is a built-in stopword list.

For more details, read the source pages of these quotes in the MySQL manual: 13.6 Full-Text Search Functions

The restrictions are covered in 13.6.3 Full-Text Restrictions

People with access to the MySQL server can, however, fine-tune MySQL to overcome these restrictions. You will find information about this in 13.6.4 Fine-Tuning MySQL Full-Text Search

You will find more on the built-in MySQL stopwords when you search the MySQL site for "stopword list".

Personally I do not see a big problem with the built-in stopwords: they are so general that probably no one who is really trying to find something will enter "you" as a search word. Searching is nothing new to people, so they will enter the words they think best match what they need, and those words are probably long enough not to fall under the length restriction. Also, the built-in stopwords are English words, and TSEP is now ready for other languages as well. (Olaf)

^'What Version am I running?'

The version of TSEP is included in the 'title' tag of the copyright notice. This means that you can move your cursor over the copyright notice (on the bottom of the search page for example) and after a little while your browser should display the version number.

The version number is read from a textfile in the include directory named tsepversion.txt. There is no need to change anything in this file: it is maintained by the programmers.
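
If you want to read the version yourself, for example to include it in a support request, something along these lines should work. This is a sketch under assumptions: the function name is hypothetical, and it presumes the file contains nothing but the version string.

```php
<?php
// Sketch only: returns the trimmed contents of the version file,
// or null if the file cannot be read.
function read_tsep_version($versionFile)
{
    if (!is_readable($versionFile)) {
        return null;
    }
    $contents = file_get_contents($versionFile);

    return $contents === false ? null : trim($contents);
}

// Example call; adjust the path to your installation.
// echo read_tsep_version('include/tsepversion.txt');
```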

^Creating a new language

If you decide to create a new language please mail us the language.php file which you created, so that we can add it to the next version.

Language files define the PHP variables which are being used in the TSEP files. Place the language.php into a subdirectory of the language directory. Let's say you are creating a Spanish version:

  1. Locate the language directory
  2. Create a new directory in the language directory called "es" (for español)
  3. Copy a language.php from another directory into the new "es" directory
  4. Start translating the strings in the language.php. The strings are located in quotes (") after the equals sign (=). Please use HTML encoding for any special characters, like ä, which should be &auml;. If you need quotes, please precede them with a \
  5. To use your newly created language, change the $tsep_config['config_Language'] in config.php to "es"
  6. After testing please mail us your language.php
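
The encoding and escaping rules from step 4 can be applied by hand, but a small helper makes them harder to get wrong. The sketch below is not part of TSEP and its name is made up. It assumes the input text is UTF-8 and contains no HTML markup, because htmlentities() would also convert characters such as < and &.

```php
<?php
// Sketch only: prepares a translated text for pasting between the
// double quotes of a language.php entry.
function prepare_language_string($text)
{
    // Convert special characters to HTML entities, e.g. ä becomes &auml;.
    // ENT_NOQUOTES leaves quotes alone so they can be escaped below.
    $encoded = htmlentities($text, ENT_NOQUOTES, 'UTF-8');

    // Precede every double quote with a backslash.
    return str_replace('"', '\\"', $encoded);
}

// "\xC3\xA4" is the UTF-8 byte sequence for ä.
// echo prepare_language_string("Sch\xC3\xA4tze \"hier\"");
```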

^How to delete / change an entry in the index

Some people asked how they can delete a word from the index or correct a word. In version 0.910 we introduced the possibility to do this right from TSEP. Follow the link to the "Index Editing" on the indexer.php page.

There are two other possibilities (older, but they still work):

  1. Simple: Just go to the indexer.php page of your site and call it again using the "?index" or on the top-right you can click on "create a new index". This will completely rewrite the index of your pages.
  2. More powerful: Use a tool to work directly with your database - like phpMyAdmin - and do whatever you wish with your index (do not blame us when you mess up your database).

^How can I change the filetypes TSEP indexes?

At this time you will have to change the code. We are planning to put this as a configuration possibility into the config.php file.

For now follow these steps to add more filetypes to TSEP to index. Please test your changes. You will not be able to index any binary data of course!

  1. Open the indexer.php file in your editor
  2. Search for
    ("(.)+\\.html$|htm$|php3$|php$|)",$entry))
  3. Change this line to suit your needs.
    Examples
    1. If you want TSEP to also index all files with the .txt extension the resulting string would be
      ("(.)+\\.html$|htm$|php3$|php$|txt$|)",$entry))
    2. If you want to exclude all .php and .php3 files from indexing:
      ("(.)+\\.html$|htm$|)",$entry))
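
The pattern strings above are quoted exactly as they appear in indexer.php so that you can find them with your editor's search. If you want to test which file names a given set of extensions would accept before touching the code, the following stand-alone sketch may help. It is not TSEP's implementation: the function name is hypothetical, it uses preg_match(), and unlike the original pattern it compares extensions case-insensitively.

```php
<?php
// Sketch only: returns true if $filename ends in one of the given
// extensions (given without the leading dot).
function has_indexable_extension($filename, array $extensions)
{
    if (count($extensions) === 0) {
        return false;
    }
    $quoted = array();
    foreach ($extensions as $ext) {
        $quoted[] = preg_quote($ext, '/');
    }
    $pattern = '/\.(' . implode('|', $quoted) . ')$/i';

    return preg_match($pattern, $filename) === 1;
}

// Example: the default types plus .txt
// var_dump(has_indexable_extension('notes.txt', array('html', 'htm', 'php3', 'php', 'txt')));
```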

^What Does the "rank" of the pages mean?

Rank means that all pages are shown ordered by the total number of hits they received for all search words. Example: You get 2 results after a search; on the page with rank 1 the search words were found more often than on the page with rank 2. This is simple but very useful if you have many pages on your site and the user might face lots of results.
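
The idea behind the rank can be sketched in a few lines. This is an illustration of ordering by hit count only, not TSEP's actual ranking code; the function name and the input format (URL mapped to page text) are assumptions.

```php
<?php
// Sketch only: counts how often the search words occur in each page
// (case-insensitively) and returns URL => hit count, highest first.
function rank_pages_by_hits(array $pages, array $searchWords)
{
    $hits = array();
    foreach ($pages as $url => $text) {
        $count = 0;
        foreach ($searchWords as $word) {
            if ($word === '') {
                continue; // substr_count() rejects an empty needle
            }
            $count += substr_count(strtolower($text), strtolower($word));
        }
        $hits[$url] = $count;
    }
    arsort($hits); // sort by hit count, descending, keeping the URLs as keys

    return $hits;
}
```

The first key of the returned array corresponds to rank 1, the second to rank 2, and so on.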

^How can I change the look of TSEP to fit it best into my own layout?

This is simple but takes a little while. To make things as easy as we can, we will take a look at the result page step by step. The formatting we show you here is from version 0.911. It might change in future but will still be pretty much the same.

Please note that there are additional div-blocks in the search page. Those are only shown when errors occur (a stopword was searched, the MySQL version is too low...). Therefore we leave it up to you for now to look deeply into this formatting, and for the general user's sake we stick with what most people will see.

If you have done some nice formatting we would appreciate it if you could contact us and send us your CSS file so that we could include it in a new TSEP version.

All TSEP output, on all TSEP pages, is wrapped in the following div container, which provides a global area for TSEP.

[Image: div class tsepProject]

With this knowledge already you can change the look very much, for example setting the .tsepProject class in the tsep.css file to another font. This will change all fonts in the TSEP area to whatever you define.

Now that you know the header, let's look at the next part of the search page: the .SearchBlock, which contains the search form fields and the help - which, as you can see, has its own div container .SearchHintsHelp .

[Image: searchblock with div tags]

This SearchBlock is followed by another .SearchBlock which provides status information. This whole block is repeated at the bottom of all search results. If you know a little about CSS you should be able to format this block to fit your needs.

[Image: search status output with div tags]

This first container of this type is followed by our search results. Here we use the following classes:

.SearchResultAllPagesBlock - this is the block of all the results.

.SearchResultOnePageBlock - this is a block of one resulting page.

.SearchResultOnePageTitle - this is the title of the webpage we found in the database.

.resultnumber - this is the rank of the page. (details: rank).

.SearchResultPageRank - displays how many times the page had a hit from the searchwords.

.SearchResultOutput - these are the words which we indexed - until we encounter the first "explode" character (a . (dot) right now).

.foundSearchWord - this is one of the words the user has searched for. It can be styled specially so that the user spots it faster.

.SearchResultOutputMore - these are the little dots which show the user there is more on the page.

.SearchResultURL - is the URL of the page we have found, extended by the size of the page (as written in the database).

[Image: search results and div tags used]
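
To show how some of these classes relate to each other, here is a sketch that produces the kind of markup described for .SearchResultOutput, .foundSearchWord and .SearchResultOutputMore. It is an illustration only: the function name is made up, and the choice of div and span elements is an assumption, so check the HTML that your TSEP version actually emits before writing CSS against it.

```php
<?php
// Sketch only: cuts the indexed text at the first dot, marks the search
// words, and appends a "more" marker when text was cut off.
function build_result_snippet($indexedText, array $searchWords)
{
    $parts = explode('.', $indexedText, 2);
    $snippet = htmlspecialchars($parts[0]);

    // Combine all words into one pattern so that a later word can never
    // match inside markup inserted for an earlier one.
    $quoted = array();
    foreach ($searchWords as $word) {
        if ($word !== '') {
            $quoted[] = preg_quote(htmlspecialchars($word), '/');
        }
    }
    if (count($quoted) > 0) {
        $snippet = preg_replace(
            '/(' . implode('|', $quoted) . ')/i',
            '<span class="foundSearchWord">$1</span>',
            $snippet
        );
    }
    if (count($parts) > 1) {
        $snippet .= '<span class="SearchResultOutputMore">...</span>';
    }

    return '<div class="SearchResultOutput">' . $snippet . '</div>';
}
```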

^I get an error with "set_time_limit()"

"Warning: set_time_limit(): Cannot set time limit in safe mode in /..../tsep/admin/indexer.php on line 110"

This is nothing really important. It shows only in the admin area. The warning occurs when safe mode is switched on on the server. No other problems concerning safe mode are known at this time.
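
If the warning bothers you, the usual way to avoid it is to call set_time_limit() only when PHP allows it. The sketch below shows that guard in isolation; the function name is hypothetical, and this is not the code found in indexer.php. On PHP versions without safe mode, ini_get('safe_mode') simply returns false.

```php
<?php
// Sketch only: tries to raise the time limit and reports whether it worked,
// skipping the call when it is unavailable or safe mode is on.
function try_extend_time_limit($seconds)
{
    if (!function_exists('set_time_limit')) {
        return false; // the function may be disabled by the host
    }
    if (ini_get('safe_mode')) {
        return false; // safe mode forbids changing the limit
    }

    return set_time_limit($seconds);
}
```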

^Known Problems

^MySQL

You might run into problems with MySQL v3.23 or lower. If you are running such a version we would be happy to hear whether TSEP is working for you and what kind of problems you have encountered.

Problem

It seems that with MySQL 5 alpha there are problems concerning the indexer.php. We will assume for now that this is an issue with the new MySQL version.

Possible solution

Try populating the $db_table_prefix.config table by hand (using phpMyAdmin for example) with the values you find in the SQL dump.

^When trying the search I am getting an error

Symptom

Warning: array_multisort(): Array sizes are inconsistent in /srv/www/htdocs/blabla/php/tsepsearch/search.php on line 410

You will notice that the results are not sorted correctly.

Possible reason and solution

Some entries you made in the indexer (when creating a new index) are wrong. Please check and maybe index your site again. You can also look for indexed pages with zero (0) words in the index.
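
The warning itself comes from PHP: array_multisort() requires all arrays it sorts together to have the same number of elements (recent PHP versions throw an error instead of warning). The sketch below demonstrates the condition with two parallel arrays. The function name and the array contents are assumptions for illustration and do not reflect the variables used in search.php.

```php
<?php
// Sketch only: sorts two parallel arrays by hit count, highest first,
// and returns false instead of sorting when their sizes differ.
function sort_results_by_hits(array $hits, array $urls)
{
    if (count($hits) !== count($urls)) {
        return false; // this is the situation that triggers the warning
    }
    array_multisort($hits, SORT_DESC, SORT_NUMERIC, $urls);

    return array('hits' => $hits, 'urls' => $urls);
}
```

A size mismatch of this kind is consistent with the advice above: a page that was indexed with zero words would contribute to one list but not the other.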

^Last notes

This software has been tested on Windows and Linux systems with Apache as web server running PHP v4.2 or greater and MySQL (v4 or greater for boolean capabilities). The 'allow_url_fopen' option should be enabled in PHP.
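
A quick way to check the PHP side of these requirements on your server is sketched below. The function name is hypothetical and the check is not part of TSEP; it only looks at the PHP version and the allow_url_fopen setting and does not test MySQL. It uses filter_var(), which is only available in newer PHP versions than the minimum named above.

```php
<?php
// Sketch only: reports whether the PHP requirements named above are met.
function check_tsep_php_requirements()
{
    return array(
        'php_version_ok'  => version_compare(PHP_VERSION, '4.2.0', '>='),
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
    );
}

// Example: print the result in a browser or on the command line.
// print_r(check_tsep_php_requirements());
```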

^Credits & Contact

Please mail us any suggestions or questions you may have or post them to the Sourceforge forums. We welcome any response. If you need help or "something does not work" please include the version number of TSEP you are using.

Software by: Olaf Noehring (main development since 0.9beta (excluding)) and Girish R (main development until 0.9beta (including))
Version: TSEP 0.9nnn
This file has been last modified on: 2004-09-01 9:38 AM by Olaf Noehring
Copyright (c) 2002-2004, Girish R & Olaf Noehring. All Rights Reserved.
Support & Info (Summary on Sourceforge): http://sourceforge.net/projects/tsep/
Contact: Olaf Noehring (email on website: http://www.team-noehring.de) or Girish R at: girishr at gmail.com with your comments, suggestions, enquiries or requirements.

This file is part of TSEP (The Search Engine Project)

^Licensing agreement

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA