TSEP - The Search Engine Project

Author's Note

TSEP (The Search Engine Project) has been in development for several months now. Girish started this project because a site of approximately 150 pages needed a search engine. He went through the open source search engine software available at the time, but was unable to find any that was easy for a webmaster to understand and set up. Since then he has built and improved this software and will continue to do so. Olaf joined at version 0.9 beta and has done the (main) development and documentation since then.
The primary objective of this software is 'ease of use'.
By submitting this software to the open source community we can all hope that it will one day become the most powerful personal site search engine in the world.
If you still think that this software is difficult to set up or use, please let us know. We will personally help you, and your feedback will make us aware of the parts that are too complex.

We think that we have made the copyright notice nice and small enough - even for your site. Please do not remove it and leave it visible at all times, so that hopefully others will discover what a great tool TSEP is.

We are very interested in where TSEP is being used. We would really appreciate it if you would contact us with the web address so that we can take a look (and grab a screenshot) or, if it is an intranet, if you could send us a screenshot.

A word about our versioning: We publish a new version when we think it's ready. A version number does not tell you much about the changes we have made since the previous version. We might add only 0.001 to a version number and still have made huge changes to TSEP. In one sentence: we think every new version is worth downloading, because a change in the version number indicates that something has happened to TSEP.

^The Search Engine Project Installation

Dear user,
Thank you for downloading TSEP (The Search Engine Project). We hope this manual will help you install TSEP on your site in minutes! We know how frustrating it is to go through a lengthy manual and end up understanding nothing, so we have kept everything short and simple. If you experience problems during installation, don't hesitate to contact us. We would be pleased to help you.

^Upgrading from a previous version (first-time users: skip to Installation)

General:

Please make sure you are reading the right instructions (for your version).

^upgrading from 0.911

Follow the complete installation process below. Be aware that you will lose the stopwords you entered!

Make sure to open configuration.php in your browser before you index your site!

Run the SQL query, because we added configuration values to the internal table.

If you want to keep your stopwords, open TSEP_mysql_table_dump.txt in your editor and search for # Tabellenstruktur für Tabelle `$db_table_prefix.stopwords` (German for "Table structure for table").
Delete this line and everything after it from the file before running it through MySQL. Everything should then be fine.

^upgrading from 0.910

Follow the complete installation process below. Be aware that you will lose the stopwords you entered!

If you want to keep your stopwords, open TSEP_mysql_table_dump.txt in your editor and search for # Tabellenstruktur für Tabelle `$db_table_prefix.stopwords` (German for "Table structure for table").
Delete this line and everything after it from the file before running it through MySQL. Everything should then be fine.

^upgrading from 0.909

Follow the complete installation process below. Be aware that you will lose the stopwords you entered!

^upgrading from below 0.909

Follow the complete installation process below.

^Installation

  1. Unzip the downloaded TSEP files. Make sure that the folder structure is preserved.
  2. Open the config.php file in the include directory and set the values which are appropriate for your database.
    Changes: In version 0.912 we introduced a browser-based configuration with the configuration.php file. Use this file later to configure all other values!
  3. Upload all files to your website. Again, make sure that the folder structure is preserved.
  4. Assuming that the MySQL server is running, create a database for the search engine or change to an existing one. Copy the contents of TSEP_mysql_table_dump.txt, paste them at the MySQL prompt, and run the query.

The query creates the required tables with appropriate starting values. The database model that will be created is shown below.

Database model for TSEP

  5. Open configuration.php in the admin directory and make all the changes you need.
  6. Decide whether you want to
    1. use the pre-made search page, or
    2. include TSEP in your own layout.

    1. To use the pre-made search page 'tsepsearch.php': you are done with the installation now. Skip the next paragraph and continue with "Security".
    2. To include the search in your own page / layout, follow these 2 simple steps. Make sure that the paths to the two files are set correctly!
      1. Make sure your page is processed as a PHP page; usually this is the case when your page has the extension .php or .php3. Add the following code between the <head> and </head> tags of your SEARCH page, otherwise the search WILL look very ugly! This example assumes that search.php is located in the same directory as your search page and that tsep.css is in a subdirectory called "css".
        <link href="css/tsep.css" rel="stylesheet" type="text/css" />
      2. Add search.php to your page at the position you wish, like this:
        <?php require ('search.php'); ?>

      You are DONE now: you have integrated TSEP into your own page. Read Security now.

^Security

For security you might want to protect the include and admin directories using .htaccess!

^First-Time-Preparation

Make sure you have set all the values you need in the configuration.php file in the admin directory before you continue!

^Configuration

This was introduced in 0.912.
As mentioned before, you must open configuration.php directly after installing everything. Set the correct values, especially for the language and the TSEP path, and of course for every other setting.

^Indexing your Site

To populate the database with the values the search engine searches, run the file 'indexer.php' (in the admin directory) with '?index' appended in the browser address bar, e.g. http://www.sitename.com/admin/indexer.php?index . Fill in the details requested by the form. After you submit the form with the appropriate values, the results of the indexing appear after a few seconds. The script reports detailed information on the number of pages and, for each page, the title, the URL, the size, and the indexed words found. The entries you made in the form are also saved to the database for later re-use / re-indexing.

^Using TSEP

^Searching

Now it's time to run a search. Open your search page, or the page we prepared for you called 'tsepsearch.php', in your browser and enter the words to search for. The search words are not case sensitive.

TSEP supports boolean search if you have MySQL version 4 or higher. Below are some of the boolean search features.

Search for the words

apple banana
...finds pages that contain at least one of these words.
+apple +juice
...finds pages that contain both words.
+apple macintosh
...finds pages that contain the word "apple", but ranks them higher if they also contain "macintosh".
+apple -macintosh
...finds pages that contain the word "apple" but not "macintosh".
+apple +(>pie <strudel)
...finds pages that contain "apple" and "pie", or "apple" and "strudel" (in any order), but ranks "apple pie" higher than "apple strudel".
apple*
...finds pages that contain words such as "apple", "apples", "applesauce", and "applet".

This will be familiar if you know MySQL full-text search. There is also help available directly on the search page (next to the search button).
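For readers who have not used MySQL full-text search, here is a minimal sketch of how a boolean search string like the ones above is typically passed to MySQL. The function name, table name, and column names are made up for illustration and are not TSEP's actual schema or code. The "?" is a placeholder for the user's search string, to be bound through a prepared statement.

```php
<?php
// Illustrative sketch only: names are hypothetical, not taken from TSEP.
function example_boolean_search_sql(string $table, array $columns): string
{
    // MATCH ... AGAINST (... IN BOOLEAN MODE) is MySQL's boolean full-text syntax.
    $columnList = implode(', ', $columns);
    return "SELECT * FROM {$table} WHERE MATCH ({$columnList}) AGAINST (? IN BOOLEAN MODE)";
}

// The user's input, e.g. "+apple -macintosh", would be bound to the "?" placeholder.
$sql = example_boolean_search_sql('example_pages', ['page_title', 'page_text']);
echo $sql, "\n";
```

Binding the search string as a parameter, rather than pasting it into the SQL text, keeps operators such as + and - intact while avoiding quoting problems.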

The minimum length of a search word is 4 characters; see MySQL restrictions below for details. (User-defined) stopwords are not marked in the results and are not used in the database query.

^Stopwords

You can add, update and delete your own stopwords.

Stopwords are words which will not be searched for on your pages. If a user searches for a stopword, the word is not taken into account and is marked as a stopword in the area that displays the words the user searched for.

Stopwords are not case sensitive. This means that if you enter "Apple" in the stopwords section and a user searches for "apple", the word is recognized as a valid stopword. It is therefore not searched for and not marked in the results.

Please note that there are MySQL restrictions as well!
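The stopword and minimum length behaviour described above can be sketched roughly as follows. This is not TSEP's code: the function name, the array keys, and the idea of sorting words into three groups are assumptions made for illustration. The default of 4 mirrors the MySQL minimum word length mentioned earlier.

```php
<?php
// Illustrative sketch only: names and structure are hypothetical, not TSEP's code.
function example_split_search_words(string $query, array $stopwords, int $minLength = 4): array
{
    // Compare in lower case so that "Apple" and "apple" are treated as the same stopword.
    $stopwordsLower = array_map('strtolower', $stopwords);
    $result = ['searched' => [], 'stopwords' => [], 'too_short' => []];

    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (in_array(strtolower($word), $stopwordsLower, true)) {
            $result['stopwords'][] = $word;   // shown to the user, but not searched
        } elseif (strlen($word) < $minLength) {
            $result['too_short'][] = $word;   // strlen() counts bytes, which is fine for ASCII words
        } else {
            $result['searched'][] = $word;    // would go into the database query
        }
    }
    return $result;
}

$parts = example_split_search_words('apple Juice pie', ['Apple']);
print_r($parts);
```

With the stopword list containing "Apple", the word "apple" lands in the stopwords group, "Juice" is searched, and "pie" is dropped as too short.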

^Logging

Logging was first introduced in version 0.911. The admin can define in the setup file whether anything should be logged at all and, if so, what. Whenever something is written to the log, the time is written as well. The admin can decide to log the following: IP address, search term, and clicks on the results.
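As a rough illustration of the rule that the time is always recorded together with whatever else is logged, consider the sketch below. The setting names (log_ip, log_search) and the function are hypothetical and do not correspond to TSEP's configuration keys. Click logging is left out to keep the example short.

```php
<?php
// Illustrative sketch only: setting names and function are hypothetical, not TSEP's code.
function example_build_log_entry(array $settings, string $ip, string $searchTerm, int $timestamp): array
{
    $entry = [];
    if (!empty($settings['log_ip'])) {
        $entry['ip'] = $ip;
    }
    if (!empty($settings['log_search'])) {
        $entry['search'] = $searchTerm;
    }
    if ($entry === []) {
        return []; // nothing is enabled, so nothing is logged
    }
    // Whenever something is logged, the time is recorded as well.
    return ['time' => gmdate('Y-m-d H:i:s', $timestamp)] + $entry;
}

$entry = example_build_log_entry(['log_ip' => false, 'log_search' => true], '192.0.2.10', 'apple pie', time());
print_r($entry);
```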

^Why logging?

We thought that when the administrator knows what people are searching for on the site, he might make navigation to those pages easier or put even more effort into the design and updating of those pages... We can probably find many more good reasons.

You can also log the IP address of the person searching. Be aware that people might not like the idea of you "spying" on them. But we thought this might be a useful feature, maybe especially for intranets. There, if someone is totally lost, the administrator can take them by the hand and help directly.

You might want to notify your users if you are logging their actions, especially if you are logging their IP address.

For sorting the log by IP addresses, a MySQL version >3.23 is needed!

^FAQ - Frequently Asked Questions

^MySQL restrictions

When you want to order the results in your logview.php by the IP address, a MySQL version >3.23 is needed.

There are certain MySQL restrictions to a full text search:

Quote:

Any word that is too short is ignored. The default minimum length of words that will be found by full-text searches is four characters.

Quote:

Words in the stopword list are ignored. A stopword is a word such as "the" or "some" that is so common that it is considered to have zero semantic value. There is a built-in stopword list.

For more details, see the source of these quotes in the MySQL manual: 13.6 Full-Text Search Functions

The restrictions are covered in 13.6.3 Full-Text Restrictions

People with access to the MySQL server, however, can fine-tune MySQL to overcome these restrictions. You will find information about this in 13.6.4 Fine-Tuning MySQL Full-Text Search

You will find more on the built-in MySQL stopwords by searching the MySQL site for "stopword list".

Personally I do not see a big problem with the built-in stopwords, because they are so general that probably no one who is really trying to find something will enter "you" as a search word. Searching is nothing new to people, so they will enter the words they think best match what they need. It also means that the words they enter will probably be long enough not to fall under the length restriction. Also, the built-in stopwords are English words, and TSEP is now ready for other languages as well. (Olaf)

^'What Version am I running?'

The version you are running is written into the 'title' attribute of the copyright notice. (Please remember that we ask you to leave this notice visible.) This means that you can move your mouse cursor over the copyright notice (at the bottom of the search page, for example) and after a little while your browser should display the text we provide in the 'title' attribute.

This information (the version number) is read from a simple textfile in the include directory named tsepversion.txt. There is no need for you to change anything in this file yourself. It is frequently updated by the programmers.
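To make the mechanism concrete, here is a small sketch of reading a version string from a text file and placing it in a 'title' attribute. Only the file name tsepversion.txt comes from this manual. The function name and the markup are hypothetical, and a temporary file stands in for include/tsepversion.txt so that the example runs on its own.

```php
<?php
// Illustrative sketch only: function name and markup are hypothetical, not TSEP's code.
function example_version_title(string $versionFile): string
{
    // Fall back to "unknown" if the file is missing or unreadable.
    $version = is_readable($versionFile) ? trim((string) file_get_contents($versionFile)) : 'unknown';
    return 'TSEP ' . htmlspecialchars($version);
}

// Demonstration with a temporary file standing in for include/tsepversion.txt.
$tmp = tempnam(sys_get_temp_dir(), 'tsep');
file_put_contents($tmp, "0.9nnn\n");
echo '<span title="' . example_version_title($tmp) . '">copyright notice</span>', "\n";
unlink($tmp);
```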

^Creating a new language

If you decide to create a new language please mail us the language.php file which you created, so that we can add it to the next version.

Language files are quite simple. They define PHP variables that are used in the TSEP files. Place the language.php in a subdirectory of the language directory. Let's say you are creating a Spanish version:

  1. Locate the language directory.
  2. Create a new directory in the language directory called "es" (for español).
  3. Copy a language.php from another directory into the new "es" directory.
  4. Start translating the strings in the language.php. The strings are enclosed in quotes (") after the equals sign (=). Please HTML-encode any special characters, for example ä should be written as &auml;. If you need quotes inside a string, please precede them with a \
  5. To use your newly created language, change the $tsep_config['config_Language'] in config.php to "es".
  6. After testing, please mail us your language.php.
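The steps above might produce a file along the following lines. This is only a sketch of the pattern: the variable names are invented for illustration, so copy the real names from an existing language.php instead of using these.

```php
<?php
// language/es/language.php (sketch). The variable names are hypothetical;
// take the real ones from the language.php you copied.
$example_lang_search_button = "Buscar";

// Special characters are HTML encoded: ñ is written as &ntilde;
$example_lang_language_name = "Espa&ntilde;ol";

// Quotes inside a string are preceded by a backslash.
$example_lang_hint = "Pulse \"Buscar\" para empezar.";

echo $example_lang_hint, "\n";
```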

^How to delete / change an entry in the index

Some people asked how they can delete a word from the index or correct a word. In version 0.910 we introduced the possibility to do this right from TSEP. Follow the link to "Index Editing" on the indexer.php page.

But there are 2 other possibilities (older, but they still work):

  1. Simple: Just go to the indexer.php page of your site and call it again using "?index", or click on "create a new index" at the top right. This will completely rewrite the index of your pages.
  2. More powerful: Use a tool that works directly with your database, like phpMyAdmin, and do whatever you wish with your index (do not blame us when you mess up your database).

^How can I change the filetypes TSEP indexes?

At this time you will have to change the code. We are planning to make this a configuration option in the config.php file.

For now, follow the steps below to add more filetypes for TSEP to index. Please test whether your changes work. You will not be able to index any binary data, of course!

  1. Open the indexer.php file in your editor
  2. search for
    ("(.)+\\.html$|htm$|php3$|php$|)",$entry))
  3. Change this line to suit your needs.
    Examples
    1. If you want TSEP to also index all files with the .txt extension the resulting string would be
      ("(.)+\\.html$|htm$|php3$|php$|txt$|)",$entry))
    2. You want to exclude all .php files from indexing:
      ("(.)+\\.html$|htm$|)",$entry))
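If you prefer to experiment outside of indexer.php first, a standalone extension check could look like the sketch below. It does not reproduce the expression used in indexer.php. It is an independent example using preg_match, with a hypothetical function name, in which the alternatives are grouped so that the leading dot and the end-of-name anchor apply to every extension.

```php
<?php
// Illustrative sketch only: not the pattern or matching call used in TSEP's indexer.php.
function example_is_indexable(string $filename, array $extensions): bool
{
    // Escape each extension, then group the alternatives so that "\." and "$" apply to all of them.
    $alternatives = implode('|', array_map(function ($ext) {
        return preg_quote($ext, '/');
    }, $extensions));

    // The "i" modifier makes the check case insensitive (.TXT matches as well as .txt).
    return preg_match('/\.(?:' . $alternatives . ')$/i', $filename) === 1;
}

$allowed = ['html', 'htm', 'php3', 'php', 'txt'];
var_dump(example_is_indexable('index.html', $allowed)); // bool(true)
var_dump(example_is_indexable('logo.gif', $allowed));   // bool(false)
```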

^What does the "rank" of the pages mean?

Rank means that all pages are shown ordered by the number of hits they received from all search words. Example: You get 2 results after a search. On the page with rank 1 the search words were found more often than on the page with rank 2. This is simple but very useful if you have many pages on your site and the user might face lots of results.
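The idea can be sketched in a few lines. This is not TSEP's ranking code (TSEP queries its MySQL index). The function name and the page data below are made up, and a plain substring count stands in for "hits".

```php
<?php
// Illustrative sketch only: hypothetical function and data, not TSEP's ranking code.
function example_rank_pages(array $pages, array $searchWords): array
{
    $hits = [];
    foreach ($pages as $url => $text) {
        $count = 0;
        foreach ($searchWords as $word) {
            // Case-insensitive count of how often the search word occurs in the page text.
            $count += substr_count(strtolower($text), strtolower($word));
        }
        if ($count > 0) {
            $hits[$url] = $count;
        }
    }
    arsort($hits); // highest hit count first; the first entry is rank 1
    return $hits;
}

$ranked = example_rank_pages(
    ['a.html' => 'Apple pie recipe', 'b.html' => 'apple juice, apple sauce and apple cake', 'c.html' => 'bananas'],
    ['apple']
);
print_r($ranked);
```

Here b.html gets rank 1 with three hits, a.html gets rank 2 with one hit, and c.html is not listed at all.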

^How can I change the look of TSEP to fit it best in my own layout?

This is simple but takes a little while. To make things as easy for you as we can, we will look at the result page step by step. The formatting we show you here is from version 0.911. It might change in future but will stay pretty much the same.

Please note that there are additional div blocks in the search page. Those are only shown when errors occur (a stopword was searched, the MySQL version is too low...). Therefore we leave it up to you for now to look more deeply into that formatting, and for the general user's sake we stick with what most people will see.

If you have done some nice formatting, we would appreciate it if you could contact us and send us your CSS file so that we can include it in a new TSEP version.

Everything TSEP outputs, on all TSEP pages, is wrapped in the following div container, which provides a global area for TSEP.

div class tsepProject

With this knowledge alone you can already change the look considerably, for example by setting the .tsepProject class in the tsep.css file to another font. This will change all fonts in the TSEP area to whatever you define.

OK, now that you know the header, let's look at the next part of the search page: the .SearchBlock, which contains the search form fields and the help. As you can see, the help has its own div container, .SearchHintsHelp .

searchblock with div tags

This SearchBlock is followed by another .SearchBlock which provides status information. This whole block is repeated at the bottom of all search results. If you know a little about CSS you should be able to format this block to fit your needs.

search status output with div tags

This first container of this type is followed by our search results. Here we use the following classes:

.SearchResultAllPagesBlock - this is the block of all the results

.SearchResultOnePageBlock - this is a block of one resulting page

.SearchResultOnePageTitle - this is the title of the webpage we found in the database

.resultnumber - this is the rank of the page. (details: rank)

.SearchResultPageRank - displays how many times the page had a hit from the search words.

.SearchResultOutput - these are the words which we indexed, up to the first "explode" character (a . (dot) right now)

.foundSearchWord - this is one of the words the user has searched for. We can highlight it so that the user spots it faster.

.SearchResultOutputMore - these are the little dots which show the user there is more on the page

.SearchResultURL - is the URL of the page we have found, extended by the size of the page (as written in the database).

search results and div tags used

^Known Problems

^MySQL

You might run into problems with old (<3.23) versions of MySQL. If you are running such an old version, we would be happy to hear whether TSEP is working for you and, if not, what kind of problems you encounter.

Problem:

It seems that with MySQL 5 alpha there are problems concerning indexer.php. We will assume for now that this is an issue with the new MySQL version.

Possible solution:

Try filling the $db_table_prefix.config table by hand (using phpMyAdmin, for example) with the values you find in the SQL dump.

^Last notes

This software has been tested on Windows and Linux systems with Apache as the server, running PHP 4.2 or greater and MySQL (4 or greater for boolean capabilities). The 'allow_url_fopen' option should be enabled for PHP.

^Credits & Contact

Software by : Olaf Noehring (main development since 0.9beta (excluding)) and Girish R (main development until 0.9beta (including))
Version : TSEP 0.9nnn
This file has been last modified on: 2004-07-19 11:47 AM by Olaf Noehring
Copyright (c) 2002-2004, Girish R & Olaf Noehring. All Rights Reserved.
Support & Info (Summary on Sourceforge): http://sourceforge.net/projects/tsep/
Contact: Olaf Noehring (email on website: http://www.team-noehring.de) or Girish R at: girishr at gmail.com with your comments, suggestions, enquiries or requirements.

This file is part of TSEP (The Search Engine Project)

^License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA