<Каталог сайтов - Найдётся всё yandex-rambler.ru Приглашение!

Искать на Озоне

Chapter 10 -- Databases

Chapter 10 Databases

Database Interfaces
- Front End CGIs
Web Indexing
- Indexers
- Search Engines
Large Scale Databases
Summary

Databases have been intimately connected with the World Wide Web and CGI ever since the inception of the Web. In fact, the Web itself is an immense worldwide database, a collection of data and resources accessible at the click of a mouse.

On a more mundane level, interaction with server-side databases is one of the most natural applications of CGI. The end user can submit a query through a form and have the results displayed directly back to his or her browser.

Because real-time interaction is not usually needed for database interaction, one of the major drawbacks of CGI (lack of persistent connection) is avoided. In addition, because only the results of the query are sent to the client, the size of the database does not factor greatly in the speed of the transaction. (Anyone who has used one of the Web searchers such as AltaVista or Lycos can attest to this.)

Although any kind of database can be accessed via CGI, there are a few types of databases that have become very popular on the Web:

Small Text Databases-These are the easiest to create and are useful for many small- to medium-sized tasks. These databases are simply ASCII files with delimited or fixed-length records. Address books, schedules, and other such databases with limited number of entries are suited well for this method.
Databases of the Web-A natural candidate for Web databases is the Web itself. Services that catalog large portions of the Web (along with Usenet, Gopher, and so on) are popping up with great frequency, as covered elsewhere in this book. This can also be applied to intranets and single machines. Indexing software like freeWais or Ice can create a database of an entire site, which can then be accessed through CGI programs.
Large Database Servers-Large databases are generally stored in a database server of some sort (even if the server is also a client, such as most common Windows-based databases). As long as the server has some method of interface with other programs, the information in these databases can be accessed from a CGI program.

In practice, Web-based databases can be some combination of these. Each database has its own needs, and CGI programs must often be customized to suit the needs of your particular database.

Database Interfaces

To effectively and seamlessly merge your database with the Web, CGIs must be used in both the front and back end of the database interaction.

Front End CGIs

The first thing that must be considered is how the user is going to enter queries to the database. An HTML form is the most common way for the user to submit information, although there are other ways. As an example, consider an interface to an address book. A simple form could look like this:

<HTML><HEAD><TITLE>My Address Book</title></head> <BODY> <H2>Welcome to my address book</h2> To find addresses that match a certain category, fill in that category and then press 'submit'. <FORM ACTION="address.cgi" METHOD="POST"> Name: <INPUT SIZE=45 name="name"><br> Phone: <INPUT SIZE=45 name="phone"><br> Street Address: <INPUT SIZE=45 name="street"><BR> City: <INPUT SIZE=20 name="city"> State: <INPUT SIZE=3 name="state"> Zip: <INPUT SIZE=6 name="zip"><br> <INPUT TYPE=SUBMIT Value=" Submit Query "> <INPUT TYPE=RESET Value=" Reset Form "><br> </body></html>

This form calls the CGI program address.cgi. This is the front end to the database interaction. The purpose of the front end is to collect the data from the form, parse it, and somehow pass the query to the database. If the database is an ASCII text file, the front end is also the middle end and the back end. It must do the searching, interpret the data, and then pass the results back to client. For database servers (including Web indexes like freeWais and Ice), the front end must put the query into a form that the server will understand, and then pass the query to the server. A back end CGI program must then retrieve the results and pass them to the user. Very often in this case, the front and back ends are contained in the same program. On systems that support it (UNIX, Amiga, and others), this can be accomplished with process forking. In the Windows environment, special applications that take advantage of OLE or some other type of inter-application communication is necessary.

To go back to the address book example, we can now construct the CGI program that will answer the client's request.

First, we need to know the format of the database itself. A small text-based format is sufficient for our needs. We'll use delimited records, although fixed-length records would also work. An example record follows:

0:Elmer J. Fudd:555-1234:42 Jones Lane:Chuckville:CA:90210

This format will be familiar to anyone who has ever seen a UNIX password file. There are two drawbacks to this format. The total of all fields cannot exceed any line length limitations on whatever system you are using (in our case, this should not be a problem). Also, our delimiter (a colon) should not appear in any field, or it will look like the start of a new field. In an ideal world, this should not be a problem for us (unless someone lived in "New York: The City"). But in reality, people make typos or are just plain malicious. Therefore, we must be aware of this potential problem.

Now that we know the form of the database, we can begin the CGI program to gather the information from the form. Any language can be used to write CGIs, but in this example, we'll use Perl for its text-handling capabilities.

#!/bin/perl require cgi_head; # Get form data and print header.

Tip

In all Perl CGIs in this chapter, a module called cgi_head.pm is used to gather the information from the form and print the required HTML header. This module places a form entry with name 'foo' into an associative array entry with name $FORM{'foo'}. There are several freely available programs for several languages to accomplish this, including CGI.pm for Perl at

http://www.perl.com/perl/CPAN/

Now that form data has been read in, we must read in the database itself. Because we are using a delimited database, it is easiest to read in the entire database. A fixed-length field database would enable us to move through the database without reading the entire thing, but that method has its own drawbacks (the most obvious being that the records must not exceed a fixed length). We read in the database as a flat ASCII file and parse it line by line using the handy Perl while(<FILEHANDLE>)<> construct.

Caution

This example required Perl 5.001 or above because of its use of references that were not included in Perl 4 (or earlier). Perl 5 contains many enhancements and new features and is a must for any Perlphile. It is available at

http://www.perl.com/perl/CPAN/

# First, open the database. (Which is called 'database.txt' here.) open (DAT, "database.txt") || die "Can't open the database: $! !.\n"; $maxn=0; # A counter for the number of entries. while (<DAT>) { chop; @field=split(/:/); # Split the line into the data fields. $n=$field[0]; # First field is an id number $add[$n]{'name'}=$field[1]; # Then the name $add[$n]{'phone'}=$field[2]; # The phone number $add[$n]{'street'}=$field[3]; # The street address $add[$n]{'city'}=$field[4]; # The city $add[$n]{'state'}=$field[5]; # The state $add[$n]{'zip'}=$field[6]; # The Zip Code } $maxn=$n # Set the max number to the last entry

Now that the database has been loaded, we need to compare the user's query with the data, as shown in Listing 10.1.

Listing 10.1. Searching an ASCII database.

@results=(); # Zero out an array to hold the results. if ($name=$FORM{'name'}) { # If the client wanted to search a name, for ($I=0; $I <=$maxn; $I++) { # Go through each entry if ($name eq $add[$I]{'name'}) { # Looking for a match. push(@results,$I); # If one is found, add its id } # Number to the list of results. } if (!@results) { &exitnone; } # If no match is found, exit. } # Now repeat for each criteria. If there are results from a previous # match, search them instead, and remove any entries that don't match. if (($phone=$FORM{'phone'}) && !@results) { for ($I=0; $I <=$maxn; $I++) { if ($phone eq $add[$I]{'phone'}) { push(@results,$I); } } if (!@results) { &exitnone; } } elsif ($phone=$FORM{'phone'}) { @r2=@results; foreach $I (@r2) { if ($phone ne $add[$I]{'phone'}) { @results=grep(!/$I/,@results); } } if (!@results) { &exitnone; } } if (($street=$FORM{'street'}0 && !@results) { for ($I=0; $I <=$maxn; $I++) { if ($street eq $add[$I]{'street'}) { push(@results,$I); } } if (!@results) { &exitnone; } } elsif ($street=$FORM{'street'}) { @r2=@results; foreach $I (@r2) { if ($street ne $add[$I]{'street'}) { @results=grep(!/$I/,@results); } } if (!@results) { &exitnone; } } if (($city=$FORM{'city'}) && !@results) { for ($I=0; $I <=$maxn; $I++) { if ($city eq $add[$I]{'city'}) { push(@results,$I); } } if (!@results) { &exitnone; } } elsif ($city=$FORM{'city'}) { @r2=@results; foreach $I (@r2) { if ($city ne $add[$I]{'city'}) { @results=grep(!/$I/,@results); } } if (!@results) { &exitnone; } } if (($state=$FORM{'state'}) && !@results) { for ($I=0; $I <=$maxn; $I++) { if ($state eq $add[$I]{'state'}) { push(@results,$I); } } if (!@results) { &exitnone; } } elsif ($state=$FORM{'state'}) { @r2=@results; foreach $I (@r2) { if ($state ne $add[$I]{'state'}) { @results=grep(!/$I/,@results); } } if (!@results) { &exitnone; } } if (($zip=$FORM{'zip'}) && !@results) { for ($I=0; $I <=$maxn; $I++) { if ($zip eq $add[$I]{'zip'}) { push(@results,$I); } } if (!@results) { &exitnone; } } elsif ($zip=$FORM{'zip'}) { @r2=@results; foreach $I (@r2) { if ($zip ne $add[$I]{'zip'}) { @results=grep(!/$I/,@results); } } if (!@results) { &exitnone; } }

At this point, either we have successful matches that are stored in the array @results, or we have no matches, in which case we call the &exitnone subroutine. Now we can give the client the results (or lack thereof).

# If there are no matches, print a note then die. sub exitnone { print <<EOE; <HTML><HEAD><TITLE>No matches</title></head> <BODY> <h3>There were no matches that fit your criteria.</h3> <A HREF="addrbk.html">Go</a> back to the form to try again. </body></html> EOE die; } # Print all the fields of each match. print <<EOP; <HTML><HEAD><TITLE>Search Results</title></head> <BODY> <h3>The entries that matched your search</h3> <pre> EOP foreach $r (@results) { print <<EOG; ---- Name: $add[$r]{'name'} Phone: $add[$r]{'phone'} Address: $add[$r]{'street'} $add[$r]{'city'}, $add[$r]{'state'} $add[$r]{'zip'} EOG } print <<EOH; </pre><br> Thank you for using my address book. <A HREF="addrbk.html">Go</a> back to the form to make another search. </body></html> EOH

Now we have a working front end to an address book. There are several optimizations that could be made, but it runs quite well for a few dozen lines. Note that this script, as is, only does a Boolean AND search on all fields. It would be possible to make it an OR search by removing all calls to &exitnone except for the last one. This way, when the program does not find any matches, it will not die but move on to the next field. It would also be possible to enable the end user to choose whether to do an AND or OR search by adding a pull-down menu to the form page. Then the CGI could exit or not depending on the choice.

Now that the user can search your database for any number of criteria; the next logical question is how to add or remove information to the database. You could, of course, do this by hand, but it would be advantageous to allow direct manipulation of the database from the Web itself. Fortunately, this is not hard.

In the manipulation of the database itself is where the difference between delimited and fixed-length record databases becomes important. With delimited text, you have no easy way of knowing where one record ends and another begins. Therefore, to change or delete one record it's necessary to rewrite the entire database. So in small databases, this is not really a big performance hit. If your database is large enough that this becomes a problem, it would probably be a good idea to look into using a database server.

With fixed-width field databases, however, it's not necessary to rewrite the entire database to change a record. Because the length of each record is known, functions like seek() and tell() (or their equivalent in your preferred language) can be used to write over only a portion of the file, changing or deleting records.

Caution

Enabling users to write to files is one of the most dangerous undertakings on the Web. Most Web servers are run as user 'nobody' (on systems that have distinct users). This means that the server has no special permissions to write to any file. To be accessible to a CGI script, a file must be world writeable, meaning that anyone with access to the server machine can modify the file in any way (including erasing it entirely). If you trust everyone on your machine (or you are the only user), this may not be a terrible problem. Because the name of the database file is not visible from the Web, you could hide it in some far-out directory with an unusual name, thereby providing "security by obscurity." There are other solutions, however. A module exists for Perl called CGIWrap (http://wwwcgi.umr.edu/~cgiwrap), and similar modules exist for other languages. CGIWrap will execute your CGI program "setuid owner." That is, the program runs as if it were executed by the user who owns the program. This allows you to remove write privileges for everyone but yourself. Be aware, however, that the program can now modify any file in your directory as if it were you. Therefore, it is wise to make very sure that your programs are secure. (For Perl users, try running the script with the -Tw switch.)

As with the front end to the searching CGI, a simple HTML form is all that is required to enable users to directly modify your database. Here again is our address book example:

<HTML><HEAD><TITLE>My Address Book</title></head> <BODY> <h4>Fill out the form below to add an entry to the address book</h4> <FORM ACTION="add.cgi" METHOD="POST"> Name: <INPUT SIZE=45 NAME="name"><br> Phone: <INPUT SIZE=45 NAME="phone"><br> Street: <INPUT SIZE=45 NAME="street"><br> City: <INPUT SIZE=20 NAME="city"> State: <INPUT SIZE=3 NAME="state"> Zip: <INPUT SIZE=6 NAME="zip"> <br><br> <INPUT TYPE=SUBMIT VALUE=" Add Entry "> <INPUT TYPE=RESET VALUE=" Reset Form "> </form></body></html>

This form is almost identical to the one we made for searching. The difference comes in how the data is treated by the CGI program. In this case, the CGI script for adding an entry is actually much simpler than the searching script. In this case, we will assume that the database is a world writeable file.

#!/bin/perl require cgi_head; # Set up the CGI environment while (-e "datalock") { sleep 1; } # If database is currently being # modified, wait. system("touch datalock"); # Lock database open (DAT, "database.txt"); # open the database for reading while (<DAT>) { $line=$_; } # Read the last line of the database close DAT; if ($line=~ /:/) { @field=split (/:/, $line); $num=$field[0]; # Get last ID number $num++; } else { $num=0; } # Create new ID number open (DAT, ">>database.txt"); # open the database for appending # Add entry to database print DAT "$num:$FORM{'name'}:$FORM{'phone'}:$FORM{'street'}:$FORM{'city'}:$FORM{'state'}: Â$FORM{'zip'}\n"; close DAT; system ("rm datalock"); print <<EOF; <HTML><HEAD><TITLE>Addition Successful</title></head> <BODY> <h4>Your entry has been added to the address book</h4> <A HREF="add.html">Go</a> back to the form to add another user. </body></html> EOF

In effect, this CGI script simply appends the new entry to the database. The first snag becomes file locking. If someone else is modifying the database at exactly the same time, one of the changes will be lost or the entire database will become corrupted. To circumvent this, we use a lock file to tell if someone else is writing to the database. This is far from the most elegant solution, and most systems provide a flock() function to more effectively lock the file from simultaneous access. Secondly, the ID number of the entry must be determined. In this case, we can assume that the entries will be sequentially numbered and that the last entry will have the last ID number. So we simply read the last line of the database, grab the ID number from that, and then increment it to obtain the new ID number.

Now that anyone can add entries to the address book, it may become necessary to delete or modify entries. To do that, however, there must be some way for the user to indicate the desired entry to modify or delete. Instead of creating a whole new form for this, we can add this functionality to our existing search CGI. If the user's search returns exactly one result, a line can be added to the HTML result page offering the option to modify or delete this entry. (This could be done for more than one result fairly easily, but we will stick with one for brevity's sake.) This can be done by changing the following lines at the bottom of the search CGI:

print <<EOH; </pre><br> Thank you for using my address book. <A HREF="addrbk.html">Go</a> back to the form to make another search. </body></html> EOH

to:

print "</pre><br>\nThank you for using my address book.\n"; print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another search.<br>\n"; if ($#results==0) { print "<A HREF=\"change.cgi?a=d&n=$result[0]\">Delete</a> this entry.<br>\n"; print "<A HREF=\"change.cgi?a=c&n=$result[0]\">Modify</a> this entry.<br>\n"; } print "</body></html>\n";

The added lines print links to a new CGI program, passing two values: a parameter indicating whether a deletion or a modification is wanted, and the ID number of the entry to delete or modify.

Because our database is delimited, we will have to regenerate the entire database to make a change, as shown in Listing 10.2.

Listing 10.2. Outputting an ASCII database.

#!/bin/perl require cgi_head; # Set up CGI environment while ( -e "datalock" ) { sleep 1; } # Wait while someone else is # modifying database. system ("touch datalock"); # Lock the database. # Load database open (DAT, "database.txt") || die "Can't open the database: $! !.\n"; $maxn=0; # A counter for the number of entries. while (<DAT>) { chop; @field=split(/:/); # Split the line into the data fields. $n=$field[0]; # First field is an id number $add[$n]{'name'}=$field[1]; # Then the name $add[$n]{'phone'}=$field[2]; # The phone number $add[$n]{'street'}=$field[3]; # The street address $add[$n]{'city'}=$field[4]; # The city $add[$n]{'state'}=$field[5]; # The state $add[$n]{'zip'}=$field[6]; # The Zip Code $add[$n]{'line'}=$_ . "\n"; # The entire line } $maxn=$n; close DAT; open (DAT, ">database.txt"); # Open database for writing. if ($FORM{'a'} eq "d") { # If a deletion is being requested, for ($I=0; $I <=$maxn; $I++) { #print all entries except the unless ($I==$FORM{'n'}) { # one to be deleted. print DAT $add[$I]{'line'}; } } # Print a message then exit. print <<EOP; <HTML><HEAD><TITLE>Request successful</title></head> <BODY> <H3>The selected entry has been deleted.</h3> <A HREF="addrbk.html">Go</a> back to make another search. </body></html> EOP close DAT; system ("rm datalock"); die; } elsif ($FORM{'a'} eq "c") { # If the user wants to modify the entry, things become a bit trickier. # We must first print out a form, similar to the original form, to allow # the user to change the values of the entry. $n=$FORM{'n'}; # Put the entry to be changed in an easier to type # variable. print <<EOF; <HTML><HEAD><TITLE>Entry Modification</title></head> <BODY> <h4>Make the desired changes in the form below.</h4> <FORM ACTION="change.cgi" METHOD="POST"> <INPUT TYPE=HIDDEN NAME="a" VALUE="m"> <INPUT TYPE=HIDDEN NAME="n" VALUE="$n"> Name: <INPUT SIZE=45 NAME="name" VALUE="$add[$n]{'name'}"><br> Phone: <INPUT SIZE=45 NAME="phone" VALUE="$add[$n]{'phone'}"><br> Street: <INPUT SIZE=45 NAME="street" VALUE="$add[$n]{'street'}"><br> City: <INPUT SIZE=20 NAME="city" VALUE="$add[$n]{'city'}"> State: <INPUT SIZE=3 NAME="state" VALUE="$add[$n]{'state'}"> Zip: <INPUT SIZE=6 NAME="zip" VALUE="$add[$n]{'zip'}"> <br><br> <INPUT TYPE=SUBMIT VALUE=" Modify Entry "> <INPUT TYPE=RESET VALUE=" Reset Form "> </form></body></html> EOF # This form adds two hidden fields, telling this CGI which entry to # modify. for ($I=0; $I <=$maxn; $I++) { print DAT $add[$I]{'line'}; } close DAT; system ("rm datalock"); die; } elsif ($FORM{'a'}="m") { # Make the change on the modified entry. $n=$FORM{'n'}; # Copy the entry to be changed into a more # typeable variable. # Assign the modified values to the entry. $add[$n]{'name'}=$FORM{'name'}; $add[$n]{'phone'}=$FORM{'phone'}; $add[$n]{'street'}=$FORM{'street'}; $add[$n]{'city'}=$FORM{'city'}; $add[$n]{'state'}=$FORM{'state'}; $add[$n]{'zip'}=$FORM{'zip'}; $add[$n]{'line'}= "$n:$add[$n]{'name'}:$add[$n]{'phone'}:$add[$n]{'street'}:$add[$n]{'city'}:$add[$n]{'state'}: Â$add[$n]{'zip'}\n"; for ($I=0; $I <=$maxn; $i++) { print DAT $add[$i]{'line'}; } close DAT; print <<EOE; <HTML><HEAD><TITLE>Modification successful</title></head> <BODY> <H4>The requested entry has been modified.</H4> <A HREF="addrbk.html">Go</a> back to the form to make another search. </body><//html> EOE system ("rm datalock"); die; } else { die; } # This should never be reached.

Now we have a complete address book system in which entries can be added, deleted, modified, and searched in any number of categories. As it is, though, this address book is lacking in several important areas:

Clean Code-For readability's sake, the code in this example is rather naive. There are several sections that could be optimized, and the entire thing could benefit from a more object-oriented approach. Object orientation is one of the most important things to strive for when developing a CGI program (or any program designed for use on the Internet). Unless your code must be proprietary, there is a good chance that it will travel far and wide across the Web and will be used by many different people in many different ways. A modular approach allows people to reuse portions of the code that are useful to them and makes understanding the structure of the program much easier.
User Interface-The HTML pages generated by these CGI scripts are Spartan to the extreme. A well-thought-out design can make your CGI more pleasant to use, encouraging people to come back. It is important, however, not to take it too far. If your database is for business or academic purposes, the Web pages should reflect that. Also make sure (especially if you are in an academic environment) that your scripts are usable by as many people as possible. Many people are still connecting to the Internet through modems 14.4 Kbps and slower. Large graphics, Java applets, or other needless bells and whistles can make your page worthless to a large potential audience. And if at all possible, make the page usable from a line-mode browser such as Lynx. Such browsers are the last refuge for people with very slow connections, and are the only means of access for the visually impaired and other physically challenged people.
Functionality-While it's important not to make your CGI programs too complicated to use, it helps to anticipate the needs of your users and to include as much functionality as possible. In our address book, for example, a better search engine would greatly improve its usefulness. As it stands, the script returns only entries that match exactly with database entries. If the scripts could do substring matching, a person could search for all entries with a common last name or all people who have the same telephone prefix. With Perl's pattern matching and regular expression capabilities, this would not be a difficult addition.

Web Indexing

Somewhere between small text databases and large database servers are databases that contain information about the Web itself. Such databases provide for users the capability to search for information on a site without having to look at every page by hand. The most common modus operandi of a Web index is as follows: A user enters one or more keywords into an HTML form; the search engine gathers the URLs of pages that match the keywords; the results are returned to the user weighted by some sort of scoring mechanism.

Indexers

The first step in putting a searchable index of information on the Web is generating that index. A number of freely available packages exist on the Internet to do just that, including Wais, Swish, Ice, and Glimpse.

Wais

Probably the most common Web indexer in use today (which predates the existence of the Web) is Wais (or freeWais or freeWais-sf). Wais was originally developed by Wais, Inc. (now owned by America OnLine). The most recent development on Wais has branched off into a freely redistributable version called freeWais and an enhanced version called freeWais-sf. Information about Wais in general (and freeWais-sf in particular) is available at

http://l6-www.informatik.uni-dortmund.de/freeWAIS-sf/

Source code is available from

ftp://ftp.germany.eu.net/pub/infosystems/wais/Unido-LS6/

Wais was designed as an all-purpose text indexer but is very useful at indexing HTML and other Web-related documents.

Installing freeWAIS-sf creates several programs, including waisserver, waissearch, waisq, and waisindex. Waisserver is a daemon that accepts requests from any machine on the Internet, processes queries, and returns information on the requested documents with weighted scoring information. Waissearch is a client used to connect to waisservers across the Internet. Waisq is a client for use on a local server. Waisindex is the actual index program. It takes a list of files and generates a database containing all the words on the files, sorted and weighted by a number of criteria. At the current time, indexes generated by waisindex are about twice the size of the original documents.

Swish

Swish was developed by Kevin Hughes of EIT. It is available at

http://www.eit.com/goodies/software/swish/

Swish was designed from the ground up as an HTML indexer. It is not (nor does it claim to be) as complex or full-featured as Wais, but it is much smaller, simpler to install, and easier to maintain. Both the indexer and the search engine are in the same program. Also, because it was designed for the Web, Swish is able to take into account HTML tags, ignoring the tags themselves and giving higher precedence to text within certain tags (like headers). One of the most noticeable drawbacks of Swish is that it does all of the indexing in RAM. So the total of all the files you wish to index cannot exceed your RAM (Wais offers a maximum RAM switch with its indexer). However, unless you have a very large site (say, over 30 MB of files on a 32 MB machine), this should not be a problem.

Ice

Ice is a Web indexing program written entirely in Perl. It uses a very simple indexing format that becomes slow with large numbers of documents but is very fast and efficient for sites with up to a couple of thousand files. Ice was created and is maintained by Christian Neuss and is available at

http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html

Ice also supports a thesaurus file, which allows for synonyms and abbreviations while searching.

Glimpse

Glimpse is a fairly new entry in the indexer wars, having just now gained widespread attention as the default search engine of the Harvest system. Glimpse is similar to Wais in that it builds as several executables and offers many options when searching. Glimpse also appears to be highly intuitive, with most of its advanced searching options accessible with a simple command-line switch. It is being developed at the University of Arizona and is available at

http://glimpse.cs.arizona.edu:1994/

Search Engines

Once the index of files exists on your server, the next step is providing a way for users to access this from the Web. This is where CGI comes in. A CGI program must take a set of keywords (or some other sort of query) from a form, pass it to the search engine, and then interpret the results. Because all the work is done by the indexer/search engine, this front end can be fairly simple. Not coincidentally, there are dozens of them available on the Net, and it is not a major task to customize one for your own use.

Wais Front Ends

Due to the popularity of Wais, interfaces between it and the Web are very common. A Perl interface (WAIS.pm) is standard with certain releases of freeWAIS-sf. Another Perl front end (wais.pl) comes with ncSA httpd. A list of other interfaces between Wais and the Web can be found on Yahoo at

http://www.yahoo.com/Computers_and_Internet/Internet/Searching_the_Net/WAIS/

Other Front Ends

Several front ends exist for the other search engines, as well. Ice comes with its one CGI program (ice_form.pl). WWWWAIS is a program by the maker of Swish that serves as a front end to both Wais and Swish indexes. It is available at

http://www.eit.com/goodies/software/wwwwais/

Harvest is an ambitious set of tools developed by Colorado University, Boulder, which hopes to provide a central package to "gather, extract, search, cache, and replicate" information across the Internet. Harvest uses Glimpse as its default search engine. It is available at

http://harvest.cs.colorado.edu/harvest/

Rolling Your Own

With a little thought and effort, it is not hard to create your own custom front end for an existing search engine. A few things must be considered:

Getting the information from the form-This is the easy part. As mentioned earlier in the chapter, there are packages for all CGI languages that serve to retrieve information from a form and store it in variables of some kind. In this case, the information will be a list of keywords and perhaps some constraints such as a Boolean AND or OR search, a maximum number of results to return, or a specific index to search.
Parsing the information-Before passing it to the search engine, the data must be put in the right form (usually as command line arguments). Also at this stage, simple error detecting can be performed. Checks should be made that the user entered all necessary data in the form.

Caution

At this point, the program should also check to make sure that the user is not trying to pull a fast one. In the next step, an external program is called, so care must be taken to prevent the infamous keyword; rm -rf / trick. Almost universally, a semicolon is a command separator, and so a wannabe attacker could insert one into his or her query, followed by his or her malicious command(s).

Don't fall into this trap.

Calling the search engine-Now that all of the information has been verified as safe and is in the correct format, it must be passed to the search engine. Using UNIX-derived languages (such as C/C++ and Perl), this is most effectively accomplished by using process pipes. Consider the following snippet of Perl code that takes the prepared information, passes it to the Wais search engine, and then reads the output:

pipe(P0R,P0W); # Creates one read/write pipe pipe(P1R,P1W); # Creates another read/write pipe if ($pid=fork) { # This created a new process, # This is the parent process close(P0R); # Close the read end of the first pipe close(P1W); # and the write end of the other one &read_from_wais(P1R); # This calls a subroutine which is fed input # into P1R. It then interprets it into search results. } elsif (defines $pid) { # This is the child close(P0W); # Close the write end of the first pipe close(P1R); # Close the read end of the second pipe open(STDIN, "<&P0R"); # Duplicate P0R as the standard input open(STDOUT, ">&P1W"); # Duplicate P1W as the standard out # Now the standard output will travel through P1W into P1R which # is being held by the parent who sends it off to the subroutine. exec(@argline) || die; # @argline holds the command to execute the # Wais search engine # At this point the child dies } else { die("Can't fork!"); } # This is only reached if fork() # fails # Parent now continues with any information retrieved from the # search engine.
Manipulating pipes and forks can be tricky at first, but it greatly increases the power of interprocess communication, which is necessary to interact with an external search engine.

Large Scale Databases

At some point, you may encounter a project that is simply too big for a text-based database and is not suited for a text-indexing system. Fear not; others have been down this road and fortunately have left a lot of software behind to help integrate large database servers with the Web. A "large scale" database need not be large, per se. It is simply any database that is not a flat ASCII file. Popular commercial databases apply, such as dBASE, Paradox, and Access (although they are all able to read ASCII files, it is just not their preferred method of storing information). Also fitting this category are database servers such as Sybase, Oracle, and mSQL.

When dealing with a large-scale database, the trick is not in storing or manipulating the data as it is with the text database. The database server does all that work for you. The trick is communicating with the database server. There are almost as many database communication protocols as there are databases, despite the existence of some very complete standards (such as SQL). Programs exist for practically every database that has communications capabilities to interface with the Web. A list of some programs follows:

Tip

Much of the information that follows can be found online (in, no doubt, an updated form) at Jeff Rowe's excellent page

http://cscsun1.larc.nasa.gov/~beowulf/db/all_products.html

4D
NetLink/4D (http://www.fsti.com/productinfo/netlink.html)-This is a commercial product for Macintosh computers running the WebStar server. It allows users to directly manipulate 4D databases.
Microsoft Access
4W Publisher (http://www.4w.com/4wpublisher/)-This is a commercial product that generates static HTML pages from an Access database. A CGI version is due out soon that will allow dynamic access to the database.
A-XOrion (http://www.clark.net/infouser/endidc.htm)-This is a custom commercial database server for the Windows platform. It allows real-time access to major brand pc databases (Paradox, dBASE, FoxPro Access), but it requires Access to run.
dbWeb (http://www.axone.ch/dbWeb/)-This is a freeware tool to maintain large hypertexts using an SQL interface to Access. It has the capability to export large documents in a variety of formats including HTML pages, Microsoft Viewer, and tagged text.
DB2
DB2WWW (http://www.software.ibm.com/data/db2/db2wfac2.html)-This is IBM's own proprietary (and commercial) interface to their DB2 database system. It uses SQL statements to interact with the Web server and generate HTML pages.
FileMaker Pro
ROFM (http://rowen.astro.washington.edu/)-This is a freely available interface to FileMaker Pro databases. It is only for Macintosh computers.
FoxPro
FoxWeb (http://www.foxweb.com/)-This is a commercial product for Windows NT that allows Web server integration with FoxPro databases.
GemStone
GemStoneWWW (http://ftp.tuwien.ac.at/~go/Implementation.html)-This is a mini-HTTP server that handles requests that interface with a GemStone Smalltalk database system. It is freely available.
General
Amazon (http://www.ieinc.com/webnews.htm)-This is an all-purpose, commercial "legacy system" integrator. That is, it takes data from Web forms and uses it to interact with systems that were designed before the Web, such as Oracle, Sybase, SQL Server, ODBC systems, and others.
DBI (http://www.hermetica.com/techologia/DBI)-DBI is a database API (Application Programming Interface) for Perl. It is sort of a catch-all set of functions, variables, and conventions used to access any sort of database. There is currently support for Oracle, mSQL, Ingres, Informix, Sybase, Empress, C-ISAM, DB2, Quickbase, and Interbase databases, and more are in the works. Like all good Perl software, this is absolutely free.
DBGate (http://fcim1.csdc.com/DBGate/dbintro.htm)-This is a commercial "serverless" database product. It interacts directly with the database files (Access and FoxPro are currently supported), thereby removing the need for ODBC or SQL communication with a database server.
HyperStar WDK (http://www.vmark.com/Products/WDK/index.html)-This (commercial) package acts as a go-between from the Web browser to the server, interpreting database commands along the way. It supports a large number of databases including Oracle, Sybase, Informix, UniVerse, Ingres, Microfocus Cobol ISAM, and PI/open.
mgyWeb (http://www.mgy.ca/mgyweb/)-This is a commercial solution for Windows 95 or NT that supports "many" databases and any Windows 95-compatible Web server.
Sapphire/Web (http://www.bluestone.com/products/sapphire)-This is a commercial application builder that enables users to easily create C/C++ CGI programs that interface with several popular databases.
mSQL
MsqlPerl (ftp://Bond.edu.au/pub/Minerva/msql/Contrib/MsqlPerl-1.03.tar.gz)-This is a Perl interface to Msql that mimics the Msql API for the C language. It allows on-the-fly updates of any mSQL database and is freely available.
ODBC
Cold Fusion (http://www.allaire.com/cfusion/)-This is a commercial product for Windows 95 and NT that has become very popular in those circles. It allows for direct connectivity to any ODBC database. It contains scads of features and an easy-to-use interface.
DataRamp (http://dataramp.com)-A commercial client/server/browser combo that allows secure read/write access to any ODBM database (including Access).
Oracle
Decoux (http://www.abs.ee/~wow/htdocs/sdk10/decoux/)-This is a CGI interface to Oracle databases. It uses SQL statements imbedded in HTML tags that are parsed before sending the results to the client.
Oracle WWWIK (http://www.abs.ww/~wow/htdocs/)-This is a collection of CGIs and other tools to connect Oracle databases to the Web. They are all freely available.
ORAlink (http://oradb1.jinr/dubna.su/Software/ORALink)-This is a freeware product for Windows NT that uses CGI to access Oracle databases from the Web.
SQL
dbCGI (http://www.progress.com/webtools/dbcgi/dbcgi.htm)-This is a freely available C program that allows SQL statements to be embedded into HTML files. A CGI program interprets the statements and takes appropriate action before displaying the HTML to the client.
GSQL (http://www.santel.lu/SANTEL/SOFT/gsql_new.html)-This is a Mosaic only interface with SQL database servers. It is freely available.
web.sql (http://www.newmedia.sybase.com/Offerings/Websql/web_spec.html)-This is a free product from Sybase that allows SQL statements to be imbedded into HTML files. The statements are then translated by a CGI program, and the results are passed to the client.
Sybase
Genera (http://gdbdoc.gdb.org/letovsky/genera/genera.html)-This is a freely available package that integrates Sybase databases with the Web. It supports Web page generation from Sybase databases and HTML form queries.
Sybase WWW Tools (http://www.sybase.com/WWW/)-A collection of CGIs and other tools for integrating Sybase databases for the Web. Freely available.

This is nowhere near a complete list of CGI resources for databases, and new products are being developed constantly. Consider this a jumping-off point to explore the possibilities. But suppose you've painstakingly checked out all of the products available, and you still cannot find a CGI program to meet your needs. All is not lost. By using common tools found on many Web sites, it may be possible to build your own CGI to interact with your database server of choice.

Consider the address book example from the beginning of the chapter. How could that be accomplished with a database server? First we need to decide on the right tools for the job. A good freely available database is mSQL (http://www.hughes.com.au/). It offers most of the functionality of a full-fledged SQL server, with low overhead and easy installation. The mail mSQL distribution comes with a C API and function library, but we'll stick with Perl just 'cause it's cool. There are several mSQL interfaces for Perl, some of which were mentioned in the preceding section. We'll use MsqlPerl for no particular reason (I've found that all of the Perl mSQL interfaces work fairly well).

The MsqlPerl module provides direct emulation of the C API functions. You can submit queries to the database which add, select, delete, and modify the database directly.

Note

There are two things to be aware of here. First, the program msqld must be running on whatever machine the database is stored on. This need not be the same machine as the Web server because the API provides functions to connect to a remote mSQL server.

Secondly, MsqlPerl is compiled as a dynamically loaded extension to Perl. Certain systems have trouble dynamically loading from Perl (notably AIX and non-ELF Linux). If your MsqlPerl script is failing, make sure that your copy of Perl is able to dynamically load properly.

We can use the same HTML form pages we used for the text database example. The first form searched an existing database. The appropriate CGI program is shown in Listing 10.3.

Listing 10.3. Searching an SQL database.

#!/bin/perl require cgi_read; # Set up CGI environment use Msql; # Load the Msql module (See note above for a caveat.) $dbh=Connect Msql; # Connect to the local mSQL server SelectDB $dbh "addresses"; # Selected the "addresses database. Assume that # it already exists. mSQL comes with a utility # called 'msqladmin' which can create databases. $all="name, phone, street, city, state, zip"; # All address fields $query=""; # Set aside a variable to hold the query. foreach (keys %FORM) { # Gather all existing form requests into $query=. " $_=$FORM{'$_'} AND"; # one line. } $query=~ s/AND$//; # Get rid of that annoying trailing 'AND' $sth=Query $dbh "SELECT $all FROM addresses WHERE $query"; # Send query print "<HTML><HEAD><TITLE>Search Results</title></head><BODY>\n"; $I=0; # 0 mean no results. while (@arr=FetchRow $sth) { if ($I==0) { $I=1; } # 1 means 1 results if ($I==1) { # Print success message print "<H4>Your search results are listed below</h4>"; print "<PRE>\n"; }; $I++; # Print results one at a time. print <<EOF; -- -- Name: $arr[0] Phone: $arr[1] Street: $arr[2] City: $arr[3] State: $arr[4] Zip: $arr[5] EOF } if ($I==2) { print "</pre><br>"; print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another Âsearch.<br>\n"; print "</body></html>\n"; } elsif ($I==0) { print <<EOE; <HTML><HEAD><TITLE>Search Failed!</title></head><BODY> <h4>There are no entries which match your criteria</h4> <A HREF="addrbk.html">Go</a> back to the form to make another search. </body></html> EOE }

The actual CGI code in this case is about one quarter the size of the equivalent CGI in the first example. In addition, it is trivial to add substring matching to an mSQL query. Also note the distinct absence of ID numbers. mSQL is a relational, random access database. That is, it does not need to read in all entries to access any one of them.

The real power of a database server becomes apparent when modifying the database itself.

Note

Because the modification of the database is done by the server itself, the file permission issues of text databases do not apply here. Instead, most database servers have their own access rights schema that can be used to allow only select users to access certain databases.

Now we consider the CGI required to add entries to the database. Once again, we assume the a database called addresses exists that contains a table called addresses.

#!/bin/perl require cgi_head; use Msql; $dbh=Connect Msql; # connect to the local mSQL server SelectDB $dbh "addresses" # select the 'addresses' database Query $dbh "INSERT INTO addresses ( name, phone, street, city, state, zip ) VALUES Â('$FORM{'name'}, $FORM{'phone'}, $FORM{'street'}, $FORM{'city'}, $FORM{'zip'} )"; print <<EOF; <HTML><HEAD><TITLE>Addition successful</title><head><BODY> Your entry has been added to the address book. <A HREF="add.html">Go</a> back to the form to add another entry. </body></html> EOF

Now I'll bet you're wondering why you ever used flat text databases to begin with! In a fraction of the code of the flat text version, a much more flexible addition scheme is produced. Of course, the biggest problem in using a database server is finding one. There are many situations in which you may have no control over which programs you have access to. In these cases, a text database may be your only recourse. Also, if a text database becomes corrupt, you load it into your favorite text editor and fix it. If your relational database becomes corrupt, unless you are a database expert you better have recent backups available.

Now into the final stretch, allowing for deleting and modifying entries from the database. Like the text database, a change must be made to the HTML form generated by the search program to allow the user to delete or modify the entry. Unlike the text database, we have no unique ID number to identify the entry. Therefore, we must pick another unique attribute to identify the entry. We could add an ID number into the database, but for simplicity's sake, we will use the name as the unique field. (And if you know two people with the exact same first, last, and middle names, you know too many people.) In databases, this sort of unique field is known as a "primary key."

So the lines near the end of the search CGI program

print "</pre><br>"; print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another search.<br>\n"; print "</body></html>\n";

become

print "</pre><br>"; print "<A HREF=\"addrbk.html\">Go</a> back to the form to make another search.<br>\n"; if ($I==1) { print "<A HREF=\"change.cgi?a=d&name=$arr[0]\">Delete</a> this entry.<br>"; print "<A HREF=\"change.cgi?a=c&name=$arr[0]\">Modify</a> this entry.<br>"; } print "</body></html>\n";

And the CGI itself simplifies considerably:

#!/bin/perl require cgi_head; use Msql; $dbh=Connect Msql; SelectDB $dbh "addresses"; $all="name, phone, street, city, state, zip"; if ($FORM{'a'} eq "d") { Query $dbh "DELETE FROM addresses WHERE name=$FORM{'name'}"; print <<EOF; <HTML><HEAD><TITLE>Deletion successful</title></head><BODY> <h3>Your entry has been deleted</h3> <A HREF="addrbk.html">Go</a> back to the form to make another search. </body></html> EOF die; } elsif ($FORM{'a'} eq "c") { $guy=Query $dbh "SELECT $all FROM addresses WHERE name=$FORM{'name'}"; @guy=FetchRow $guy; print <<EOE; <HTML><HEAD><TITLE>Modify Me</title></head><BODY> <h4>Modify your entry in the form below</h4> <FORM ACTION="change.cgi" METHOD="POST"> <INPUT TYPE=HIDDEN NAME="a" VALUE="m"> Name: <INPUT SIZE=45 NAME="name" VALUE="$guy[0]"><br> Phone: <INPUT SIZE=45 NAME="phone" VALUE="$guy[1]"><br> Street: <INPUT SIZE=45 NAME="street" VALUE="$guy[2]"><br> City: <INPUT SIZE=20 NAME="city" VALUE="$guy[3]"> State: <INPUT SIZE=3 NAME="state" VALUE="$guy[4]"> Zip: <INPUT SIZE=6 NAME="zip" VALUE="$guy[5]"> <br><INPUT TYPE=SUBMIT VALUE=" Modify Entry "> <INPUT TYPE=RESET VALUE=" Reset Form "> </form></body></html> EOE die; } elsif ($FORM{'a'} eq "m") { foreach (keys %FORM) { # Gather all existing form requests into $query=. " $_=$FORM{'$_'},; # one line. } $query=~ s/,$//; # Get rid of that annoying trailing ',' Query $dbh "UPDATE addresses SET $query WHERE name=$FORM{'name'}"; print <<EOF; <HTML><HEAD><TITLE>Modification successful</title></head><BODY> <h3>Your entry has been modified</h3> <A HREF="addrbk.html">Go</a> back to the form to make another search. </body></html> EOF die; # Th' th' that's all folks }

It is as simple as that. Fifty lines of Perl as opposed to over one hundred. The sheer flexibility of a database server opens the door to worlds of possibilities.

Summary

The world of database interaction with the Web is as wide as it is deep. For any problem, there are dozens of solutions-some better than others. It all boils down to personal preference and experience. For a person with a background in Windows data processing, it would probably be easiest to use a Windows database such as Paradox or Access along with one of the ready-made CGI interfaces for them. A seasoned C programmer may prefer playing with one of the APIs provided for the various SQL servers such as Sybase, Oracle, SQL Server, and mSQL. Someone who is "just another Perl hacker" would probably feel most comfortable with one of the many Perl interfaces to SQL and ODBM database server (or perhaps they would just write one of their own).

Within this realm of great flexibility, certain things must be considered to make an effective Web-Database interaction:

What tools are available? Are you the systems administrator of a high-powered UNIX server? Or perhaps your only Internet connection is a PPP linked Windows machine. It is possible to get almost any job done with any tool, but it helps greatly to know your capabilities before you begin. If you are using a Windows platform, it may not be worth it to fight with badly implemented versions of programming languages when there is a huge library of tools already available, albeit for a cost. If UNIX is your stomping grounds, you may want to think twice before shelling out thousands of dollars for a Sybase or Oracle server when free tools such as mSQL and POSTGRES are easily available. (But then again, perhaps your needs are so complex that only a high-end database server will suffice.) It pays to be aware of what you have available to you.
What is the size of your project? Size does matter. As we have seen in this chapter, flat text based databases simply don't scale well with size. A 50 MB version of our phone book could bring even the most powerful RISC server to its knees. At the same time, it takes time to construct a database using a server. You need to have permission, and above all, you need to have access to a server. If you are a student with a university granted account, there's a good chance that text is all you have to deal with. Even between database servers, the size and complexity of your project makes a difference. Keeping a database of every phone number in America might be tough job for a personal database like Access or Paradox. On the other hand, a database with tables linked to tables in weird and convoluted ways is simply not possible in free servers like mSQL. (POSTGRES does has more functionality, at the price of speed, complexity, and lack of standard SQL interface.) Before you begin your project, try to anticipate its maximum size and plan accordingly.
To program or not to program (which language is the question)? With the rather considerable list of resources shown earlier in the chapter, why would anyone need to write their own CGI at all? Perhaps you don't. If you have the money to spend, there is a good chance that something is out there that will suit your needs. Even if you aren't rich, you would probably find something close. But it is not always close enough. Maybe one little tweak would add that last little feature you want. Or perhaps the existing software does too much, and you would prefer the added speed gained by a few less bells and whistles. A little knowledge of how CGIs communicate with databases can go a long way.
Is there an alternative to CGI? The Web is a big place, growing every day, and CGI is no longer the only game in town. Don't fall into the trap of ignoring new technologies just because they are new. (At the same time, don't be blind to their faults. Every new language or protocol must go through a testing period. During that time, applications using that language or protocol have great potential to cause damage, either directly or by opening security holes.) Several companies have developed original Web servers that are specifically designed to interface with databases. For example, NeXT Inc.'s WebObjects (http://www.next.com) is an innovative object-oriented Web server that directly interacts with a powerful back-end database. In addition to these there is Java, which has gained an enormous amount of publicity in its first year on the Internet. Java has the capability to open persistent connections between the server and client, providing a means for updating the user's view of the database continually. For example, using Java with a database of stock prices could provide a running stock ticker on the user's screen. Beware, however, Java is the archtypal "new technology" with all of the good and bad that comes with it.

Using CGI to integrate databases with the Web follows naturally from the capabilities of CGI. Bypassing a great deal of CGI's shortcomings, database interaction is perhaps one of the only areas in which CGI will remain the best tool for the job in the face of new technologies.