Storage and retrieval of chemical structures using a relational database usually requires purchasing one of the commercially available and very powerful software products designed for this purpose, e.g. solutions from MDL Inc. or CambridgeSoft. These database systems are implemented (e.g.) as add-ons for the Oracle database engine (another commercial product), using a relational Oracle SQL database as the underlying storage container, and adding the requisite functionality (e.g. for structure/substructure retrieval) as a so-called "cartridge" which provides an interface to other applications. In a simplified picture, these latter applications then provide the user interface, they can be accessed e.g. via a web browser or via proprietary protocols and clients.
There are also other commercial solutions available, e.g. the JChem
package by Chemaxon, which can
use any relational SQL database for
storage and in which the "chemical intelligence" is implemented
by a set of Java programs running as servlets on a web server. Of
course, this is not a complete listing of available chemical database
management systems, and all of these products have their particular
merits, especially when it comes to handling very large numbers of
database entries. However, it is not very surprising that such software
packages are not really cheap. On the other hand, for many purposes it
would be sufficient to have a web-based chemical database capable of
handling somewhat smaller structure collections, e.g. in the range of
a few thousand to tenthousands of molecules. Unfortunately, it is not
easy to find software (preferably freeware) which permits the creation
of relatively simple, web-based molecular structure databases with full
structure/substructure search capabilities, although there are
promising building blocks available (such as the Chemistry Development Kit
which, however, requires installation of a complex Java software
bundle).
Therefore, we wrote such a piece of software ourselves. In the
following, it
will be shown how a standard PC running the Linux operating system, in
combination with the Apache web server, the MySQL database system, and
PHP as the scripting language can be enhanced by means of a compact
command-line program named "checkmol/matchmol"
into a searchable
chemical structure database.
Basic considerations
We decided to use a commonly used file format to store the
structures of our molecules, and this is the MDL mol format which is
well documented
and which is properly handled by many programs, including the popular ISIS/Draw
structure editor. At the present stage, we assume that structures are
created by such a helper application and are transferred to the web
server, e.g. by ftp or a Samba network share. Later on, one might
consider creating a web-based tool for on-line structure entry.
Whereas other chemical databases use an SQL table to store molecular
structures (e.g., in a "TEXT" or "BLOB" field), we initially decided to
place the
structures just as individual files in the web server's file system, as
they can
be easily inspected this way by command-line tools such as "less" and
"grep". As
filenames, we chose unique 5-digit numbers in the range of 00001
through 99999,
with the usual file name extension of ".mol". (Note: in the meantime
also the other option, storing structures as records in a database
table, was found to work very nicely and gives even better performance)
In addition, several MySQL tables are created within a database
named (e.g.) "moldb". The central table, named (e.g.) "mol", holds a)
the
unique entry number (mol_id) of each molecule (which corresponds to the
numeric portion of the filename) and b) textual information, such as
CAS registry number,
chemical nomenclature, stock quantity, and so on. Two other tables,
named "molstat" and "molfg", are required for efficient searching (see
below).
Setting up the database
The basic setup of the database is very easy, an example is given
below. First, a new directory within the web server's document root is
created, we named this directory "moldb". This is the place where we
store the HTML and/or PHP files which provide the user interface.
Within this directory, a subdirectory is created which holds the
molecular structure files; we named it "molfiles". Thus, assuming the
web server's document root to be "/data/htdocs" (it could also be
"/var/www" or something similar), the two new directories would be:
/data/htdocs/moldb
All of our MDL molfiles are copied into the latter directory.
/data/htdocs/moldb/molfiles
Next, the new MySQL database "moldb" is created, then two user accounts are added: one with administrative privileges and one with read-only access (the latter one will be used as a proxy-user account by the PHP script which provides the web-based user interface).
As MySQL administrative user (e.g., "root"), enter the following
commands within the MySQL console (of course, you should choose better
passwords than these examples):
Next, the table "mol" is created and populated with entries. We do
this with a simple shell script ("mkmol.sh", see below) which scans the
specified molfile
directory and adds an entry into the "mol" table for each molecule it
finds. Moreover, the script extracts the molecule's name from the first
line of the molfile (which can contain a comment) and places this
string into the field "mol_name" in
the "mol" table. Note: as the script contains the plain-text password,
make it unreadable for other users.
Creating auxiliary tables for
search operations
For efficient structure/substructure searching, it is essential to
have a set of
structural feature descriptors ("screens", "fingerprints", "keys",
etc.) which have to be
generated when a new molecule has been placed into the collection and
which are stored in another database table ("molstat"). This is the
first time our
utility program "checkmol" comes into effect.
By
invoking checkmol with the
command-line option "-X" and the filename of the molfile as arguments,
a set of such descriptors is generated and written to standard output.
From there, it can be captured and placed in the appropriate table row.
In fact, we use a Perl script (mkmolstat.pl)
to scan the whole molfile directory and
to create a "molstat" table, containing the mol_id and the values of
the descriptors as integers.
Analogously, we make use of checkmol's capability to recognize a
large number of functional groups in a given molecule. Our molecular
structure database thus can be enhanced by a "functional group" search
option. For this purpose, another table ("molfg") is created and
populated by running an appropriate Perl script (mkmolfg.pl).
Query input tools
For creating the query structure, the Java applet JME is
used. JME was developed by Peter Ertl at Novartis and can be obtained
from the author on request (free of charge). The applet can be easily
embedded into a web page; documentation and several examples are
available on the JME
homepage.
The output of JME can be
SMILES as
well as a JME native format or MDL molfile. We use the latter to
prepare our query.
Structure pre-selection with checkmol
Having obtained the MDL molfile representation of the input
structure (after the user has selected the "Submit" button of an
appropriate form), we need "checkmol" again.
Now, the structural
descriptors of the query molecule are generated by "piping" the molfile
through the "checkmol" command (this time with the "-x" option) and
capturing its output. In PHP, this can be done by the popen()
function, which we use as the core element of a function named filterThroughCmd()
(for details, check the PHP documentation for popen).
Having obtained the query molfile as the PHP variable $mol (via GET
or POST from the calling PHP script), we use the above function to do
the first check, after passing the content of $mol through the PHP
escapeshellcmd() function (for security reasons).
if ($mol !='') {
........(some further processing, see below)............
From the output which is stored in the variable $chkresult, an SQL
query string is
assembled, using basic PHP statements (str_replace
). The
final SQL query string would look like "SELECT mol_id FROM
molstat WHERE n_atoms>=12 AND n_bonds>=13 AND n_rings>=2 AND
....
". When
this query is submitted to
the MySQL database, a list of candidates is returned which all meet the
defined minimum requirements (e.g., number of rings containing two
nitrogens, number of halogen atoms, number of sp2-hybridized oxygens,
etc.). From this list, we must make our final selection by an
atom-by-atom comparison of the input structure with the respective
candidate structure.
Structure matching with matchmol
For this purpose, we use the "matchmol"
incarnation of our utility
program. It compares the query structure with each of the candidate
structures and returns "true" or "false", depending on the match result
(is the "needle" structure contained in the "haystack" structure or
not?). In principle, this operation can be performed by a PHP script as
follows:
a) save the query structure as a temporary file, e.g., in the
server's /tmp directory
b) call "matchmol" with the two filenames as arguments (first file: the
temporary file, second file: the candidate molecule)
c) if the output looks like ":T", the structures match, if the answer
contains ":F", they don't
d) display the hit structure appropriately on a web page which, of
course, is
dynamically generated with PHP
e) after having checked all candidates, delete the temporary file
An alternative option, avoiding the need to write any temporary
file, would
be:
a) store the query structure in a PHP variable
b) read the content of the candidate structure into another PHP variable
c) concatenate the two variables into one new variable, using the
string "$$$$
" as the separator; this gives a valid SD-file
with two
entries, the first one being the query structur (note that the "$"
characters must be double-protected against interpretation by PHP and
the Unix shell by writing the SD-file separator as "\\\$\\\$\\\$\\\$
"
!!)
d) call "matchmol" with "-" as the filename argument and pipe the
SDF content through the utility
e) capture and interpret the output (":T" or ":F", see above)
f) repeat steps b) to e) for all candidates, displaying the hits
adequately
Examples for both variants are given in the following PHP code
snippets:
$result2 = mysql_query("SELECT mol_name
FROM mol WHERE mol_id = $mol_id")
// if using option A, delete the temporary file we have created
before
unlink($tmpfname);
--------------------------------------------------------------
// option B: store needle
and haystack in one single variable in SD-file
// format and pipe it
through matchmol via stdin (note extra backslashes
// to protect the
"$$$$" SDF record delimiter!)
while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
$mol_id =
$line[mol_id];
while
(strlen($mol_id) < 5) { $mol_id = "0" . $mol_id; }
// insert padding zeroes
$hsfname =
"${molfiledir}/${mol_id}.mol";
$hshandle =
fopen($hsfname, "r");
$haystack =
fread($hshandle, filesize($hsfname));
fclose($hshandle);
$qstruct =
$safemol . "\n\\\$\\\$\\\$\\\$\n" . $haystack;
$matchresult = filterThroughCmd("$qstruct", "/usr/local/bin/matchmol
-");
if
(strstr($matchresult,":T") != FALSE) {
// retrieve the molecule's name from the "mol" table
$result2 = mysql_query("SELECT mol_name FROM mol WHERE mol_id =
$mol_id")
or die("Query failed!");
while ($line2 = mysql_fetch_array($result2, MYSQL_ASSOC)) {
$mol_name = $line2[mol_name];
// further fields can go here, if available (nomenclature, CAS reg.
no., etc.)
}
mysql_free_result($result2);
print "<p><b>$mol_name</b>
($mol_id)</p>\n";
..........output of structure formula...........
//
end of hit output
}
}
Better
performance by using matchmol in "burst mode"
$bs =
10;
// block size
$n_cand =
mysql_num_rows($result); // number of candidate
structures
$qstruct =
$safemol;
// query string in SDF format (entry #1 = needle)
$bi =
0;
// counter within block
$n =
0;
// number of candidates already processed
$maxbmem =
0;
// for diagnostic purposes only
while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
$mol_id = $line[mol_id];
while (strlen($mol_id) < 5) { $mol_id = "0"
. $mol_id; }
$hsfname =
"${molfiledir}/${mol_id}.mol";
$hshandle = fopen($hsfname, "r");
$haystack = fread($hshandle,
filesize($hsfname));
fclose($hshandle);
// "burst mode":
// store candidate numbers (mol_id) in array $b
// store query as $qstruct: consists of 1
needle + several haystack molecules (up to $bs)
// store result of matchmol invocation in
array $br
$b[$bi] = $mol_id;
$qstruct = $qstruct . "\n\\\$\\\$\\\$\\\$\n" .
$haystack;
$bi ++;
$n ++;
if (($bi == $bs) || ($n == $n_cand)) {
if (strlen($qstruct) >
$maxbmem) { $maxbmem = strlen($qstruct); }
$matchresult =
filterThroughCmd("$qstruct ", "/usr/local/bin/matchmol -");
$br = explode("\n", $matchresult);
for ($i = 0; $i < $bi; $i++) {
if
(strstr($br[$i],":T") != FALSE) {
$hits ++;
showHit($b[$i]);
}
}
$qstruct = $safemol;
$bi = 0;
}
}
Functional group search
Searching
for molecules containing particular functional groups can be easily
implemented, making use of the "molfg" table containing the functional
group codes of all molecules in the collection. The user is presented
with a selection list (multiple selections should be possible, usually
done with Ctrl-Click or Shift-Click), selection of the "submit" button
on an appropriate form results in an SQL query which returns a list of
molecule identifiers (mol_id), which finally is used to display the
retrieved molecules.
Display of the search result
Of course, the retrieved molecules should be presented to the user
as graphical representations. Depending on the requirements, this can
be accomplished in different ways:
while
loop which retrieves the
hits, see above):
print
"<p><b>$mol_name</b>
($mol_id)</p>\n";
// JME needs MDL
molfiles with the
"|" character instead of linebreaks
$jmehaystack =
strtr($haystack,"\n","|");
echo "<applet
code=\"JME.class\" codebase=\"/classes\" archive=\"JME.jar\" \n";
echo "width=\"250\"
height=\"120\">";
echo "<param name=\"options\"
value=\"depict\"> \n";
echo "<param name=\"mol\"
value=\"$jmehaystack\">\n";
echo "</applet>\n";
// end of hit output
while
loop, with the ID number of the hit
structure as argument. This function could look like this:function showHit($id) {
//global $bitmapdir; // if bitmap pictures are
available (e.g. in GIF format)
global $molfiledir;
$result2 = mysql_query("SELECT mol_name FROM mol WHERE
mol_id = $id")
or die("Query failed");
while ($line2 = mysql_fetch_array($result2, MYSQL_ASSOC)) {
$txt = $line2[mol_name];
}
mysql_free_result($result2); // looking up the
molecule's name is necessary
// if we use matchmol in "burst mode"
print "<p><b>$txt</b> ($id)</p>\n";
// for faster display, we should have GIF files of the 2D
structures
// instead of invoking the JME applet multiple times:
// print "<img src=\"${bitmapdir}/${id}.gif\">\n";
// if no GIFs are available, we must invoking another
instance of JME
// in "depict" mode for structure display of each hit
//$hitmol = `cat ${molfiledir}/${id}.mol`;
$hmfname = "${molfiledir}/${id}.mol";
$hmhandle = fopen($hmfname, "r");
$hitmol = fread($hmhandle,
filesize($hmfname));
fclose($hmhandle);
// JME needs MDL molfiles with the "|" character instead
of linebreaks
$jmehitmol = strtr($hitmol,"\n","|");
echo "<applet code=\"JME.class\" codebase=\"/classes\"
archive=\"JME.jar\" \n";
echo "width=\"250\" height=\"120\">";
echo "<param name=\"options\" value=\"depict\"> \n";
echo "<param name=\"mol\" value=\"$jmehitmol\">\n";
echo "</applet>\n";
}
[Screenshot]Performance issues
Speed-up of SQL queries can be usually done by creation of indices wherever possible (see MySQL documentation). Another issue, however, is the performance of the checkmol/matchmol program, which does the critical part of the job in a structure/substructure search. Especially when a large number of candidate structures has to be subjected to an atom-by-atom match with the query structure, it can be very benefitial to reduce a) the number of matchmol invocations (by using "burst mode", see above) and b) matchmol's time of execution to a minimum. This can be achieved by keeping matchmol from performing the routines which find the set of all rings and determine whether they are aromatic or not. This task can be rather time-consuming in a large molecule with many rings (e.g., the molecular graph of cyanocobalamine contains more than 500 rings!). Instead, matchmol can read this information from slightly tweaked molfiles which can be generated in advance (e.g., whenever a new molecule is entered into the database), again with the "checkmol" program, using the "-m" command-line option. These "tweaked" molfiles are still compliant to MDL's format definition and can be processed by other applications without difficulty. In fact, the aromaticity flag for atoms and bonds is "hidden" as a leading zero ("0") digit in a particular column of the file. This extra information, however, is interpreted only by "matchmol". In any case, we do not have to trust the "original" MDL bond type of "aromatic" which can be defined by the user who draws the molecule (in many cases, aromatic structures are drawn using alternate single and double bonds, so relying just on user-defined "aromatic" bond types would be a high risk). Transformation of the entire content of the molfile directory into this enhanced format can be done very easily by this shell script:#!/bin/sh
# optmol.sh
# transforms standard MDL molfiles into checkmol-enhanced
# molfiles containing aromaticity information irrespective
# of user-defined "aromatic" bond type
cd /data/htdocs/moldb/molfiles
if [ ! -d archive ]; then
mkdir archive
mv *.mol archive
fi
cd archive
for i in *.mol
do
echo $i
/usr/local/bin/checkmol -m $i > ../$i
done
/bin/ash
as the standard shell instead
of /bin/bash
(thanks to E. G. Schmid for this hint). In
most Linux distributions, this can be done by removing the symbolic
link /bin/sh
which points to /bin/bash
and
creating a new one pointing to /bin/ash
. Caution: this might break
some startup scripts of your system. In order to avoid such problems,
you should replace all occurrences of the line "#!/bin/sh
"
by "#!/bin/bash
" in these scripts.