Creating a Web-based, Searchable Molecular Structure Database Using Free Software

Norbert Haider, Department of Drug and Natural Product Synthesis, University of Vienna, Austria
norbert.haider@univie.ac.at


Update: For a fully functional package of free software which was developed for this purpose, please visit the MolDB5R homepage.

The problem ....

Storage and retrieval of chemical structures using a relational database usually requires purchasing one of the commercially available and very powerful software products designed for this purpose, e.g. solutions from MDL Inc. or CambridgeSoft. These database systems are implemented (e.g.) as add-ons for the Oracle database engine (another commercial product), using a relational Oracle SQL database as the underlying storage container, and adding the requisite functionality (e.g. for structure/substructure retrieval) as a so-called "cartridge" which provides an interface to other applications. In a simplified picture, these latter applications then provide the user interface, they can be accessed e.g. via a web browser or via proprietary protocols and clients.

There are also other commercial solutions available, e.g. the JChem package by Chemaxon, which can use any relational SQL database for storage and in which the "chemical  intelligence" is implemented by a set of Java programs running as servlets on a web server. Of course, this is not a complete listing of available chemical database management systems, and all of these products have their particular merits, especially when it comes to handling very large numbers of database entries. However, it is not very surprising that such software packages are not really cheap. On the other hand, for many purposes it would be sufficient to have a web-based chemical database capable of handling somewhat smaller structure collections, e.g. in the range of a few thousand to tenthousands of molecules. Unfortunately, it is not easy to find software (preferably freeware) which permits the creation of relatively simple, web-based molecular structure databases with full structure/substructure search capabilities, although there are promising building blocks available (such as the Chemistry Development Kit which, however, requires installation of a complex Java software bundle).

Therefore, we wrote such a piece of software ourselves. In the following, it will be shown how a standard PC running the Linux operating system, in combination with the Apache web server, the MySQL database system, and PHP as the scripting language can be enhanced by means of a compact command-line program named "checkmol/matchmol" into a searchable chemical structure database.

.... and how it can be solved

Basic considerations

We decided to use a commonly used file format to store the structures of our molecules, and this is the MDL mol format which is well documented and which is properly handled by many programs, including the popular ISIS/Draw structure editor. At the present stage, we assume that structures are created by such a helper application and are transferred to the web server, e.g. by ftp or a Samba network share. Later on, one might consider creating a web-based tool for on-line structure entry.

Whereas other chemical databases use an SQL table to store molecular structures (e.g., in a "TEXT" or "BLOB" field), we initially decided to place the structures just as individual files in the web server's file system, as they can be easily inspected this way by command-line tools such as "less" and "grep". As filenames, we chose unique 5-digit numbers in the range of 00001 through 99999, with the usual file name extension of ".mol". (Note: in the meantime also the other option, storing structures as records in a database table, was found to work very nicely and gives even better performance)

In addition, several MySQL tables are created within a database named (e.g.) "moldb". The central table, named (e.g.) "mol", holds a) the unique entry number (mol_id) of each molecule (which corresponds to the numeric portion of the filename) and b) textual information, such as CAS registry number, chemical nomenclature, stock quantity, and so on. Two other tables, named "molstat" and "molfg", are required for efficient searching (see below).

Setting up the database

The basic setup of the database is very easy, an example is given below. First, a new directory within the web server's document root is created, we named this directory "moldb". This is the place where we store the HTML and/or PHP files which provide the user interface. Within this directory, a subdirectory is created which holds the molecular structure files; we named it "molfiles". Thus, assuming the web server's document root to be "/data/htdocs" (it could also be "/var/www" or something similar), the two new directories would be:
/data/htdocs/moldb
/data/htdocs/moldb/molfiles
All of our MDL molfiles are copied into the latter directory.

Next, the new MySQL database "moldb" is created, then two user accounts are added: one with administrative privileges and one with read-only access (the latter one will be used as a proxy-user account by the PHP script which provides the web-based user interface).

As MySQL administrative user (e.g., "root"), enter the following commands within the MySQL console (of course, you should choose better passwords than these examples):

CREATE DATABASE moldb;
GRANT ALL ON moldb.* TO "moldb-admin"@"localhost" IDENTIFIED BY "topsecret";
GRANT SELECT ON moldb.* TO "moldb-user"@"localhost" IDENTIFIED BY "secret";

Next, the table "mol" is created and populated with entries. We do this with a simple shell script ("mkmol.sh", see below) which scans the specified molfile directory and adds an entry into the "mol" table for each molecule it finds. Moreover, the script extracts the molecule's name from the first line of the molfile (which can contain a comment) and places this string into the field "mol_name" in the "mol" table. Note: as the script contains the plain-text password, make it unreadable for other users.

#!/bin/sh
# mkmol.sh,   N. Haider, 2003
molfiledir=/data/htdocs/moldb/molfiles
uid="moldb-admin"
pw="topsecret"
cd $molfiledir
# first, we drop the table "mol" if it already exists
result=`echo "USE moldb; DROP TABLE mol ;" | mysql -u $uid --password="$pw"`
echo $result
# then we create a new table "mol"
result=`echo "USE moldb; CREATE TABLE mol (mol_id INT(11), mol_name VARCHAR(60), \
mol_nomencl VARCHAR(80), mol_cas VARCHAR(60), KEY mol_id (mol_id)) ;" | \
mysql -u $uid --password="$pw"`
echo $result
#now we look for all files having a 5-digit number + extension ".mol" as filename
for i in [0-9][0-9][0-9][0-9][0-9].mol
do
  mol_id=`echo $i |sed -e s/\.mol//g`
  mol_name=`head -1 < $i`
  echo $mol_id:$mol_name
  result=`echo "USE moldb; INSERT INTO mol VALUES ($mol_id, \"$mol_name\", \"\", \"\" ) ; "| \
  mysql -u $uid --password="$pw"`
  # echo $result
done

Creating auxiliary tables for search operations

For efficient structure/substructure searching, it is essential to have a set of structural feature descriptors ("screens", "fingerprints", "keys", etc.) which have to be generated when a new molecule has been placed into the collection and which are stored in another database table ("molstat"). This is the first time our utility program "checkmol" comes into effect. By invoking checkmol with the command-line option "-X" and the filename of the molfile as arguments, a set of such descriptors is generated and written to standard output. From there, it can be captured and placed in the appropriate table row. In fact, we use a Perl script (mkmolstat.pl) to scan the whole molfile directory and to create a "molstat" table, containing the mol_id and the values of the descriptors as integers.

Analogously, we make use of checkmol's capability to recognize a large number of functional groups in a given molecule. Our molecular structure database thus can be enhanced by a "functional group" search option. For this purpose, another table ("molfg") is created and populated by running an appropriate Perl script (mkmolfg.pl).

Query input tools

For creating the query structure, the Java applet JME is used. JME was developed by Peter Ertl at Novartis and can be obtained from the author on request (free of charge). The applet can be easily embedded into a web page; documentation and several examples are available on the JME homepage. The output of JME can be SMILES as well as a JME native format or MDL molfile. We use the latter to prepare our query.

[Screenshot]

Structure pre-selection with checkmol

Having obtained the MDL molfile representation of the input structure (after the user has selected the "Submit" button of an appropriate form), we need "checkmol" again. Now, the structural descriptors of the query molecule are generated by "piping" the molfile through the "checkmol" command (this time with the "-x" option) and capturing its output. In PHP, this can be done by the popen() function, which we use as the core element of a function named filterThroughCmd() (for details, check the PHP documentation for popen).

function filterThroughCmd($input, $commandLine) {
  $pipe = popen("echo \"$input\"|$commandLine" , 'r');
  if (!$pipe) {
   print "pipe failed.";
   return "";
  }
  $output = '';
  while(!feof($pipe)) {
    $output .= fread($pipe, 1024);
  }
  pclose($pipe);
  return $output;
}

Having obtained the query molfile as the PHP variable $mol (via GET or POST from the calling PHP script), we use the above function to do the first check, after passing the content of $mol through the PHP escapeshellcmd() function (for security reasons).

$safemol = escapeshellcmd($mol);
// A previous version of JME did not support explicit hydrogens, so we used the Helium
// symbol (He) to define an explicit H  ==> "He" had to be translated back into "H " prior
// to further processing by checkmol/matchmol (this is now obsolete).
//$safemol = str_replace("He","H ",$safemol);


if ($mol !='') {
  // first step: get the molecular "fingerprint" of the input structure
  // by piping it through the checkmol program, then do a search
  // in molstat ==> this gives a list of candidates
  $chkresult = filterThroughCmd("$safemol", "/usr/local/bin/checkmol -x - "); 
  ........(some further processing, see below)............
}

From the output which is stored in the variable $chkresult, an SQL query string is assembled, using basic PHP statements (str_replace). The final SQL query string would look like "SELECT mol_id FROM molstat WHERE n_atoms>=12 AND n_bonds>=13 AND n_rings>=2 AND ....". When this query is submitted to the MySQL database, a list of candidates is returned which all meet the defined minimum requirements (e.g., number of rings containing two nitrogens, number of halogen atoms, number of sp2-hybridized oxygens, etc.). From this list, we must make our final selection by an atom-by-atom comparison of the input structure with the respective candidate structure.

Structure matching with matchmol

For this purpose, we use the "matchmol" incarnation of our utility program. It compares the query structure with each of the candidate structures and returns "true" or "false", depending on the match result (is the "needle" structure contained in the "haystack" structure or not?). In principle, this operation can be performed by a PHP script as follows:

a) save the query structure as a temporary file, e.g., in the server's /tmp directory
b) call "matchmol" with the two filenames as arguments (first file: the temporary file, second file: the candidate molecule)
c) if the output looks like ":T", the structures match, if the answer contains ":F", they don't
d) display the hit structure appropriately on a web page which, of course, is dynamically generated with PHP
e) after having checked all candidates, delete the temporary file

An alternative option, avoiding the need to write any temporary file, would be:

a) store the query structure in a PHP variable
b) read the content of the candidate structure into another PHP variable
c) concatenate the two variables into one new variable, using the string "$$$$" as the separator; this gives a valid SD-file with two entries, the first one being the query structur (note that the "$" characters must be double-protected against interpretation by PHP and the Unix shell by writing the SD-file separator as "\\\$\\\$\\\$\\\$" !!)
d) call "matchmol" with "-" as the filename argument and pipe the SDF content through the utility
e) capture and interpret the output (":T" or ":F", see above)
f) repeat steps b) to e) for all candidates, displaying the hits adequately

Examples for both variants are given in the following PHP code snippets:

  // option A: store the "needle" molecule as a temporary file
  // and call matchmol with 2 filenames as arguments (file 1: needle,
  // file 2: haystack)

 
$tmpfname = tempnam("/tmp", "mm");
  $handle = fopen($tmpfname, "w");
  fwrite($handle, $mol);
  fclose($handle);
  while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
    $mol_id = $line[mol_id];
    while (strlen($mol_id) < 5) { $mol_id = "0" . $mol_id; }  // insert padding zeroes
    $hsfname = "${molfiledir}/${mol_id}.mol";
    $hshandle = fopen($hsfname, "r");
    $haystack = fread($hshandle, filesize($hsfname));
    fclose($hshandle);
   
    // needle and haystack are stored in two individual files
    $matchresult = `/usr/local/bin/matchmol $tmpfname ${molfiledir}/${mol_id}.mol`;
    if (strstr($matchresult,":T") != FALSE) {
      // retrieve the molecule's name from the "mol" table
      $result2 = mysql_query("SELECT mol_name FROM mol WHERE mol_id = $mol_id")

        or die("Query failed!");   
      while ($line2 = mysql_fetch_array($result2, MYSQL_ASSOC)) {
        $mol_name = $line2[mol_name];
        // further fields can go here, if available (nomenclature, CAS reg. no., etc.)
      }
      mysql_free_result($result2);
      print "<p><b>$mol_name</b> ($mol_id)</p>\n";   
      ..........output of structure formula...........
      // end of hit output
    }
  // if using option A, delete the temporary file we have created before
  unlink($tmpfname);

--------------------------------------------------------------

  // option B: store needle and haystack in one single variable in SD-file
  // format and pipe it through matchmol via stdin (note extra backslashes
  // to protect the "$$$$" SDF record delimiter!)

  while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {

    $mol_id = $line[mol_id];
    while (strlen($mol_id) < 5) { $mol_id = "0" . $mol_id; }
  // insert padding zeroes
    $hsfname = "${molfiledir}/${mol_id}.mol";
    $hshandle = fopen($hsfname, "r");
    $haystack = fread($hshandle, filesize($hsfname));
    fclose($hshandle);

    $qstruct = $safemol . "\n\\\$\\\$\\\$\\\$\n" . $haystack;
    $matchresult = filterThroughCmd("$qstruct", "/usr/local/bin/matchmol -");

    if (strstr($matchresult,":T") != FALSE) {
      // retrieve the molecule's name from the "mol" table
      $result2 = mysql_query("SELECT mol_name FROM mol WHERE mol_id = $mol_id")
        or die("Query failed!");   
      while ($line2 = mysql_fetch_array($result2, MYSQL_ASSOC)) {
        $mol_name = $line2[mol_name];
        // further fields can go here, if available (nomenclature, CAS reg. no., etc.)
      }
      mysql_free_result($result2);

      print "<p><b>$mol_name</b> ($mol_id)</p>\n";   
      ..........output of structure formula...........
      // end of hit output
    }
  }

Better performance by using matchmol in "burst mode"

Both of the above options have the disadvantage that matchmol must be separately invoked for each candidate structure, which can consume a lot of time. Performance can be significantly enhanced, however, by using matchmol in "burst mode". This makes use of matchmol's capability to perform the atom-by-atom match also with SDF files containing multiple molecules. Similar to option B (see above), we prepare a query string containing the needle structure as the first entry and the haystack which now consists of several molfiles (again separated by "$$$$"). The maximum number of haystack molecules in one chunk is controlled by the variable $bs (blocksize). A blocksize of 10 therefore will reduce the number of calls to the external program by a factor of about 10.
Of course, there is a little bit more PHP overhead required in this case, because we have to parse matchmol's output in order to determine the ID of each hit structure. These candidate and ID numbers are stored temporarily in an array ($b), the matchmol results end up in another array ($br). Moreover, the blocksize should be chosen low enough to keep the overall size of the query string safely below 128 KB. For average drug-like molecules in the database, a $bs value of 10 should be fine.

The following PHP code snippet illustrates how this "burst" approach works:

  $bs      = 10;                          // block size
  $n_cand  = mysql_num_rows($result);     // number of candidate structures
  $qstruct = $safemol;                    // query string in SDF format (entry #1 = needle)
  $bi      = 0;                           // counter within block
  $n       = 0;                           // number of candidates already processed
  $maxbmem = 0;                           // for diagnostic purposes only
   
  while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
    $mol_id = $line[mol_id];
    while (strlen($mol_id) < 5) { $mol_id = "0" . $mol_id; }
    $hsfname = "${molfiledir}/${mol_id}.mol";
    $hshandle = fopen($hsfname, "r");
    $haystack = fread($hshandle, filesize($hsfname));
    fclose($hshandle);
   
    // "burst mode":
    // store candidate numbers (mol_id) in array $b
    // store query as $qstruct: consists of 1 needle + several haystack molecules (up to $bs)
    // store result of matchmol invocation in array $br
    $b[$bi] = $mol_id;
    $qstruct = $qstruct . "\n\\\$\\\$\\\$\\\$\n" . $haystack;   
    $bi ++;
    $n ++;
    if (($bi == $bs) || ($n == $n_cand)) {
      if (strlen($qstruct) > $maxbmem) { $maxbmem = strlen($qstruct); }
      $matchresult = filterThroughCmd("$qstruct ", "/usr/local/bin/matchmol -");
      $br = explode("\n", $matchresult);
      for ($i = 0; $i < $bi; $i++) {
        if (strstr($br[$i],":T") != FALSE) {
          $hits ++;
          showHit($b[$i]);
        }
      }
      $qstruct = $safemol;
      $bi = 0;
    }
  }

Functional group search

Searching for molecules containing particular functional groups can be easily implemented, making use of the "molfg" table containing the functional group codes of all molecules in the collection. The user is presented with a selection list (multiple selections should be possible, usually done with Ctrl-Click or Shift-Click), selection of the "submit" button on an appropriate form results in an SQL query which returns a list of molecule identifiers (mol_id), which finally is used to display the retrieved molecules.

[Screenshot]

Display of the search result

Of course, the retrieved molecules should be presented to the user as graphical representations. Depending on the requirements, this can be accomplished in different ways:

A simple structure display routine, using the JME applet could look like this (within the while loop which retrieves the hits, see above):

      print "<p><b>$mol_name</b> ($mol_id)</p>\n";   

      // JME needs MDL molfiles with the "|" character instead of linebreaks
      $jmehaystack = strtr($haystack,"\n","|");
      echo "<applet code=\"JME.class\" codebase=\"/classes\" archive=\"JME.jar\" \n";
      echo "width=\"250\" height=\"120\">";
      echo "<param name=\"options\" value=\"depict\"> \n";
      echo "<param name=\"mol\" value=\"$jmehaystack\">\n";
      echo "</applet>\n";
      // end of hit output

Preferentially, one would use a dedicated PHP function which is called from within the while loop, with the ID number of the hit structure as argument. This function could look like this:

function showHit($id) {
  //global $bitmapdir;  // if bitmap pictures are available (e.g. in GIF format)
  global $molfiledir;
  $result2 = mysql_query("SELECT mol_name FROM mol WHERE mol_id = $id")
    or die("Query failed");
  while ($line2 = mysql_fetch_array($result2, MYSQL_ASSOC)) {
    $txt = $line2[mol_name];
  }
  mysql_free_result($result2);  // looking up the molecule's name is necessary
                                // if we use matchmol in "burst mode"
  print "<p><b>$txt</b> ($id)</p>\n";
  // for faster display, we should have GIF files of the 2D structures
  // instead of invoking the JME applet multiple times:
  // print "<img src=\"${bitmapdir}/${id}.gif\">\n";

  // if no GIFs are available, we must invoking another instance of JME
  // in "depict" mode for structure display of each hit

  //$hitmol  = `cat ${molfiledir}/${id}.mol`;
  $hmfname   = "${molfiledir}/${id}.mol";
  $hmhandle  = fopen($hmfname, "r");
  $hitmol    = fread($hmhandle, filesize($hmfname));
  fclose($hmhandle);

  // JME needs MDL molfiles with the "|" character instead of linebreaks
  $jmehitmol = strtr($hitmol,"\n","|");
     
  echo "<applet code=\"JME.class\" codebase=\"/classes\" archive=\"JME.jar\" \n";
  echo "width=\"250\" height=\"120\">";
  echo "<param name=\"options\" value=\"depict\"> \n";
  echo "<param name=\"mol\" value=\"$jmehitmol\">\n";
  echo "</applet>\n";
}

[Screenshot]

Performance issues

Speed-up of SQL queries can be usually done by creation of indices wherever possible (see MySQL documentation). Another issue, however, is the performance of the checkmol/matchmol program, which does the critical part of the job in a structure/substructure search. Especially when a large number of candidate structures has to be subjected to an atom-by-atom match with the query structure, it can be very benefitial to reduce a) the number of matchmol invocations (by using "burst mode", see above) and b) matchmol's time of execution to a minimum. This can be achieved by keeping matchmol from performing the routines which find the set of all rings and determine whether they are aromatic or not. This task can be rather time-consuming in a large molecule with many rings (e.g., the molecular graph of cyanocobalamine contains more than 500 rings!). Instead, matchmol can read this information from slightly tweaked molfiles which can be generated in advance (e.g., whenever a new molecule is entered into the database), again with the "checkmol" program, using the "-m" command-line option. These "tweaked" molfiles are still compliant to MDL's format definition and can be processed by other applications without difficulty. In fact, the aromaticity flag for atoms and bonds is "hidden" as a leading zero ("0") digit in a particular column of the file. This extra information, however, is interpreted only by "matchmol". In any case, we do not have to trust the "original" MDL bond type of "aromatic" which can be defined by the user who draws the molecule (in many cases, aromatic structures are drawn using alternate single and double bonds, so relying just on user-defined "aromatic" bond types would be a high risk). Transformation of the entire content of the molfile directory into this enhanced format can be done very easily by this shell script:

#!/bin/sh
# optmol.sh
# transforms standard MDL molfiles into checkmol-enhanced
# molfiles containing aromaticity information irrespective
# of user-defined "aromatic" bond type

cd /data/htdocs/moldb/molfiles
if [ ! -d archive ]; then
  mkdir archive

  mv *.mol archive
fi
cd archive
for i in *.mol
do
  echo $i
  /usr/local/bin/checkmol -m $i > ../$i
done

Further options to improve performance

Conclusion

Based on a standard Linux/Apache/MySQL/PHP platform, it is possible to create a web-based chemical structure database with full structure/substructure search capabilities and, additionally, a powerful functional group search option. The "chemical intelligence" is simply added by a compact command-line utility program which is freely available under the GNU General Public License. Although the code examples given in this document (and the related scripts available for download) are not meant as a complete software package nor as detailed installation instructions, they should illustrate the principal steps to be taken in order to set up such a database system.
Update: in the meantime, a complete software package including installation instructions is available for download (see below)!


Download

A fully functional set of example PHP scripts together with auxiliary Perl scripts for data import from an SD file is available for download. In this version, molecular structures are stored as BLOB records in a MySQL table (rather than as individual files).

Download link: http://merian.pch.univie.ac.at/pch/download/chemistry/moldb/moldb5.tar.gz
Detailed instructions for installation, configuration, and operation can be found in the file INSTALL which is contained in this package and on the MolDB5R homepage.

An older version in which structures are stored as individual files is available on request.

Any feedback will be appreciated.

NH, 2003-12-04; last update: 2010-06-15


This page has been translated into Spanish language by Maria Ramos from Webhostinghub.com.