MolDB5R
a free software package for creating a web-based, searchable molecular
structure/reaction database
Norbert Haider, University of
Vienna, 2010-2013
norbert.haider@univie.ac.at
These instructions explain how to set up MolDB5R as a web-based, fully
searchable
molecular structure database. Structures and data can be added via a
web interface or by import from an MDL SD file (at the server
command-line). This import option
is shown by an example, using the SD files which are freely
available from the PubChem FTP site at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/.
Analogously, reactions and data can be imported from an MDL RD file.
For some information about the original ideas, please visit http://merian.pch.univie.ac.at/~nhaider/cheminf/moldb.html,
additional detailed information can be found in this article: Haider,
N., Functionality Pattern Matching as an Efficient Complementary
Structure/Reaction Search Tool: an Open-Source Approach. Molecules, 15, 5079-5092
(2010). For citation in scientific publications, please use this reference rather than the project homepage.
The previous version, MolDB5 (without reaction support), is documented
at
http://merian.pch.univie.ac.at/~nhaider/cheminf/moldb5doc.html
MolDB5R is a collection of fully functional PHP scripts for running a
structure/reaction database with search options for text, functional
groups, and
structure/substructure. Included are also some Perl scripts for
database setup and data import
from an SD or RD file.
Download: http://merian.pch.univie.ac.at/pch/download/chemistry/moldb/moldb5r.tar.gz
Content
Features
Technical background
Installation
Using MolDB5R
Administration of MolDB5R
License
Appendix A (permissions
and privileges)
Appendix B (configuration
file settings)
Appendix C (performance
tuning)
Features
- Structure+data import from SD files, export to SD files
- Reaction+data import from RD files, export to RD files
- Text search
- Browse functioanlity
- Functional group search
- Extended functional group search in reaction data
collections (creation or loss of groups)
- Structure/substructure/similarity search in structure data
collections
- Structure/substructure search in reaction data collections
- Support for multiple data collections within one MySQL
database
- All search operations (text, functional groups,
substructure) can span multiple data collections
- Web-based administration tool for data collection
management, structure and data import/export/editing
- Support for Windows as operating system (experimental!) in
addition to Linux and Mac OS X.
Technical
background
The system uses MySQL as the database engine, whereas its "chemical
intelligence" is provided by the checkmol/matchmol program (or its
socket-based server version, cmmmsrv). All relevant data are stored in
MySQL tables, except the bitmap graphics of the molecules.
Structure/substructure searches are performed in a two-stage process,
consisting of an efficient pre-selection and a full atom-by-atom
comparison of the remaining candidates. For pre-selection, a
combination of molecular statistics with dictionary-based fingerprints
and hash-based fingerprints is used. Optionally, this operation can be
performed with memory-based MySQL tables instead of disk-based tables.
This will make the search even faster. In a test installation on a
standard PC (AMD Athlon 1.6 GHz CPU, 1.5 GB of memory), substructure
searches in a database of 500.000 compounds are typically complete
within a few seconds.
Demo
MolDB5R database
Installation
Hardware requirements:
any x86 PC or Mac (the faster, the better)
Software requirements:
- Linux
operating system (any distribution), Mac OS X or Windows (experimental!)
- Apache
web server 2.x
- MySQL
5.x or higher
- PHP
5.x with MySQL support
- Perl
5.x with the MySQL database driver module
==> these packages
are standard components of every Linux distribution (for Windows, the
XAMPP package can be used, for Mac OS X, the MAMP package is a good
choice)
moreover:
- checkmol/matchmol,
v0.4c or higher
==> open-source software, available from http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html
For faster operation, it is recommended to use the socket-based server
version of checkmol/matchmol (named cmmmsrv) instead.
- JME,
a Java applet for structure input, freely available
from the author, Peter Ertl (Novartis). Please visit http://www.molinspiration.com/jme/getjme.html
if you want to create bitmap graphics of your molecules:
- mol2ps/mol2svg
(version 0.2 or higher)
==> open source software, available from http://merian.pch.univie.ac.at/~nhaider/cheminf/mol2ps.html
if you want bitmap graphics also for reactions, mol2ps version 0.2 or
higher is required
- GhostScript
==> open source software, available from http://www.cs.wisc.edu/~ghost/
(usually included in the Linux distribution;for Mac OS X, it can be
most easily installed as part of the MacTeX package)
if you want to create SVG (scalable vector graphics) images of your
molecules:
- mol2ps/mol2svg
(version 0.4
or higher)
(see above)
Note: this package was
developed and
tested on a Linux machine, but
with a few adaptations it should run also on a FreeBSD
system.
Under Mac OS X (10.6 aka Snow Leopard), MolDB5R runs without any
modifications, but requires some non-trivial software installation:
GhostScript (see above) and the
Perl DBI and DBD::mysql modules. The Windows
development and (some) testing
was done on a virtualized Windows XP system with the XAMPP 1.7.0
package. Instead of the Perl component of XAMPP, ActivePerl 5.6.1 was
used and the missing database driver was installed via the PPM utility (ppm
install DBI
; ppm install DBD::MySQL
).
Where to start?
- Make sure your web server is working correctly with PHP
support (including the php5-mysql module).
- Check if MySQL is working correctly and the administrative
user (usually "root") has a password.
- Download and install checkmol/matchmol and mol2ps/mol2svg,
preferentially in
/usr/local/bin
(Note: you
do not have to compile
these programs yourself, there are pre-compiled binaries available for
Linux, Mac OS X and Windows).
- Install Ghostscript (if not already installed)
- Create a working directory (e.g., within your home
directory)
- Create a directory within the web server's
document root
directory and make sure you have write permissions there. This will be
the "MolDB5R web directory" which holds the PHP scripts etc.; create
the two subdirectories "admin" and "bitmaps" within the MolDB5R web
directory.
After downloading and unpacking the MolDB5R package in your working
directory, rename one of the three sample configuration files
(moldb5.conf.sample_linux, moldb5.conf.sample_macosx or
moldb5.conf.sample_windows) into
"moldb5.conf
". Load this file into a text
editor and adjust the
settings. A detailed overview of all parameters is given
in Appendix B.
Select a name for your database, define user IDs for a proxy
user with write privileges and another proxy user with
read-only privileges (the example file "moldb5.conf"
already contains these entries which can be accepted or modified;
please change at least the passwords!).
If you want to use bitmap graphics of your
molecules (highly recommended), please specify a directory where
these bitmaps should be stored and make sure you have
write permission there. This should be the final location of the .png
files
(somewhere within your web server's document root, preferentiall as a
subdirectory "bitmaps" of your MolDB5R web directory). If you don't
want
bitmaps, just set $bitmapdir
to
an empty string ("").
Initialize
the MolDB5R database
by running the Perl script "initdb.pl" (enter "perl
initdb.pl
", be sure to be in the same directory as the
script). You
will be prompted for the password of the MySQL administrator
(typically, this is the "root" user). The script will create the
appropriate MySQL database and two tables within this database (one for
metadata and one for the fragment dictionary). If everything goes well,
the script will produce this output:
setup of database $database OK
(where $database displays the name you have specified)
setup of fragment dictionary OK
Import of structures+data
from an SD file (*.sdf):
As an example, we use Compounds_000000001_000025000.sdf
(after unpacking the .gz file which you can download from the PubChem
FTP
site).
If you have wget installed, you can retrieve the PubChem file very
conveniently by entering e.g.
wget
ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_000000001_000025000.sdf.gz
Prerequisite: the checkmol/matchmol program must be installed (e.g., in
/usr/local/bin
, for more information, see
http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html).
These steps
should be executed exactly in this order:
- Change into the util/ subdirectory and run the Perl script
"sdfcheck.pl" with the name of the SD
file to be imported as the command-line
argument, e.g. "
perl sdfcheck.pl
Compound_000000001_000025000.sdf
" (add a leading "../" or
"../../" etc. to the SD filename if necessary). This
will create a definition file "sdf2moldb.def". By
default, only the first 10000 records are
inspected which should be sufficient in order to collect
the necessary information about the file structure.
- Load the file "sdf2moldb.def" (see above) into a text
editor and inspect it. Here, you can change the
types of MySQL columns or their dimensions. In
addition,
you MUST
rename the most descriptive field name (preferentially the one holding
the molecule's trivial name or its systematic
name) into "mol_name".
If you do not do this, you will have
to manually adjust all the PHP scripts in order to retrieve
the correct field. In our PubChem example, change the MySQL field
name (i.e., the 2nd column, between the first and the second
":") "pubchem_iupac_name" into "mol_name". Many numeric fields are
assumed to be
floating-point numbers (MySQL type "DOUBLE"), but in fact are integer
numbers, so their type should be changed into (e.g.)
"INT(11)". In our PubChem example, only two fields should be kept as
"DOUBLE": "pubchem_openeye_mw" and "pubchem_cactvs_xlogp". If you do
not need one or more of the recognized
SDF fields in your MySQL table, simply delete the
corresponding lines from the definition file. Make sure that there is
no field named "mol_id", because this is a reserved name of the MolDB5R
system.
- Import the content of your SD file into your MySQL database
by running the Perl script "sdf2moldb.pl" with the
name of the SD file to be imported as the command-line
argument, e.g. with "
perl sdf2moldb.pl
Compounds_000000001_000025000.sdf
".
Depending on the number of molecules in the SD
file, this operation may take some time (on a standard business PC,
approx. 1-5 h for 100000 records). It is recommended to do
this operation in a background job, most conveniently
in a "screen" session (screen is a utility program
which is included in most Linux distributions).
Note: you can create bitmap images of your structures
either during the initial data import (set "$make_bitmaps
= 1;
" in the sdf2moldb.pl script) or later (set "$make_bitmaps =
0;
").
In the latter case, data import will be significantly faster (about 3
times), bitmap files can then be created by running the updatebitmap.pl
script in the util/ subdirectory.
- If you want to add new records, repeat step 3 with another
SD file. By default, all data will be appended. ATTENTION:
all
SD files must have the same format (names and types
of data fields) as the first one. For example, you can
sequentially import all of the SD files from the PubChem site
(see above). (And don't forget to let me know your performance
figures for a database with >40 million
structures....) ;-)
- If you want to create a new data collection, repeat the
steps
described above, but change the "db_id=1" line in the sdf2moldb.def
file into "db_id=2" (or higher, if applicable).
Import of reactions+data
from an RD file (*.rdf)
Reactions with associated data can be imported in a very similar manner
as described for SDF import (see above). Please note that in the
present version, only "flat" (non-hierarchical) RD files can be handled.
- Run the Perl script "rdfcheck.pl" in the
util/ directory with the name of the RD
file to be imported as the command-line
argument.
- Edit the resulting definition file "rdf2moldb.def". Define
a
field "rxn_name" and make sure that there is no field "rxn_id". Make
sure that the db_id is set to a number which is still available.
- Import the RD file by running the Perl script
"rdf2moldb.pl" with
the RD file name as the command-line argument. Bitmap and/or SVG
depiction is provided in full analogy to SDF import.
Next, the web server part can be
installed:
- Of course, you must know the location of
your web server's
document root directory. This might be something like "/var/www/htdocs"
or
"/xampp/htdocs", etc. (in a typical Apache installation, it is
specified in the configuration file "httpd.conf" or in one of its
include files). On many websites, a
special subdirectory (e.g, "public_html") witin a user's home directory
can also be a valid location.
- Within the document root, you should have already created a
directory which will hold the MolDB5R PHP scripts, e.g. "moldb5r".
Usually, the "bitmaps" directory will be there as a subdirectory (e.g.,
"moldb5r/bitmaps"). Create another subdirectory named "admin" (e.g.,
"moldb5r/admin").
- Copy all the files from
the directory "wwwfiles" of this package to the MolDB5R web directory,
copy also all of the scripts in "wwwfiles/admin" to the "admin"
subdirectory of the web directory.
- The configuration file "moldb5conf.php" should contain
exactly
the same
values for all parameters as the setup
configuration file "moldb5.conf". In principle, "moldb5conf.php" is
nothing else than an exact copy of "moldb5.conf" with just two extra
lines (one at the beginning: "<?php ", and one at the end:
"?>").
For testing, this file can be placed also in the MolDB5R web directory.
In a publicly accessible production environment, however, this file
should be placed outside
the web server's document root, somewhere in the PHP include path
(check your php.ini file and/or the PHP docs). Keeping the
"moldb5conf.php" file within the document root is a security risk, as
it
contains plain-text passwords. Make sure the web server can read this
file and revoke read permission for other users. If you are using
bitmap graphics, make sure that the variable
$bitmapURLdir
points to the correct location, e.g. "/moldb5r/bitmaps" (no trailing
slash!).
- Make sure you have the JME Java applet installed. The file
"JME.jar" should be placed either in the same directory as the main
MolDB5R
PHP
scripts scripts or in a publicly accessible directory specified in the
$java_codebase
variable in the configuration file.
Note 1: it is
assumed that PHP is
not
running in "safe mode". If safe_mode is set to On in the php.ini
file, you have to make sure that checkmol/matchmol is allowed
to
be executed, this can be done by specifying the corresponding directory
in the variable safe_mode_exec_dir
in php.ini
(please check your PHP
documentation).
Note 2: starting with version 5R.01, MolDB supports other
character sets in addition to latin1 (which is still the default).
Please set the $charset
variable in the
configuration file(s) to either
"latin1", "latin2" or "utf8".
The subdirectory "util"
of the working directory contains some Perl scripts:
- initfpdef.pl (re-)initializes the fragment dictionary by
loading the structures stored in fp01.sdf into the appropriate table.
- similar to sdfcheck.pl and sdf2moldb.pl (see above), there
are
the scripts rdfcheck.pl and rdf2moldb.pl for analysis and import of
reactions from MDL RD files
- mkmolstat.pl (re-)generates the molstat table(s) (which
contains the molecular "statistics") by retrieving all structures and
analyzing them with checkmol
- mkmolfgb.pl (re-)generates the molfgb table(s) (which
contains
information about the functional groups) by retrieving all structures
and analyzing them with checkmol
- mkmolcfp.pl (re-)generates the molcfp table(s) (which
contains
the molecular "binary fingerprints", i.e., a combination of
dictionary-based and hash-based fingerprints) by retrieving all
structures and comparing them with the fragment dictionary which is
stored
in the fpdef table and in the fp01.sdf file; this is done with
matchmol, using the "-F" option; hash-based fingerprints are generated
by invocation of checkmol with the "-H" option.
- For reaction data collections, there are the analogous
scripts
mkrxnfgb.pl and mkrxncfp.pl (there is no mkrxnstat.pl, as the molstat
descriptors are not used for reactions).
- updatebitmap.pl creates bitmap images in png format within
the
bitmaps subdirectory for all new structures, using the pic2d
MySQL
table for housekeeping. If this script is run via a cron job (e.g.,
every night), the location of the configuration file should be
specified by its absolute path (in the line of the script starting with
"$configfile = ").
- updatesvg.pl creates SVG (scalable vector graphics) images
and stores them in the pic2d MySQL table. If this script is run via a
cron job (e.g., every night), the location of the
configuration file should be specified by its absolute path
(in
the line of the script starting with "$configfile = ").
The mk... scripts should be run whenever a new version of
checkmol/matchmol (or cmmmsrv, respectively) is installed on your
system.
The script cp2mem.pl will copy the persistent (disk-based) molstat and
molcfp MySQL
tables to memory-based MySQL tables, which are accessed
faster than the disk-based tables. In order to make use of these
tables, the moldb_meta table has to be modified (the "usemem" column
must be set to 'T'), this can be done with the Perl script "setmem.pl"
with an appropriate command-line argument, e.g. "perl
setmem.pl
1=T" (enabling memory-based tables for data collection 1). Do
not forget to run cp2mem.pl after enabling memory-based tables and
after each reboot of your machine,
otherwise the memory-based tables would be empty. Generally, this
option is recommended only for larger read-only data
collections.
There are also utility scripts for diagnostic purposes (listdb.pl),
(re-)creation of definition files from the MySQL table structure
(dumpdef.pl) and data export into SD or RD files (moldb2sdf.pl,
moldb2rdf.pl). All these scripts contain a short description of their
functionality.
That's it for the installation part. You may wish to adjust
the HTML layout of the index.php and moldb.css files and the PHP
scripts, replacing "MolDB5R" by something else. Especially, the script
"custom-inc.php" should be adopted to individual needs. If you
write any useful
extensions, I would be grateful
if you donate these scripts to the project. And a little link to the checkmol/matchmol
homepage would also be appreciated.
Using MolDB5R
Start page
On the start page (index.php), the user can select the data
collection(s) from the list box and clicking the "Apply selection"
button. Multiple selections (if enabled in the configuration file) can
be made by pressing the <Ctrl> key while clicking at the
list
item. Do not forget to click the "Apply selection" button. All search
options are available via the navigation bar at the
top
of the page. A link to the administration page is displayed at the
bottom (it may be advisable to remove this link...).
Browse
All selected
data
collections are available for browsing on this page, using standard
navigation elements. Compound structures are displayed graphically
(using bitmap or SVG pictures, if available), together with compound or
reaction names and
the ID number (mol_id or rxn_id). The latter represents a hyperlink to
a detailed
view of this particular entry (provided by the script "details.php").
Please note that inline SVG requires a fairly recent web browser (e.g.
any current Firefox, Chrome, Safari or [if you really want to use
an MS product] Internet Explorer 9 or higher).
Text search
There is a very simple text search tool which by default only searches
the
"mol_name" (and/or "rxn_name")
field. Search terms are not case-sensitive, as a wildcard character you
can use the "%" or "*" symbol. By selecting the option "include other
searchable fields", you can extend the search to all text fields which
have been defined as "searchable" by the administrator (see below).
Starting with MolDB5R.03, an advanced text search option is available
which allows entering search terms for each individual searchable data
field (these terms can be combined with logical AND or OR operators).
By nature, the advanced text search feature can be used only when a
single data collection is selected by the user.
Functional group search
This search option makes use of checkmol's capabilities to recognize
approx. 200 different functional groups. Categorisation of compounds is
performed already when they are entered into the database. The
properties are stored in binary format in a special table (molfgb), and
the search operation is extremely fast. Multiple selection of
functional groups from the list box can be made by pressing the
<Ctrl> key
while clicking at the desired item.
As soon as a reaction
data collection is selected, there are extended search options
available: the user can define whether the selected functional group(s)
should be present in any reactant or in any product or should be
created or lost during the reaction.
Structure search
This is the central search facility, permitting exact search,
substructure search, and similarity search. In the browser, Java and
Javascript must be enabled in order to display the JME applet. The
input structure must contain at least three atoms. Checkboxes are
available for more specific search modes (strict atom/bond type
comparison, geometry checks) if a search results in too many hits.
Structure search works also for reaction data collections. As an
alternative to drawing the query structure, it can be entered as text
in a pop-up window (in MDL molfile format).
Reaction structure search
As a new feature of MolDB5R, reaction data collections can be searched
for structures or substructures (no similarity search option). User
input is drawn in the JME applet and can be either one structure (as
reactant or as product) or multiple structures, e.g., one reactant and
one product (use the "new" button in JME to start a new molecule).
There is also basic support for atom mapping (provided the
corresponding mappings are present in the data collection).
Please note that reaction structure search is considerably slower than
"normal" structure search.
Administration of
MolDB5R
Data collection handling
Apart from data import from SD or RD files, new entries can be added
via the
administration frontend in the "admin" subdirectory. The main page
("admin/index.php") is used for adding new data collections and
editing/erasing existing ones. Besides the (recommended)
password
protection of this directory, the "trustedIP" setting restricts access
to these functioanlities (for a more detailed description of
privileges, see below).
On this page, only one data collection can be selected at a time. All
operations
(edit data collection properties, edit data field definitions, erase
selected data collection) will apply to this selected data collection.
Whereas
"edit data collection properties" and "erase selected data collection"
are self-explaining, the option "edit data field definitions" requires
some more explanation: here, you can add new fields to the records of
the selected data collection, determine their (MySQL) type, the name
which should display on the "details" output, and (optionally) a field
name for SDF/RDF export. A newly created data collection has only two
data
fields: "mol_id" (or "rxn_id", respectively; an integer number) and
"mol_name" (or "rxn_name") which cannot be changed. A data
collection which has been imported from an SD or RD file contains
(probably)
some more fields. Whenever you add a new field, you can choose its type
from the drop-down listbox or enter any valid MySQL column type by
selecting "other..." from the list and entering the correct type in the
input box. Caution: there is no syntax checking done by MolDB5R, wrong
input will result in MySQL errors. If you need some special MySQL
column types and options, you should use some more sophisticated
administration software like phpMyAdmin.
Note: changing the type of a data collection (from SD to RD or from RD
to SD) is only possible if the data collection is still empty.
Structure/data input
Using "Add/edit data records", you can enter new
structures/reactions+data or
edit existing ones. This is done by the script "editdata.php", again
all actions apply to the selected data collection.
Structure/data export
Using "Export data records", you can download structures/reactions as
well as their associated data (selectable for each field) in SDF or RDF
format, respectively. A list of entry numbers or number ranges can be
specified, otherwise all records of this data collection are exported.
Add new entry
Structures can be entered either by drawing, using the JME applet, or
by pasting text in MDL molfile format into the text area. After
submission, the structure is displayed and can again be edited if
desired. For structure data collections, a duplicate check is performed
by default (which can be disabled). After confirmation, the structure
is added to the data
collection and all relevant entries (molstat, functional group
patterns, fingerprints) are made in the appropriate tables. The data
entry form should be completed (at least, the name of the molecule
should be entered) and saved. Several option can be selected from the
"format" list for display of textual data:
- hidden: this field and its content are not displayed
- plain HTML: continuous text with no special formatting
- multiline: output within
<pre>......</pre>
tags, so that linebreaks
will be respected
- formula: all numbers will be subscripted
- URL: text will be displayed as a hyperlink; optionally, a
link label can be specified after a pipe symbol (|), for example: http://www.mydomain.org/mypage.html
| My Homepage
In the user frontend (browse, search, details), if SVG images are used
and your browser has SVG support, any newly entered structure or
reaction will be immediately displayed in full quality. Without SVG
support, newly added structures
are always displayed using the JME applet in
"depict" mode, until the bitmap graphics have been created by running
the "updatebitmap.pl" admin script. This should be done automatically
by a cron job.
Input of reactions is done analogously as described above, either by
drawing in JME or by pasting into the text box in MDL rxnfile format.
Atom maps should be created for as many atom pairs as possible.
Edit/erase entry
"Edit entry no." and "erase entry no." require a valid ID number
(mol_id/rxn_id) as input. If you do not know this number, you can find
it out
by doing a search or just browse the data collection. Whenever an entry
is erased, it is removed from all MySQL tables, but the original bitmap
file (if it exists) will not be deleted. Instead, the "updatebitmap.pl"
script will rename it into something like "00123456.png.to_be_deleted"
on its next run. You can then use some OS shell command in order to
actually delete these files (or modify the updatebitmap.pl script to do
a real deletion).
When editing a reaction,
an
option is presented to "extract molecules". This allows to copy any
reactant or product as a single structure to another (structure) data
collection.
Copy entry
The specified entry number can be either duplicated within the same
data collection (e.g., to create an entry for a closely related
structure/reaction by subsequent minor modification) together with all
textual/numeric data or it can be copied to a different data collection
(in this case, only as a structure/reaction without any textual/numeric
data, as the corresponding data fields may differ from collection to
collection).
License
Copyright © 2010-2013 Norbert Haider, University of
Vienna
This program is free software: You can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; version 3 of the License.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details: http://www.gnu.org/licenses/.
Appendix A
Permissions and privileges
MolDB5R does not implement its own user administration, but relies on
two access conditions:
- The web server's standard mechanism to protect a directory.
This is usually configured either in a
.htaccess
file in the directory in question or in the central configuration file
(typically /etc/apache2/httpd.conf
), using
the <Directory>
keyword.
Usually, a combination of valid client domain and password protection
(htpasswd) is specified there. Caution:
.htaccess
files are effective only if
enabled in the httpd.conf
file! Please refer
to the Apache documentation. It
is highly recommended to restrict access to the "admin" subdirectory by
this mechanism.
- The IP address of the client machine. Up to 10 "trusted IP"
addresses can be specified in the
moldb5conf.php
file: these addresses have all possible privileges within MolDB5R
(provided the user meets the access condition defined in .htaccess
,
see above). These users have the status of an "administrator". In
addition, "sub-administrators" can be specified on a
per-data-collection basis, listing their client IP address in the
moldb_meta table, field "trustedIP" (again, up to 10 IP addresses).
Who can do what?
Only the administrator(s)
can
- create a new data collection
- change the properties of a data collection (including the
definition of trusted IP addresses for sub-administrators of this data
collection), in particular setting access mode
("disabled", "read-only", "add/update", "full access")
- erase an existing data collection
The administrator(s)
and the sub-administrator(s)
of a data collection can
- add/edit/erase data records even to/from a "read-only" data
collection
- erase data records from an "add/update" data collection
- add data fields ("columns") or change the properties of
data
fields, erase data fields (with all their content!) even if the data
collection's access mode is not "full access"
Any user who
can access the "admin" subdirectory can
- add/edit/erase data records if the data collection is set
to "full access"
- add/edit data records if the data collection is set to
"add/update"
- add/edit/erase data fields if the data collection's access
mode is "full access"
Appendix B
Parameter settings in the
moldb5.conf / moldb5conf.php files
The configuration file moldb5.conf
in the
working directory and the PHP script moldb5conf.php
in the MolDB5R web directory are almost identical: the only difference
are the standard PHP tags "<?php
" and "?>
"
in the first and last line of moldb5conf.php
.
$database
= "moldb5r"; # name of the
MySQL database
$hostname =
"localhost"; # hostname of MySQL server, probably
"localhost"
$clientname =
"localhost"; # name of MySQL client, usually
"localhost"
$mysql_admin =
"root";
# MySQL administrator, usually "root"
$rw_user =
"mdb5-admin"; # proxy user with CREATE and INSERT privileges
$rw_password = "top-secret"; # (make sure
this script is unreadable to others)
$ro_user =
"mdb5-user"; # proxy user with SELECT privilege
$ro_password =
"secret"; # (better
avoid dollar signs etc.)
$drop_db =
"y";
#
erase entire database before re-creating it?
In this section, MySQL-specific settings have to be made: a databse
name has to be specified. If $drop_db is set to "y", this database will
be erased and recreated every time you run the initdb.pl script. The
machine on which the MySQL server is running is specified by $hostname.
Usually, the MySQL server and the web server are running on the same
machine, in this case $hostname is set to "localhost" and
$clientname is also set to "localhost". The name of the MySQL
administrator user must be specified, ususally this is "root" (not to
be confused with the system "root" user under Linux/Unix systems). The
two proxy user accounts in MySQL will be automatically created when
initdb.pl is run.
$tweakmolfiles =
"y";
# "y" or "n" ("y" gives better performance)
If this parameter is set to "y", the imported molfiles will be replaced
by (slightly) modified molfiles which contain some extra information
about aromaticity. These "tweaked" molfiles are generated by checkmol
and enable matchmol to do atom-by-atom comparisons (in a substructure
search) more efficiently (see Appendix C). Tweaked molfiles are
completely compatible
with other programs which can import MDL molfiles.
$prefix
=
"";
#
this allows to have different MolDB5R instances
# in one MySQL database; each MolDB5R instance can
# contain multiple data collections
As explained in the comment, you can use this extra prefix
to
keep more than one MolDB5R installation in a single MySQL database or
you can place MolDB5R together with other database applications (such
as
bulletin boards, wikis, blogs, etc.) in a single MySQL database. This
makes sense only in environments where you are restricted to use only
one MySQL database.
$sitename
= "MolDB5R demo"; # appears in title and headline
$charset =
"latin1"; # may be
"latin1", "latin2" or "utf8"
$CHECKMOL =
"/usr/local/bin/checkmol";
$MATCHMOL =
"/usr/local/bin/matchmol";
$use_cmmmsrv =
"y";
# set
to "y" if a cmmmsrv daemon is available
$cmmmsrv_addr = "127.0.0.1"; # must be
numeric
$cmmmsrv_port =
55624;
# the default cmmmsrv port is 55624
The $sitename string will appear in the header of all MolDB5R pages.
The $charset variable controls both HTML and MySQL handling of
character sets. Currently, only latin1, latin2 and utf8 are supported
(default: latin1). If other character sets are desired, please edit the
function set_charset() in the functions.php script.
In $CHECKMOL and $MATCHMOL, the executables of checkmol and matchmol
are specified. Under Linux, the programs are usually installed in the
/usr/local/bin
directory (most conveniently,
matchmol is just a hard
link to checkmol), under Windows it can be installed, for example, in
C:\bin
(but do not include the drive letter in
these values). Setting $CHECKMOL to "checkmol" would be also OK if the
program's
directory is included in the system's search path (beware: this must
apply also to the search path of the web server process).
Whereas all the backend Perl scripts rely on checkmol/matchmol, the PHP
portion of the package can also use the cmmmsrv daemon instead, if
installed on your system. This server process provides the same
functionality as the command-line programs (checkmol/matchmol), but
communicates with the web server via TCP sockets instead of shell
calls. This gives somewhat better performance (see Appendix C). In
principle, cmmmsrv
can run on a different machine than the web server, in this case
$cmmmsrv_addr must be set to the (numeric) IP address of this machine.
# the following variables control the 2D structure display
$enable_svg =
"y"; # first choice
$enable_bitmaps = "y"; # second choice
$enable_jme =
"y"; # structure editor; fallback for 2D display
The use of both SVG and bitmap graphics is recommended.
For SVG, mol2ps/mol2svg version 0.4 or higher is necessary and your web
browser must be fairly recent.
The following options are new, starting from MolDB5R version 5.20 and
are required for generating graphics in SVG format
$MOL2SVG = "/usr/local/bin/mol2svg";
$mol2svgopt = "--rotate=auto3Donly --hydrogenonmethyl=off --color=/usr/local/etc/color.conf"; # options for mol2svg, e.g. "--showmolname=on"
#$mol2svgopt = "--rotate=auto3Donly --hydrogenonmethyl=off"; # options for mol2svg, e.g. "--showmolname=on"
$mol2svgopt_rxn = "-R --rotate=auto3Donly --hydrogenonmethyl=off"; # options for mol2svg in reaction mode
$svg_scalingfactor = 1.0; # 1.0 gives good results
$svg_scalingfactor_rxn = 0.75; # 0.75 is a good compromise for reactions
The following section is needed for creating bitmap graphics with mol2ps and GhostScript:
$MOL2PS
= "/usr/local/bin/mol2ps";
$GHOSTSCRIPT = "/usr/bin/gs";
#$bitmapdir = "";
$bitmapdir =
"/var/www/htdocs/moldb5r/bitmaps"; # this is the base
directory
$bitmapURLdir = "/moldb5r/bitmaps";
$digits
= 8; # filenames will
be 00000001.png, etc.
$subdirdigits =
4; # uses the first x
digits of $digits (0 = no subdirectories)
#$mol2psopt =
"--rotate=auto3Donly
--hydrogenonmethyl=off --color=/usr/local/etc/color.conf"; # options
for mol2ps
$mol2psopt =
"--rotate=auto3Donly --hydrogenonmethyl=off"; # options for mol2ps,
e.g. "--showmolname=on"
$mol2psopt_rxn = "-R --rotate=auto3Donly --hydrogenonmethyl=off"; #
options for mol2ps in reaction mode
$scalingfactor =
0.22;
# 0.22 gives good results
$scalingfactor_rxn =
0.18; # 0.18 is a
good compromise for reactions
These options are relevant only if you want to use bitmap
graphics for 2D depiction of your molecular structures, otherwise set
$bitmapdir to an empty string ($bitmapdir = ""). $MOL2PS points to the
mol2ps program, $GHOSTSCRIPT to the Ghostscript program (under Windows,
this must be the console application "gswin32c.exe"). The directory
specified in $bitmapdir must be writable to you, but not necessarily to
the web server. Whenever new structures are entered via the web
frontend, bitmap files are not immediately created, but queued for
creation by an auxiliary script "updatebitmap.pl" which should be run
in regular intervals by the user who has write permission to
$bitmapdir. Most conveniently, this is done by a cron job, e.g. once a
day.
Within the bitmap directory, subdirectories are created automatically
by updatebitmap.pl (as well as sdf2moldb.pl) for each data collection,
using the number ("db_id") of that data collection as the directory
name. For larger data collection (>10000 structures), it is
advisable to further divide the directory hierarchy by specifying a
value of (e.g.) 4 as $subdirdigits and (e.g.) 8 as $digits for the PNG
file names. This would give paths to the actual graphics files like
/var/www/htdocs/moldb5r/bitmaps/1/0000/00000045.png
(in an URL, this
would be /moldb5r/bitmaps/1/0000/00000045.png
).
Thus, the maximum number
of PNG files in one directory can be limited, otherwise you might run
into problems with some shell commands.
For all available options of mol2ps, please refer to the mol2ps/mol2svg
homepage or just enter "mol2ps" on a command prompt. Please
note that for SVG images, mol2ps/mol2svg v0.4 or higher is required.
# further
settings=============================================================
$multiselect =
"y";
#
allow selection of multiple data collections
$default_db =
"1";
# default db_id
$trustedIP = "127.0.0.1,
192.168.0.10"; # max. 10 IP addresses
$fpdict_mode =
1; # 1 = auto adjust,
2 = force 64 bit, 3 = force 32 bit
$scratchdir =
"/data/moldb/moldb-scratch"; # needed by cp2mem.pl,
# must be writeable by mysql UID, too
$enablereactions = "y"; # if
not "y": no
reaction support
$enable_download = "y"; #
download option for hit structures/reactions
$download_limit =
100; # maximum number of hit
structures/reactions to be downloaded (per search)
$java_codebase = "/classes"; # either leave empty or specify URL of the
directory containing JME.jar
In this section, you can specify whether or not users can
select only
one data collection for searching/browsing or all available data
collections. If there are more than one data collections, $default_db
specifies the one which is automatically selected whenever a user
visits your MolDB5R website (this variable can contain a
comma-separated list of more than one collections, written in quotes
such as "1,2,4").
In $trustedIP, you can specify up to 10 (numeric) IP addresses for
extended administrative privileges. Besides protecting the "admin"
subdirectory with a password (using the standard mechanisms of the
Apache
web server, e.g. .htaccess files), the $trustedIP directive gives extra
protection. Only from a client machine listed in $trustedIP, you can
change the settings of a data collection (e.g. from read-only to
read/write) or erase an entire data collection. If you want to assign
special administrative priveleges to just one data collection to
somebody (e.g. for write access to a read-only data collection), use
the analogous "trusted IP address" specification for this particular collection in the web
administration frontend.
The parameter $fpdictmode is rather technical and determines the format
in which dictionary-based fingerprints are stored in the molcfp table
(either as unsigned 32-bit integers or as 64-bit integers). In the
first case, a MySQL column will store information about
presence/absence of 31 structural elements from the dictionary, in the
latter case, up the 62 bits will be used for this purpose. A setting of
"1" is recommended. Whenever the fragment dictionary is changed or
extended (by editing the file fp01.sdf and re-loading it with
initfpdef.pl), the molcfp tables have to be re-created by invoking the
scripts mkmolcfp.pl and mkrxncfp.pl..
The $scratchdir variable holds the name of a directory for temporary
files which are used by "cp2mem.pl" when it copies the content of
disk-based tables (molstat and molcfp) to memory-based tables (see
Appendix C). It is very important that not only the user who runs
"cp2mem.pl" has write access to this directory, but also the MySQL
server process (which usually runs as user "mysql").
The $enablereactions variable controls whether or not MolDB5R presents
its support for reaction databases to the user. If set to "n", the
"Reaction Search" menu item will not be displayed and only SD data
collections are presented for selection. Nevertheless, the
administration front-end will contain all the tools for reaction
handling.
The $enable_download and $download_limit variables are used to
enable/disable
download of hit structures/reactions after a search (all search types).
Only structures in SDF format or reactions in RDF format (maximum
number per search is defined by $download_limit), but none of the
associated textual/numeric data fields can be downloaded. For a full
export of
structures/reactions + data, use the "export data records" button on
the administration page.
In the $java_codebase parameter, you can specify the location (URL) of
the JME applet (JME.jar). If JME.jar is in the same directory as the
other MolDB5R php-scripts, just set this parameter to an empty string
(""). If MolDB5R is installed in a password-protected directory of your
web server (using the .htaccess mechanism), it is advisable to place
JME.jar somewhere else in a publicly accessible directory in order to
avoid additional authentication requests by the user's Java runtime
environment. For instance, if we assume your web server's document root
is "/var/www/htdocs" and MolDB5R is installed in a password-protected
directory "/var/www/htdocs/moldb5r" (the URL would look like
http://www.somedomain.com/moldb5r"), you could place JME.jar in
"/var/www/htdocs/classes" and thus set the $java_codebase parameter to
"/classes".
The last part of the configuration file contains various definitions
for table names and should not be changed, unless you have a very good
reason to do so. Please note that some new definitions (for reaction
tables) have been added in MolDB5R which were not present in MolDB5 and
previous versions.
Appendix C
Performance tuning
There are three options which should be considered when maximum
performance in substructre searches is desired:
- Enabling the use of "tweaked"
molfiles
for internal storage of structures instead of regular molfiles. This is
a feature of checkmol/matchmol: information about aromatic
substructures (per atom and per bond) is stored in unused columns of
the MDL molfile connection table. Whenever matchmol encounters such a
"tweaked" molfile, it can skip lengthy ring detection and aromaticity
determination routines (for a more detailed description, please check
the checkmol/matchmol
homepage).
The use of "tweaked" molfiles is enabled by default in the configuration file (see
Appendix B), it is highly recommended to leave this parameter set to "y".
- Using cmmmsrv
instead of
checkmol/matchmol: cmmmsrv is the socket-based server equivalent to the
command-line programs, checkmol and matchmol. The PHP scripts
communicate with cmmsrv in a similar way as with checkmol/matchmol, but
instead of shell calls (via the
popen()
function of PHP),
a TCP socket connection is established to the cmmmsrv process for
reading and writing. This allows for larger blocksizes to be
transferred between PHP and the cheminformatics engine, thus keeping
the overhead smaller. Please note that cmmmsrv is available only for
Linux (and Mac OS X, FreeBSD), not for Windows.
If you prefer checkmol/matchmol instead of cmmmsrv, there is another
performance improvement possible on a Linux system: replace the
standard shell "bash" by the smaller and faster "ash" shell. This can
be done by removing the symbolic link "/bin/sh
"
(which by default points to "/bin/bash
" and
creating a new symbolic link "/bin/sh
" which
points to "/bin/ash
" (thanks to Ernst-Georg
Schmid for this hint). Caution:
some Linux distributions may rely on /bin/sh
to be a symbolic link to /bin/bash
for their
startup scripts (in /etc/init.d
). To be on
the safe side, replace all occurrences of "#!/bin/sh
"
in these script by "#!/bin/bash
".
- Using memory-based
MySQL tables
instead of disk-based tables: the two tables "molstat" and "molcfp" are
particularly critical in terms of search performance for the fast
pre-selection process. MySQL offers the possibility to keep tables
entirely in memory, whereas only their data structure is permanently
stored on disk. These memory-based tables are accessed considerably
faster than their disk-based counterparts. As a disadvantage, the
content of memory-based tables is lost every time the MySQL server
process is restarted. In MolDB5R, a combination of disk-based tables
and
memory-based copies of the original (persistent) tables is used. This
means, all the information about molecular statistics (molstat) and
fingerprints (molcfp) is permanently stored in the disk-based tables,
and as an option you can enable copies of these tables as memory-based
tables. In order to make use of this option, you have to set the
"usemem" flag in the moldb_meta table to "T" for the data collection in
question (this is done with the Perl script "setmem.pl") and
subsequently you have to copy the entire content of the molstat and
molcfp tables to memory (this is done with the Perl script "cp2mem.pl"
or its alternative script "cp2mem_slow.pl"; the latter does not require
write permissions for the MySQL process to a scratch directory). Please
note that the use of memory-based tables is recommended only for
"read-only" data collections, as any change to a chemical structure in
a data collection will disable the memory-based tables until you run
"cp2mem.pl" for the next time. For large data collections (e.g., with 1
million structures), it will be necessary to have sufficient RAM in
your machine and you probably have to adjust some settings in the MySQL
configuration file (my.cnf): set
max_heap_table_size
and tmp_table_size
to sufficiently high
values.
NH, 2010-06-11; last update:
2013-06-19