MolDB6
a free software package for creating a web-based, searchable molecular
structure/reaction database
Norbert Haider, University of
Vienna, 2014-2021
norbert.haider@univie.ac.at
These instructions explain how to set up MolDB6 as a web-based, fully
searchable
molecular structure database. Structures and data can be added via a
web interface or by import from an MDL SD file (at the server
command-line). This import option
is shown by an example (see below), using the SD files which are freely
available from the PubChem FTP site at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/.
Analogously, reactions and data can be imported from an MDL RD file.
For some information about the original ideas, please visit
https://homepage.univie.ac.at/norbert.haider/cheminf/moldb.html
The previous version, MolDB5R, is documented at
https://homepage.univie.ac.at/norbert.haider/cheminf/moldb5rdoc.html
MolDB6 is a collection of fully functional PHP scripts for running a
structure/reaction database with search options for text, functional
groups, and
structure/substructure. Included are also some Perl scripts for
database setup and data import/export
from/to an SD or RD file.
Download: https://homepage.univie.ac.at/norbert.haider/download/chemistry/moldb/moldb6.tar.gz
Some screenshots: https://homepage.univie.ac.at/norbert.haider/cheminf/moldb6screenshots.html
Content
Features
Technical background
Installation
Using MolDB6
Administration of MolDB6
License
Appendix A (permissions
and privileges)
Appendix B (configuration
file settings)
Appendix C (performance
tuning)
Features
- Structure+data import from SD files, export to SD files
- Reaction+data import from RD files, export to RD files
- Text search
- Browse functioanlity
- Functional group search
- Extended functional group search in reaction data
collections (creation or loss of groups)
- Structure/substructure/similarity search in structure data
collections
- Structure/substructure search in reaction data collections
- Support for multiple data collections within one MySQL
database
- All search operations (text, functional groups,
substructure) can span multiple data collections
- Web-based administration tool for data collection
management, structure and data input/editing
- Runs on different operating systems: Linux, Mac OS X,
Windows
New features in MolDB6 as
compared to MolDB5R:
- Support for different graphical structure editors (JME,
JSME, GGA Ketcher, FlaME), user-selectable
- When used with JSME, mobile devices (iOS and Android smartphones and tablets) are fully supported as clients
- Structure and reaction structure search are combined in one
search form
- Automatically generated data fields for molecular formula
and molecular weight
- Optional: automaticallly generated data field for InChIKey strings
- Short-cut links for copying and erasing (in addition to
editing) an entry from the "details" page
- Easy cloning of the data structure of existing data
collections
Technical
background
The system uses MySQL as the database engine, whereas its "chemical
intelligence" is provided by the checkmol/matchmol program (or its
socket-based server version, cmmmsrv). All relevant data are stored in
MySQL tables, except the bitmap graphics of the molecules.
Structure/substructure searches are performed in a two-stage process,
consisting of an efficient pre-selection and a full atom-by-atom
comparison of the remaining candidates. For pre-selection, a
combination of molecular statistics with dictionary-based fingerprints
and hash-based fingerprints is used. Optionally, this operation can be
performed with memory-based MySQL tables instead of disk-based tables.
This will make the search even faster. For structure data collections,
there is also a similarity search implemented with a user-selectable
ratio of structural vs. functional similarity.
Demo
MolDB6 database
Installation
Hardware requirements:
any x86 PC or Mac (the faster, the better)
Software requirements:
- Linux
operating system (any distribution), Mac OS X or Windows (experimental!)
- Apache
web server 2.x
- MySQL
5.x or higher (server and client software)
- PHP
5.x or 7.x
- Perl
5.x
==> these packages
are standard components of every Linux distribution (for Windows, the
XAMPP package can be used, for Mac OS X, the MAMP package is a good
choice)
moreover:
- checkmol/matchmol,
v0.5 or higher
==> open-source software, available from https://homepage.univie.ac.at/norbert.haider/cheminf/cmmm.html
For faster operation, it is recommended to use the socket-based server
version of checkmol/matchmol (named cmmmsrv) instead. This software is
not available for the Windows operating system.
- At least one
of these graphical structure editor:
- mol2ps/mol2svg, v0.4b or higher
==> open source software, available from https://homepage.univie.ac.at/norbert.haider/cheminf/mol2ps.html
- Optional software:
- GhostScript (if you want to supply also bitmap images of the molecular structures, in addition to SVG graphics)
==> open source software, available from http://www.cs.wisc.edu/~ghost/
(usually included in the Linux distribution; for Mac OS X, it can be
most easily installed as part of the MacTeX package)
- The IUPAC inchi-1
program (if you want to automatically generate InChIKey strings for
your molecules); this software is available as source code and in
binary version (for Windows and Linux) from the project website at http://www.inchi-trust.org/downloads/
- OpenBabel: the open
source chemistry toolbox (if you want to produce energy-minimized 3D
structures from your 2D input; an example script [mk3d.pl] is provided
in order to show how the
obminimize
command can be used for this purpose), see http://openbabel.org/wiki/Main_Page
- Jmol: an open-source
Java viewer for chemical structures in 3D (if you have 3D molfiles to
display; for generating them, see above); available from http://jmol.sourceforge.net/
- JSmol (containing the
JSV package): an open-source JavaScript replacement for Jmol (see
above) that can be used to display 3D structures in MDL molfile format
and spectra in JCAMP-DX format. Available from https://sourceforge.net/projects/jsmol/
Note: the MolDB6 package was
developed and
tested on a Linux machine, but
(maybe with a few adaptations) it should run also on a FreeBSD
system.
Under Mac OS X (tested on 10.6 aka Snow Leopard), MolDB6 runs without any
modifications, but requires some non-trivial software installation:
GhostScript (see above) and the
Perl DBI and DBD::mysql modules. The Windows
development and (some) testing
was done on a virtualized Windows XP system with the XAMPP 1.7.0
package. Instead of the Perl component of XAMPP, ActivePerl 5.6.1 was
used and the missing database driver was installed via the PPM utility (ppm
install DBI
; ppm install DBD::MySQL
).
Also with some Linux distributions, it may be necessary to add the Perl
driver for MySQL, e.g. in CentOS 8, you would enter the following
command: dnf install perl-DBD-MySQL
Where to start?
- Make sure your web server is working correctly with PHP
support (including the php5-mysql module or its php7 counterpart, respectively).
- Check if MySQL is working correctly and if the
administrative
user (usually "root") has a password. Please note that on some Linux
distributions, the MySQL "root" account is configured differently, e.g.
without a password, but accessible only by the Linux "root" user. In
such a case, it is useful to first create manually (using the
command-line mysql client as Linux "root") another administrative MySQL
account, e.g. named "admin" who has a secure password and who has all
the privileges for all databases, ioncluding the GRANT option.
- Download and install checkmol/matchmol and mol2ps/mol2svg,
preferentially in
/usr/local/bin
(Note: you
do not have to compile
these programs yourself, there are pre-compiled binaries available for
Linux, Mac OS X and Windows).
- Optionally install GhostScript (if not already
installed). THis option is only necessary if you want to use bitmap
images inszead of SVG graphics, i.e. if you expect client access with
very old web browsers.
- Create a working directory (e.g., within your home
directory)
- Create a directory within the web server's
document root
directory and make sure you have write permissions there. This will be
the "MolDB6 web directory" which holds the PHP scripts etc.; create
the two subdirectories "admin" and "bitmaps" within the MolDB6 web
directory.
- Download at least one of the graphical structure
editors (e.g.,
JSME) and install it within the web server's document root (either
inside or outside the MolDB6 web directory). In the case of JSME (which
definitely a good chouice), you need only the "jsme" directory and all
of its contents. This is the directory that contains the
jsme.nocache.js file. Thus, a typical installation path would be
/var/www/htdocs/jsme/jsme.nocache.js (and associated files + 2
subdirectories).
After downloading and unpacking the MolDB6 package in your working
directory, rename or copy the sample configuration file into
"moldb6.conf
". Load this file into a text
editor and adjust the
settings. A detailed overview of all parameters is given
in Appendix B.
Select a name for your database, define user IDs for a proxy
user with write privileges and another proxy user with
read-only privileges (the example file "moldb6.conf"
already contains these entries which can be accepted or modified;
please change at least the passwords!).
If you want to use bitmap graphics of your
molecules (recommended only if you want to support clients with outdated browsers), please specify a directory where
these bitmaps should be stored and make sure you have
write permission there. This should be the final location of the .png
files
(somewhere within your web server's document root, preferentiall as a
subdirectory "bitmaps" of your MolDB6 web directory). If you don't
want
bitmaps, just set $bitmapdir
and $bitmapURLdir
to
an empty string ("") and set $enable_bitmaps = "n";
in the moldb6uiconf.php
file.
Initialize
the MolDB6 database
by running the Perl script "initdb.pl" (enter "perl
initdb.pl
",
be sure to be in the same directory as the
script). You
will be prompted for the password of the MySQL administrator
(typically, this is the "root" MySQL user, not to be confused with the
"root" operating system user). The script will create the
appropriate MySQL database and two tables within this database (one for
metadata and one for the fragment dictionary). Moreover, the two proxy
user accounts are created and their passwords set as specified in the
configuration file. If everything goes well,
the script will produce this output:
setup of database $database OK
(where $database displays the name you have specified)
setup of fragment dictionary OK
Please note that this Perl script does not make use of the MySQL
database driver module, but just calls the "mysql" command-line
program. Thus, it works also on systems where the appropriate driver
modules are not installed. Depending on your version of the MySQL server, it might be necessary to adjust the contents of the $oldmysql
variable in the header of the initdb.pl script.
Import of structures+data
from an SD file (*.sdf):
As an example, we use Compounds_00000001_00025000.sdf
(after unpacking the .gz file which you can download from the PubChem
FTP
site).
If you have wget
installed, you can retrieve the PubChem file very
conveniently by entering e.g.
wget https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_000000001_000500000.sdf.gz
Prerequisite: the checkmol/matchmol program and the mol2ps/mol2svg program must be installed (e.g., in
/usr/local/bin
, for more information, see
https://homepage.univie.ac.at/norbert.haider/cheminf/cmmm.html and https://homepage.univie.ac.at/norbert.haider/cheminf/mol2ps.html).
These steps
should be executed exactly in this order:
- Change into the util/ subdirectory and run the Perl script
"sdfcheck.pl" with the name of the SD
file to be imported as the command-line
argument, e.g. "
perl sdfcheck.pl
Compound_000000001_000500000.sdf
" (add a leading "../" or
"../../" etc. to the SD filename if necessary). This
will create a definition file "sdf2moldb.def". By
default, only the first 10000 records are
inspected which should be sufficient in order to collect
the necessary information about the file structure.
- Load the file "sdf2moldb.def" (see above) into a text
editor and inspect it. Here, you can change the
types of MySQL columns or their dimensions. In
addition,
you MUST
rename the most descriptive field name (preferentially the one holding
the molecule's trivial name or its systematic
name) into "mol_name".
If you do not do this, you will have
to manually adjust all the PHP scripts in order to retrieve
the correct field. In our PubChem example, change the MySQL field
name (i.e., the 2nd column, between the first and the second
":") "pubchem_iupac_name" into "mol_name". Many numeric fields are
assumed to be
floating-point numbers (MySQL type "DOUBLE"), but in fact are integer
numbers, so their type should be changed into (e.g.)
"INT(11)". In our PubChem example, only fields such as "pubchem_openeye_mw" and "pubchem_cactvs_xlogp" should be kept as
"DOUBLE". If you do
not need one or more of the recognized
SDF fields in your MySQL table, simply delete the
corresponding lines from the definition file. Make sure that there is
no field named "mol_id", because this is a reserved name of the MolDB6
system. There should also be no field names starting with "auto_mol_", as they are reserved labels in MolDB6.
- Import the content of your SD file into your MySQL database
by running the Perl script "sdf2moldb.pl" with the
name of the SD file to be imported as the command-line
argument, e.g. with "
perl sdf2moldb.pl
Compound_000000001_000500000.sdf
".
Depending on the number of molecules in the SD
file, this operation may take some time (on a standard business PC,
approx. 1 h for 10000 records). It is recommended to do
this operation in a background job, most conveniently
in a "screen" session (screen is a utility program
which is included in most Linux distributions).
Note: you can create bitmap images of your structures
either during the initial data import (set "$make_bitmaps
= 1;
" in the sdf2moldb.pl script) or later (set "$make_bitmaps =
0;
").
In the latter case, data import will be significantly faster, bitmap
files can then be created (if needed at all) by running the
updatebitmap.pl
script in the util/ subdirectory.
- If you want to add new records, repeat step 3 with another
SD file. By default, all data will be appended. ATTENTION:
all
SD files that are intended for the same data collection must have the same format (names and types
of data fields) as the first one. For example, you can
sequentially import all of the SD files from the PubChem site
(see above). (And don't forget to let me know your performance
figures for a database of >150 million
structures....) ;-)
- If you want to create a new data collection, repeat the
steps
described above, but change the "db_id=1" line in the sdf2moldb.def
file into "db_id=2" (or higher, if applicable).
Import of reactions+data
from an RD file (*.rdf)
Reactions with associated data can be imported in a very similar manner
as described for SDF import (see above). Please note that in the
present version, only "flat" (non-hierarchical) RD files can be handled.
- Run the Perl script "rdfcheck.pl" in the
util/ directory with the name of the RD
file to be imported as the command-line
argument.
- Edit the resulting definition file "rdf2moldb.def". Define
a
field "rxn_name" and make sure that there is no field "rxn_id". Make
sure that the db_id is set to a number which is still available.
- Import the RD file by running the Perl script
"rdf2moldb.pl" with
the RD file name as the command-line argument. Bitmap and SVG depiction works in the same way as for structure data collections.
Next, the web server part can be
installed:
- Of course, you must know the location of
your web server's
document root directory. This might be something like "/var/www/htdocs"
or
"/xampp/htdocs", etc. (in a typical Apache installation, it is
specified in the configuration file "httpd.conf" or in other "*.conf" files that are consulted on start-up). On many websites, a
special subdirectory (e.g, "public_html") witin a user's home directory
can also be a valid location.
- Within the document root, you should have already created a
directory which will hold the MolDB6 PHP scripts, e.g. "moldb6".
Usually, the "bitmaps" directory will be there as a subdirectory (e.g.,
"moldb6/bitmaps"). Create another subdirectory named "admin" (e.g.,
"moldb6/admin").
- Copy all the files from
the directory "wwwfiles_php5" or "wwwfiles_php7" (depending on your PHP version) of this package to the MolDB6 web directory,
copy also all of the scripts in "wwwfiles_php5/admin" or "wwwfiles_php7/admin" to the "admin"
subdirectory of the web directory.
- The main configuration file "moldb6conf.php" should contain
exactly
the same
values for all parameters as the setup
configuration file "moldb6.conf". In principle, "moldb6conf.php" is
nothing else than an exact copy of "moldb6.conf" with just two extra
lines (one at the beginning: "<?php ", and one at the end:
"?>").
For testing, this file can be placed also in the MolDB6 web directory.
In a publicly accessible production environment, however, this file
should be placed outside
the web server's document root, somewhere in the PHP include path
(check your php.ini file and/or the PHP docs). Keeping the
"moldb6conf.php" file within the document root is a security risk, as
it
contains plain-text passwords. Make sure the web server can read this
file and revoke read permission for other users. If you are using
bitmap graphics, make sure that the variable
$bitmapURLdir
points to the correct location, e.g. "/moldb6/bitmaps" (no trailing
slash!).
- Make sure you have at least one of the supported structure
editors installed (e.g., JSME). The location of the editor(s) must be
specified in the second configuration file,
moldb6uiconf.php
(an example is provided in the wwwfiles directory). This file should also contain the location of the JSmol package, if used.
Note 1: it is
assumed that PHP is
not running in "safe mode". If safe_mode is set to "On" in the php.ini
file, you have to make sure that checkmol/matchmol and mol2ps/mol2svg are allowed
to
be executed, this can be done by specifying the corresponding directory
in the variable safe_mode_exec_dir
in php.ini
(please check your PHP
documentation).
Note 2: MolDB6 supports also other
character sets in addition to latin1 (which is the default).
Please set the $charset variable in the configuration file(s) to either
"latin1", "latin2" or "utf8".
The subdirectory "util"
of the working directory contains some further Perl scripts:
- initfpdef.pl (re-)initializes the fragment dictionary by
loading the structures stored in fp01.sdf into the appropriate table.
- similar to sdfcheck.pl and sdf2moldb.pl (see import example above), there
are
the scripts rdfcheck.pl and rdf2moldb.pl for analysis and import of
reactions from MDL RD files
- mkmolstat.pl (re-)generates the molstat table(s) (which
contains the molecular "statistics") by retrieving all structures and
analyzing them with checkmol
- mkmolfgb.pl (re-)generates the molfgb table(s) (which
contains
information about the functional groups) by retrieving all structures
and analyzing them with checkmol
- mkmolcfp.pl (re-)generates the molcfp table(s) (which
contains
the molecular "binary fingerprints", i.e., a combination of
dictionary-based and hash-based fingerprints) by retrieving all
structures and comparing them with the fragment dictionary which is
stored
in the fpdef table and in the fp01.sdf file; this is done with
matchmol, using the "-F" option; hash-based fingerprints are generated
by invocation of checkmol with the "-H" option.
- For reaction data collections, there are the analogous
scripts
mkrxnfgb.pl and mkrxncfp.pl (there is no mkrxnstat.pl, as the molstat
descriptors are not used for reactions).
- updatebitmap.pl creates bitmap images in png format within
the
bitmaps subdirectory for all new structures, using the pic2d
MySQL
table for housekeeping. If this script is run via a cron job (e.g.,
every night), the location of the configuration file should be
specified by its absolute path (in the header of the script).
- updatesvg.pl (re-)generates SVG (scalable vector graphics) images
of all structures and reactions and stores them in the appropriate
column in the pic2d table.
The mk... scripts should be run whenever a new version of
checkmol/matchmol (or cmmmsrv, respectively) is installed on your
system.
The script cp2mem.pl will copy the persistent (disk-based) molstat and
molcfp MySQL
tables to memory-based MySQL tables, which are accessed
faster than the disk-based tables. In order to make use of these
tables, the moldb_meta table has to be modified (the "usemem" column
must be set to 'T'), this can be done with the Perl script "setmem.pl"
with an appropriate command-line argument, e.g. "perl
setmem.pl
1=T" (enabling memory-based tables for data collection 1). Do
not forget to run cp2mem.pl after enabling memory-based tables and
after each reboot of your machine,
otherwise the memory-based tables would be empty. Generally, this
option is recommended only for larger read-only data
collections.
There are also utility scripts for diagnostic purposes (listdb.pl),
(re-)creation of definition files from the MySQL table structure
(dumpdef.pl) and data export into SD or RD files (moldb2sdf.pl,
moldb2rdf.pl). All these scripts contain a short description of their
functionality.
Tools for the import of very large data sets from SD files:
Because the "normal" data import from an SD file using the
"sdf2moldb.pl" script can be rather time-consuming, there is now (2016
version) an alternative method available: the import task is split into
(a) the creation of one or more SQL files that contain all relevant
information and simple placeholders for the unique mol_id numbers, and
(b) loading of these pre-assembled SQL files into the database. The
first sub-task is accomplished by the script "sdf2sql.pl", it is
computationally "expensive" and can be run in parallel on several
independent machines in order to generate all the desired SQL files.
The latter then can be copied to the server machine and they are loaded
very fast by the simple script "sql2moldb.pl", this script only takes
care of replacing the preliminary mol_id placeholders with the correct
ID numbers in ascending order. For more information, please consult the
explanations in the script code.
That's it for the installation part. You may wish to adjust
the HTML layout of the index.php (preferentially the custom-inc.php) and moldb.css files and the PHP
scripts, replacing "MolDB6" by something else. If you write any useful
extensions, I would be grateful
if you donate these scripts to the project. And a little link to the checkmol/matchmol
homepage would also be appreciated.
Using MolDB6
Start page
On the start page (index.php), the user can select the data
collection(s) from the list box and clicking the "Apply selection"
button. Multiple selections (if enabled in the configuration file) can
be made by pressing the <Ctrl> key while clicking at the
list
item. All search options are available via the navigation bar at the
top
of the page. A link to "Preferences" (if enabled) and to the
administration page is displayed at the right-hand side of the menu (it
may be advisable to remove this link...).
Browse
All selected
data
collections are available for browsing on this page, using standard
navigation elements. Compound structures are displayed graphically
(using SVG or bitmap pictures, if available), together with compound or
reaction names and
the ID number (mol_id or rxn_id). The latter represents a hyperlink to
a detailed
view of this particular entry (provided by the script "details.php").
The kind of graphics used (SVG or bitmaps) will depend on the settings
in moldb6uiconf.php and on the capabilities of the client's
browser.
Text search
There is a simple text search tool which by default only searches
the
"mol_name" (and/or "rxn_name")
field. Search terms are not case-sensitive, as a wildcard character you
can use the "%" or "*" symbol. By selecting the option "include other
searchable fields", you can extend the search to all text fields which
have been defined as "searchable" by the administrator (see below). An advanced text search option is also available
which allows entering search terms for each individual searchable data
field (these terms can be combined with logical AND or OR operators).
By nature, the advanced text search feature can be used only when a
single data collection is selected by the user.
Functional group search
This search option makes use of checkmol's capabilities to recognize
approx. 200 different functional groups. Categorisation of compounds is
performed already when they are entered into the database. The
properties are stored in binary format in a special table (molfgb), and
the search operation is extremely fast. Multiple selection of
functional groups from the list box can be made by pressing the
<Ctrl> key
while clicking at the desired items.
As soon as a reaction
data collection is selected, there are extended search options
available: the user can define whether the selected functional group(s)
should be present in any reactant or in any product or should be
created or lost during the reaction.
Combined structure and reaction search input
By clicking on the light-grey "no structure defined" image, a
pop-up window opens. It contains the structure editor (according to the
settings in the "Preferences" dialogue) and lets the user draw an input
structure or reaction. By clicking on "Submit to search form", this
structure (or reaction) is copied to the search form and the pop-up
window closes. Depending on the type of search item (structure or
reaction), the search form displays slightly different options (see
below).
Structure search
This is the central search facility, permitting exact search,
substructure search, and similarity search. The
input structure must contain at least three atoms. Checkboxes are
available for more specific search modes (strict atom/bond type
comparison, geometry checks) if a search results in too many hits.
Structure search works also for reaction data collections. As an
alternative to drawing the query structure, it can be entered as text
in a pop-up window (in MDL molfile format).
Reaction structure search
Reaction data collections can be searched
for structures or substructures (no similarity search option). User
input is drawn in the pop-up window as explained above and can be either one structure (as
reactant or as product) or multiple structures, e.g., one reactant and
one product (use the "new" button in JME/JSME to start a new molecule).
There is also basic support for atom mapping (provided the
corresponding mappings are present in the data collection).
Administration of
MolDB6
Data collection handling
Apart from data import from SD or RD files, new entries can be added
via the
administration frontend in the "admin" subdirectory. The main page
("admin/index.php") is used for adding new data collections and
editing/erasing existing ones. Besides the (recommended)
password
protection of this directory, the "trustedIP" setting restricts access
to these functioanlities (for a more detailed description of
privileges, see below)..
On this page, only one data collection can be selected at a time. All
operations
(edit data collection properties, edit data field definitions, erase
selected data collection) will apply to this selected data collection.
Whereas
"edit data collection properties" and "erase selected data collection"
are self-explaining, the option "edit data field definitions" requires
some more explanation: here, you can add new fields to the records of
the selected data collection, determine their (MySQL) type, the name
which should display on the "details" output, and (optionally) a field
name for SDF/RDF export. A newly created data collection has only two
data
fields: "mol_id" (or "rxn_id", respectively; an integer number) and
"mol_name" (or "rxn_name") which cannot be changed. Moreover, all new
structure (not reaction) data collections by default contain the fields
"auto_mol_formula" and "auto_mol_fw". A data
collection which has been imported from an SD or RD file contains
(probably)
some more fields. Whenever you add a new field, you can choose its type
from the drop-down listbox or enter any valid MySQL column type by
selecting "other..." from the list and entering the correct type in the
input box. Caution: there is no syntax checking done by MolDB6, wrong
input will result in MySQL errors. If you need some special MySQL
column types and options, you should use some more sophisticated
administration software like phpMyAdmin.
Note: changing the type of a data collection (from SD to RD or from RD
to SD) is only possible as long as the data collection is still empty.
Structure/data input
Using "Add/edit data records", you can enter new
structures/reactions+data or
edit existing ones. This is done by the script "editdata.php", again
all actions apply to the selected data collection.
Add new entry
Structures can be entered either by drawing, using the JME or JSME applet, or
by pasting text in MDL molfile format into the text area. After
submission, the structure is displayed and can again be edited if
desired. After confirmation, the structure is added to the data
collection and all relevant entries (molstat, functional group
patterns, fingerprints) are made in the appropriate tables. The data
entry form should be completed (at least, the name of the molecule
should be entered) and saved. Several options can be selected from the
"format" list for display of textual data:
- hidden: this field and its content are not displayed
- plain HTML: continuous text with no special formatting
- multiline: output will be placed within
<pre>......</pre> tags, so that linebreaks are respected
- formula: all numbers will be subscripted
- URL: text will be displayed as a hyperlink; optionally, a
link label can be specified after a pipe symbol (|), for example: http://www.mydomain.org/mypage.html
| My Homepage
Fields of this type can contain several entries which must be separated by commas
In the user frontend (browse, search, details), newly added structures
are always displayed as SVG images or (for old browsers) using the JME/JSME applet in
"depict" mode, until the bitmap graphics have been created by running
the "updatebitmap.pl" command-line script. This should be done automatically
by a cron job. In the admin frontend (add/edit structure), SVG or JME/JSME is used for structure display.
Input of reactions is done analogously as described above, either by
drawing in JME/JSME or by pasting into the text box in MDL rxnfile format.
Atom maps should be created for as many atom pairs as possible.
Edit/erase entry
"Edit entry no." and "erase entry no." require a valid ID number
(mol_id/rxn_id) as input. If you do not know this number, you can find
it out
by doing a search or just browse the data collection. Whenever an entry
is erased, it is removed from all MySQL tables, but the original bitmap
file (if it exists) will not be deleted. Instead, the "updatebitmap.pl"
script will rename it into something like "00123456.png.to_be_deleted"
on its next run. You can then use some OS shell command in order to
actually delete these files (or modify the updatebitmap.pl script to do
a real deletion).
When editing a reaction,
an
option is presented to "extract molecules". This allows to copy any
reactant or product as a single structure to another (structure) data
collection.
Copy entry
The specified entry number can be either duplicated within the same
data collection (e.g., to create an entry for a closely related
structure/reaction by subsequent minor modification) together with all
textual/numeric data or it can be copied to a different data collection
(in this case, as a structure/reaction together with only those textual/numeric
data that have an exactly matching column in the target data collection).
License
Copyright © 2014-2021 Norbert Haider, University of Vienna
This program is free software: You can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; version 3 of the License.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details: http://www.gnu.org/licenses/.
Appendix A
Permissions and privileges
MolDB6 does not implement its own user administration, but relies on
two access conditions:
- The web server's standard mechanism to protect a directory.
This is usually configured either in a
.htaccess
file in the directory in question or in the central configuration file
(typically /etc/apache2/httpd.conf
), using
the <Directory>
keyword.
Usually, a combination of valid client domain and password protection
(htpasswd) is specified there. Caution:
.htaccess
files are effective only if
enabled in the httpd.conf
file! Please refer
to the Apache documentation. It
is highly recommended to restrict access to the "admin" subdirectory by
this mechanism.
- The IP address of the client machine. Up to 10 "trusted IP"
addresses can be specified in the
moldb6uiconf.php
file: these addresses have all possible privileges within MolDB6
(provided the user meets the access condition defined in .htaccess
,
see above). These users have the status of an "administrator". In
addition, "sub-administrators" can be specified on a
per-data-collection basis, listing their client IP address in the
moldb_meta table, field "trustedIP" (again, up to 10 IP addresses).
Who can do what?
Only the administrator(s)
can
- create a new data collection
- change the properties of a data collection (including the
definition of trusted IP addresses for sub-administrators of this data
collection), in particular setting access mode
("disabled", "read-only", "add/update", "full access")
- erase an existing data collection
The administrator(s)
and the sub-administrator(s)
of a data collection can
- add/edit/erase data records even to/from a "read-only" data
collection
- erase data records from an "add/update" data collection
- add data fields ("columns") or change the properties of
data
fields, erase data fields (with all their content!) even if the data
collection's access mode is not "full access"
Any user who
can access the "admin" subdirectory can
- add/edit/erase data records if the data collection is set
to "full access"
- add/edit data records if the data collection is set to
"add/update"
- add/edit/erase data fields if the data collection's access
mode is "full access"
Appendix B
Parameter settings in the configuration files
Please note that the settings for MolDB6 are kept in different files.
All those settings that are relevant both for the Perl command-line
scripts (for initialisasation and various administrative tasks) and for
the PHP scripts of the web-based user interface are stored in two
almost identical files, named moldb6.conf (in the working directory) and moldb6conf.php
(in the MolDB6 web directory). All variables in these two files should
have the same content; the only difference is that in moldb6conf.php
there must be the "<?php" tag in the first line and the "?>" tag
in the last line.
There are additional settings that are relevant only to the PHP scripts of the web frontend, they are kept in the file moldb6uiconf.php in the MolDB6 web directory.
Variables to be set in moldb6.conf / moldb6conf.php:
$database
= "moldb6"; # name of the
MySQL database
$hostname =
"localhost"; # hostname of MySQL server, probably
"localhost"
$clientname =
"localhost"; # name of MySQL client, usually
"localhost"
$mysql_admin =
"root";
# MySQL administrator, usually "root"
$rw_user =
"mdb6-admin"; # proxy user with CREATE and INSERT
privileges
$rw_password = "ExAmPlE123"; # (make sure this script is unreadable to others)
$ro_user = "mdb6-user"; # proxy user with SELECT privilege
$ro_password = "eXaMpLe456" # (better avoid dollar signs etc.)
$drop_db =
"y";
#
erase entire database before re-creating it?
In this section, MySQL-specific settings have to be made: a databse
name has to be specified. If $drop_db is set to "y", this database will
be erased and recreated every time you run the initdb.pl script. The
machine on which the MySQL server is running is specified by $hostname.
Usually, the MySQL server and the web server are running on the same
machine, in this case $hostname is set to "localhost" and
$clientname is also set to "localhost". The name of the MySQL
administrator user must be specified, ususally this is "root" (not to
be confused with the system "root" user under Linux/Unix systems). The
two proxy user accounts in MySQL will be automatically created when
initdb.pl is run.
$prefix
=
"";
#
this allows to have different MolDB6 instances
# in one MySQL database; each MolDB6 instance can
# contain multiple data collections
As explained in the comment, you can use this extra prefix
to
keep more than one MolDB6 installation in a single MySQL database or
you can place MolDB6 together with other database applications (such
as
bulletin boards, wikis, blogs, etc.) in a single MySQL database. This
makes sense only in environments where you are restricted to use only
one MySQL database.
$charset =
"utf8"; # may be
"latin1", "latin2" or "utf8"
$CHECKMOL =
"/usr/local/bin/checkmol";
$MATCHMOL =
"/usr/local/bin/matchmol";
The $charset variable controls both HTML and MySQL handling of
character sets. Currently, only latin1, latin2 and utf8 are supported. If other character sets are desired, please edit the
function set_charset() in the functions.php script.
In $CHECKMOL and $MATCHMOL, the executables of checkmol and matchmol
are specified. Under Linux, the programs are usually installed in the
/usr/local/bin
directory (most conveniently,
matchmol is just a hard
link to checkmol), under Windows it can be installed, for example, in
C:\bin
(but do not include the drive letter in
these values). Setting $CHECKMOL to "checkmol" would be also OK if the
program's
directory is included in the system's search path (beware: this must
apply also to the search path of the web server process).
On Unix-like machines (Linux, Mac OS X), the cmmmsrv server program can
be used by the PHP scripts instead of the checkmol/matchmol
command-line program. For details, see below in the section referring
to moldb6uiconf.php.
$MOL2SVG = "/usr/local/bin/mol2svg";
$mol2svgopt = "--rotate=auto3Donly
--hydrogenonmethyl=off --color=/usr/local/etc/color.conf"; # options
for mol2svg, e.g."--showmolname=on"
#$mol2svgopt = "--rotate=auto3Donly
--hydrogenonmethyl=off"; # options for mol2svg, e.g. "--showmolname=on"
$mol2svgopt_rxn = "-R --rotate=auto3Donly --hydrogenonmethyl=off"; # options for mol2svg in reaction mode
$svg_scalingfactor = 1.0; # 1.0 gives good results
$svg_scalingfactor_rxn = 0.80; # 0.75 is a good compromise for reactions
These settings refer to the mol2ps/mol2svg program which is used to
generate high-quality vector images for 2D depiction of the molecular
structures. For a detailed description of possible parameters, please
see the mol2ps/mol2svg homepage at https://homepage.univie.ac.at/norbert.haider/cheminf/mol2ps.html
$MOL2PS
= "/usr/local/bin/mol2ps";
$GHOSTSCRIPT = "/usr/bin/gs";
#$bitmapdir = "";
$bitmapdir =
"/var/www/htdocs/moldb6/bitmaps"; # this is the base
directory
$bitmapURLdir = "/moldb6/bitmaps";
#$mol2psopt =
"--rotate=auto3Donly
--hydrogenonmethyl=off --color=/usr/local/etc/color.conf"; # options
for mol2ps
$mol2psopt =
"--rotate=auto3Donly --hydrogenonmethyl=off"; # options for mol2ps,
e.g. "--showmolname=on"
$scalingfactor =
0.22;
# 0.22 gives good results
These options are relevant only if you want to use bitmap
graphics for 2D depiction of your molecular structures, otherwise set
$bitmapdir to an empty string ($bitmapdir = ""). $MOL2PS points to the
mol2ps program, $GHOSTSCRIPT to the GhostScript program (under Windows,
this must be the console application "gswin32c.exe"). The directory
specified in $bitmapdir must be writable to you, but not necessarily to
the web server. Whenever new structures are entered via the web
frontend, bitmap files are not immediately created, but queued for
creation by an auxiliary script "updatebitmap.pl" which should be run
in regular intervals by the user who has write permission to
$bitmapdir. Most conveniently, this is done by a cron job, e.g. once a
day.
Within the bitmap directory, subdirectories are created automatically
by updatebitmap.pl (as well as sdf2moldb.pl) for each data collection,
using the number ("db_id") of that data collection as the directory
name. For larger data collection (>10000 structures), it is
advisable to further divide the directory hierarchy by specifying a
value of (e.g.) 4 as $subdirdigits and (e.g.) 8 as $digits for the PNG
file names. This would give paths to the actual graphics files like
/var/www/htdocs/moldb6/bitmaps/1/0000/00000045.png
(in an URL, this
would be /moldb6/bitmaps/1/0000/00000045.png
).
Thus, the maximum number
of PNG files in one directory can be limited, otherwise you might run
into problems with some shell commands. In previous versions of MolDB,
the digit numbers were also set in the configuration file, but in the
current version they are set per data collection in the metadata table
via the web admin frontend.
For all available options of mol2ps, please refer to the mol2ps
homepage or just enter "mol2ps" on a command prompt.
$enable_inchi = "y"; # "y" or "n"
$INCHI = "/usr/local/bin/inchi-1"; # available from the IUPAC website
$INCHI_OPT = "-STDIO -Key"; # Linux: " -STDIO -Key", Windows: "/STDIO /Key /AuxNone"
These options are relevant only if you want to use the
auto_mol_inchikey feature. For this purpose, you will need the inchi-1
program which is available for download from the IUPAC website.
# further
settings=============================================================
$fpdict_mode =
1; # 1 = auto adjust,
2 = force 64 bit, 3 = force 32 bit
$scratchdir =
"/data/moldb/moldb-scratch"; # needed by cp2mem.pl,
# must be writeable by mysql UID, too
The $scratchdir variable holds the name of a directory for temporary
files which are used by "cp2mem.pl" when it copies the content of
disk-based tables (molstat and molcfp) to memory-based tables (see
Appendix C). It is very important that not only the user who runs
"cp2mem.pl" has write access to this directory, but also the MySQL
server process (which usually runs as user "mysql").
The last part of the configuration file contains various definitions
for table names and should not be changed, unless you have a very good
reason to do so.
Variables to be set in moldb6uiconf.php:
$sitename
= "MolDB6 demo"; # appears in title and headline
$cssfilename = "moldb.css"; # the default CSS (cascading style sheets) file
The $sitename string will appear in the header of all MolDB6 pages. The
$cssfilebname specifies the name of the stylesheet file which should be
present in the MolDB6 web directory (default: "moldb.css"). If you want
to experiment with your own stylesheets, you can use this setting to
quickly change the CSS file.
$use_cmmmsrv =
"y";
# set
to "y" if a cmmmsrv daemon is available
$cmmmsrv_addr = "127.0.0.1"; # must be
numeric
$cmmmsrv_port =
55624;
# the default cmmmsrv port is 55624
Whereas all the backend Perl scripts rely on checkmol/matchmol, the PHP
portion of the package can also use the cmmmsrv daemon instead, if
installed on your system. This server process provides the same
functionality as the command-line programs (checkmol/matchmol), but
communicates with the web server via TCP sockets instead of shell
calls. This gives somewhat better performance (see Appendix C). In
principle, cmmmsrv
can run on a different machine than the web server, in this case
$cmmmsrv_addr must be set to the (numeric) IP address of this machine.
$multiselect =
"y";
#
allow selection of multiple data collections
$default_db =
"1";
# default db_id (if more than one, use quotes and commas, e.g. "1,3")
$trustedIP = "127.0.0.1,
192.168.0.10,::1"; # max. 10 IP addresses
$enablereactions = "y"; # if not "y": no
reaction support
In this section, you can specify whether or not users can select only
one data collection for searching/browsing or all available data
collections. If there are more than one data collections, $default_db
specifies the one(s) which is/are automatically selected whenever a user
visits your MolDB6 website.
In $trustedIP, you can specify up to 10 (numeric) IP addresses for
extended administrative privileges. Besides protecting the "admin"
subdirectory with a password (using the standard mechanisms of the
Apache
web server, e.g. .htaccess files), the $trustedIP directive gives extra
protection. Only from a client machine listed in $trustedIP, you can
change the settings of a data collection (e.g. from read-only to
read/write) or erase an entire data collection. If you want to assign
special administrative priveleges to just one data collection to
somebody (e.g. for write access to a read-only data collection), use
the analogous "trusted IP address" specification in the web
administration frontend.
The $enablereactions variable controls whether or not MolDB6 presents
its support for reaction databases to the user. If set to "n", the
"Reaction Search" menu item will not be displayed and only SD data
collections are presented for selection. Nevertheless, the
administration front-end will contain all the tools for reaction
handling.
$enable_download = "y"; # download option for hit structures/reactions
$download_limit =
100; # maximum number
of hit structures/reactions to be downloaded (per search)
These two variables control if a user can download hitlists from
text/functional group/structure searches in SD/RD file format and how
many entries may be downloaded per SD/RD file.
$enable_adminlink = "y"; # "y" or "n"; show/hide "Administration" menu item
If set to "y", the menu contains an "Administration" link, pointing to the admin/index.php script.
$enable_prefs =
"y"; # "y"
or "n"; enable choice of structure editor (JME, JSME, Ketcher, FlaME)
$default_editor = "jsme"; # may be
"jme", "jsme", "ketcher" or "flame" (theoretically, also "text")
If $enable_prefs is set to "y", the user may choose his/her preferred
structure editor or text input (if enabled). This choice (done in a
pop-up window) makes sense only if there is more than one editor
available. A default editor should be specified, in case the user does
not select anything or has coolies disabled.
$enable_svg = "y"; # first choice
$enable_bitmaps = "y"; # second choice
$enable_jme =
"y"; # structure editor;
fallback for 2D display
$enable_jsme =
"y"; # structure editor;
fallback for 2D display (preferred)
$enable_ketcher = "y"; # structure editor for input only
$enable_flame = "y"; # structure editor for input only
$enable_textinput = "y"; # allow users to enter MDL molfiles/rxnfiles as text
These variables control the 2D structure display. The preferred option
for 2D display is SVG which is supported on all current browsers.
MolDB6 tries to figure out automatically the clients browser version
and decides if it uses SVG for structure images. For older browsers, it
is handy to have bitmap images instead. For this purpose, PNG files are
created as explained above. As a further fallback, the JME applet or
its JSME counterpart can be used to display molfiles/rxnfiles in
"depict" mode. This option, however may be problematic when long
hitlists are displayed, because too many instances of the applets are
used which consumes a lot of client resources.
GGA Ketcher and FlaME can be used only for structure/reaction input,
but not for depiction in hitlists. The $enable_textinput variable
controls if users are allowed to enter queries in MDL molfile/rxnfile
format in a text input box.
$java_codebase = "/classes"; # either leave empty or specify URL of the directory containing JME.jar
$jsme_path = "../jsme"; # absolute or relative path to the JSME directory
$ketcher_path = "../ketcher"; # absolute or relative path to the GGA Ketcher directory
$flame_swf = "../flame/flamer.swf"; # The FlaME swf file with absolute or relative path
For all installed (and enabled) editors, their location must be
specified. JME is an exception if the JME.jar file is installed in the
MolDB6 directory: in this case, the $java_codebase variable should be
set to an empty string.
If JSmol/JSV is used for structure or spectra display (using the two
new HTML display formats introduced in the 2018 release of MolDB6), the
location of the JSmol package must be specified as well as the
dimensions of the pop-up window that shows the structure/spectrum.
$jsmoldir = "../jsmol";
$jspopup_width = 1024;
$jspopup_height = 700;
Appendix C
Performance tuning
There are three options which should be considered when maximum
performance in substructre searches is desired:
- The use of "tweaked"
molfiles
for internal storage of structures instead of regular molfiles is enabled by default. This is
a feature of checkmol/matchmol: information about aromatic
substructures (per atom and per bond) is stored in unused columns of
the MDL molfile connection table. Whenever matchmol encounters such a
"tweaked" molfile, it can skip lengthy ring detection and aromaticity
determination routines (for a more detailed description, please check
the checkmol/matchmol
homepage).
- Using cmmmsrv
instead of
checkmol/matchmol: cmmmsrv is the socket-based server equivalent to the
command-line programs, checkmol and matchmol. The PHP scripts
communicate with cmmsrv in a similar way as with checkmol/matchmol, but
instead of shell calls (via the
popen()
function of PHP),
a TCP socket connection is established to the cmmmsrv process for
reading and writing. This allows for larger blocksizes to be
transferred between PHP and the cheminformatics engine, thus keeping
the overhead smaller. Please note that cmmmsrv is available only for
Linux and Mac OS X, not for Windows.
If you prefer checkmol/matchmol instead of cmmmsrv, there is another
performance improvement possible on a Linux system: replace the
standard shell "bash" with the smaller and faster "ash" shell. This can
be done by removing the symbolic link "/bin/sh
"
(which by default points to "/bin/bash
" and
creating a new symbolic link "/bin/sh
" which
points to "/bin/ash
" (thanks to Ernst-Georg
Schmid for this hint). Caution:
some Linux distributions may rely on /bin/sh
to be a symbolic link to /bin/bash
for their
startup scripts (in /etc/init.d
). To be on
the safe side, replace all occurrences of "#!/bin/sh
"
in these script with "#!/bin/bash
".
- Using memory-based
MySQL tables
instead of disk-based tables: the two tables "molstat" and "molcfp" are
particularly critical in terms of search performance for the fast
pre-selection process. MySQL offers the possibility to keep tables
entirely in memory, whereas only their data structure is permanently
stored on disk. These memory-based tables are accessed considerably
faster than their disk-based counterparts. As a disadvantage, the
content of memory-based tables is lost every time the MySQL server
process is restarted. In MolDB6, a combination of disk-based tables
and
memory-based copies of the original (persistent) tables is used. This
means, all the information about molecular statistics (molstat) and
fingerprints (molcfp) is permanently stored in the disk-based tables,
and as an option you can enable copies of these tables as memory-based
tables. In order to make use of this option, you have to set the
"usemem" flag in the moldb_meta table to "T" for the data collection in
question (this is done with the Perl script "setmem.pl") and
subsequently you have to copy the entire content of the molstat and
molcfp tables to memory (this is done with the Perl script "cp2mem.pl"
or its alternative script "cp2mem_slow.pl"; the latter does not require
write permissions for the MySQL process to a scratch directory). Please
note that the use of memory-based tables is recommended only for
"read-only" data collections, as any change to a chemical structure in
a data collection will disable the memory-based tables until you run
"cp2mem.pl" for the next time. For large data collections (e.g., with 1
million structures), it will be necessary to have sufficient RAM in
your machine and you probably have to adjust some settings in the MySQL
configuration file (my.cnf): set
max_heap_table_size
and tmp_table_size
to sufficiently high
values.
NH, 2014-07-30; last update:
2021-02-28