The checkmol/matchmol Homepage
What is checkmol/matchmol?
Which input formats are supported by checkmol/matchmol?
How can checkmol/matchmol be used?
How to obtain checkmol/matchmol?
What are the requirements of checkmol/matchmol?
Compiling and installing checkmol/matchmol
Usage (command-line options):
Features
Windows DLL version
Linux server version
Links
Contact
What is checkmol/matchmol?
Checkmol is a command-line utility program which reads molecular
structure files in different formats (see below) and analyzes the input
molecule for the presence of various functional groups and structural
elements. At present, approx. 200 different functional groups are
recognized. Output can be either clear text (English or German), a
bitstring or its ASCII representation, or a set of special 8-character
codes. This output can be easily placed into a database table,
permitting the creation of chemical databases with a functional group
search option. Here is a complete list of recognized functional groups
(PDF).
Another output option of checkmol is a set of statistical values
derived from a given molecule, which can also be used for quick
retrieval from a database. These values include: the number of atoms,
bonds, and rings, the number of differently hybridized carbon, oxygen,
and nitrogen atoms, the number of C=O double bonds, the number of rings
of different sizes, the number of rings containing nitrogen, oxygen,
or sulfur, the number of aromatic rings, the number of heterocyclic
rings, etc. The combination of all of these values for a given molecule
represents a kind of "fingerprint" which is useful for rapid
pre-selection in a database structure/substructure search prior to a
full atom-by-atom match (see below). For a fully functional set of PHP
scripts implementing such a web database (plus utility scripts for
data import), please visit the MolDB5 homepage.
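The pre-selection idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor names (n_atoms, n_rings, ...) are hypothetical placeholders and not the codes that checkmol actually prints, and the rule used here (an entry can only contain the query if none of its counts is smaller than the query's) is a simplified reading of how such a "fingerprint" is applied.

```python
# Sketch of count-based pre-selection before a full atom-by-atom match.
# The descriptor names below are hypothetical placeholders, not checkmol's
# actual output codes.

def may_contain(candidate: dict, query: dict) -> bool:
    """Return False if the candidate can be ruled out as a superstructure.

    A molecule that contains the query as a substructure must have at
    least as many atoms, rings, etc. as the query itself.
    """
    return all(candidate.get(key, 0) >= count for key, count in query.items())


# Hypothetical descriptor sets for a query and two database entries.
query = {"n_atoms": 8, "n_rings": 1, "n_C_O_double": 2, "n_N": 2}
database = {
    "entry_1": {"n_atoms": 20, "n_rings": 2, "n_C_O_double": 2, "n_N": 3},
    "entry_2": {"n_atoms": 12, "n_rings": 1, "n_C_O_double": 0, "n_N": 1},
}

# Only the survivors need to be passed on to the expensive full match.
survivors = [name for name, stats in database.items() if may_contain(stats, query)]
```

In a real database the same comparison would typically be expressed as a WHERE clause on indexed integer columns, so that most entries are rejected without ever being loaded.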
Matchmol complements the capabilities of checkmol. It compares two (or
more) molecular structures and determines whether one of them is a
substructure of the other one. This is done by a full atom-by-atom
comparison of the input structures. Thus, matchmol can be used as a
back-end program for structure/substructure search operations in
chemical databases (see below).
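To give a feeling for what an atom-by-atom comparison involves, here is a minimal backtracking sketch in Python. It is not matchmol's algorithm: the data layout (a list of element symbols plus a dictionary of bond orders) is an assumption made for this example, hydrogens are left out, and aromaticity, charges and stereochemistry are ignored.

```python
# Minimal backtracking sketch of an atom-by-atom substructure comparison.
# The data layout (element list + bond dictionary) is an assumption made
# for this example; it does not mirror matchmol's internal data structures.

def bond_order(mol, a, b):
    """Return the bond order between atoms a and b, or None if not bonded."""
    return mol["bonds"].get((min(a, b), max(a, b)))


def is_substructure(needle, haystack):
    """Return True if every needle atom and bond can be mapped onto the haystack."""
    n_atoms = len(needle["atoms"])

    def extend(mapping):
        i = len(mapping)              # index of the next needle atom to place
        if i == n_atoms:
            return True               # all needle atoms have a partner
        for j, element in enumerate(haystack["atoms"]):
            if j in mapping or element != needle["atoms"][i]:
                continue
            # Every needle bond to an already placed atom must exist in
            # the haystack with the same bond order.
            consistent = True
            for k in range(i):
                order = bond_order(needle, k, i)
                if order is not None and bond_order(haystack, mapping[k], j) != order:
                    consistent = False
                    break
            if consistent and extend(mapping + [j]):
                return True
        return False

    return extend([])


# Heavy-atom skeletons: a carbonyl group as needle, acetone and ethanol as haystacks.
carbonyl = {"atoms": ["C", "O"], "bonds": {(0, 1): 2}}
acetone = {"atoms": ["C", "C", "O", "C"], "bonds": {(0, 1): 1, (1, 2): 2, (1, 3): 1}}
ethanol = {"atoms": ["C", "C", "O"], "bonds": {(0, 1): 1, (1, 2): 1}}

in_acetone = is_substructure(carbonyl, acetone)
in_ethanol = is_substructure(carbonyl, ethanol)
```

The worst-case cost of such a search grows quickly with molecule size, which is why a cheap pre-selection step before the full match pays off.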
More detailed information is available in this publication:
Haider, N., Functionality Pattern Matching as an Efficient Complementary
Structure/Reaction Search Tool: an Open-Source Approach. Molecules, 15,
5079-5092 (2010).
Which input formats are supported by checkmol/matchmol?
As input files, MDL molfiles (*.mol; 2D and 3D), Alchemy molfiles
(*.mol), and Sybyl mol2-files (*.mol2) are currently understood by
checkmol/matchmol; the preferred format is the MDL molfile format. The
matchmol utility can also process MDL SD-files, which can contain
multiple molecular structures. At present, it is not intended to extend
the number of supported input file formats, as there are powerful file
format converters available, such as OpenBabel.
A detailed description of the MDL file formats (molfile, SD-file) is
available here.
How can checkmol/matchmol be used?
The main purpose of checkmol/matchmol is to permit the creation of
fully searchable, web-based molecular structure databases entirely with
free software. For example, a typical LAMP system (Linux, Apache,
MySQL, PHP) can be easily extended with checkmol/matchmol into a
chemical database with structure/substructure search options. A
detailed description of how this can be done is given here.
Another application is batch-mode processing of data files containing
multiple structures, in our case MDL SD files. For instance, one can
search a large SD file like the Maybridge screening collection for
uracil-containing molecules and write the matching molecules into
another SD file. This can be achieved with the following command:
matchmol -m uracil.mol maybridge-complete.sdf > maybridge-uracils.sdf
The -m option causes output of hits in MDL molfile format (including
any additional fields of the input SD file); uracil.mol contains the
query structure (the "needle") and maybridge-complete.sdf is the
database file (the "haystack"). Since version 0.2g of
checkmol/matchmol, there is no size limit for the "haystack" file.
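When such a batch search is started from a script rather than from the shell, the same call can be wrapped in Python. The following is a sketch, assuming a matchmol executable on the PATH; the helper names build_matchmol_command and write_hits are made up for this example, and the meaning of matchmol's exit code is not described here, so it is simply returned to the caller.

```python
import subprocess


def build_matchmol_command(needle, haystack, options="m"):
    """Assemble the argument list for a matchmol call like the one above."""
    return ["matchmol", "-" + options, needle, haystack]


def write_hits(needle, haystack, outfile):
    """Run matchmol and redirect its standard output into outfile.

    Requires a matchmol executable on the PATH. Returns the exit code.
    """
    with open(outfile, "w") as out:
        completed = subprocess.run(build_matchmol_command(needle, haystack), stdout=out)
    return completed.returncode


command = build_matchmol_command("uracil.mol", "maybridge-complete.sdf")
```

Passing the arguments as a list avoids shell quoting problems with unusual file names; the output redirection done by ">" in the shell is replaced by the stdout argument.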
How to obtain checkmol/matchmol?
The two programs are in fact only one program which is invoked by two
different names, i.e. there is only one source code. The utility is
freely available under the terms of the GNU General Public License
(GPL). For a detailed description of this license, please visit
http://www.gnu.org/copyleft/gpl.html.
Download:
Please visit the download directory at ../download/chemistry/checkmol/.
It contains the source code (checkmol.pas is a symbolic link to the
latest source file) as well as pre-compiled binaries for various
platforms (Windows, Mac OS X) in the "bin" subdirectory; there is also
a socket-based server version for Un*x-like systems (cmmmsrv) in the
"server" subdirectory.
For a brief description of the version history, please check the source
code.
What are the requirements of checkmol/matchmol?
The software is available both as source code and as a binary compiled
for Linux (x86 architecture). It is entirely written in Pascal and it
was compiled with Free Pascal 1.0.11 or Free Pascal 2.4.0 (starting
from v0.4c). The Free Pascal compiler is also freely available under
the GPL, and there are versions for a variety of operating systems and
computer architectures. For more information about Free Pascal, please
visit the project homepage at http://www.freepascal.org.
The binary executable of checkmol/matchmol was built on a SuSE 10.1 or
on an Ubuntu 10.04 system, but it should run on any other x86 Linux
distribution, as there are no special libraries required. Supported
platforms also include MS Windows (NT, 2000, XP).
Compiling and installing checkmol/matchmol
Compile with fpc (Free Pascal, see above), using the -Sd or -S2 option
(Delphi mode; this is IMPORTANT!)
Example for compilation and installation:
fpc checkmol.pas -S2 -O3 -Op3
Note: if you are running MacOS X, use the following command:
fpc checkmol.pas -S2 -Tdarwin
as described on the Macs in Chemistry website (i.e., do not use the
compiler optimisation flags)
This will give a file "checkmol.o" and a file "checkmol"; then, as
"root" user, do the following:
cp checkmol /usr/local/bin     (or any other directory in your path)
cd /usr/local/bin
ln checkmol matchmol     (ATTENTION: a symbolic link does not work!)
Note that checkmol and matchmol are the same executable, but the
program behaves differently depending on the name it was invoked with.
Of course, you can also copy "checkmol" to "matchmol" (instead of
making a link), but then it takes twice as much disk space (under
Windows, this is the only possibility, as there are no hard links
available under this "OS").
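The general technique of one executable that behaves differently depending on the name it was started with can be illustrated as follows. This is a Python sketch of the idea only, not a transcription of the Pascal source; the function name mode_from_invocation and the returned mode labels are hypothetical.

```python
import os


def mode_from_invocation(argv0):
    """Pick the operating mode from the name the program was started with."""
    name = os.path.basename(argv0).lower()
    if name.endswith(".exe"):         # tolerate Windows-style executable names
        name = name[:-4]
    if name == "matchmol":
        return "match"
    return "check"                    # default: behave like checkmol


# In a real program the argument would be sys.argv[0].
mode_a = mode_from_invocation("/usr/local/bin/matchmol")
mode_b = mode_from_invocation("/usr/local/bin/checkmol")
```

With a hard link or a copy, the program sees exactly the name that was typed, so a check of this kind yields the expected mode.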
Usage (command-line options):
checkmol can be invoked with the following arguments
checkmol [options] <filename>
where [options] can be:
-l  print a list of fingerprint codes + explanation and exit
-v  verbose output
-r  force SSR (set of small rings) ring search mode
-M  accept metal atoms as ring members
and one of the following:
-e  English text (common name of functional group; default)
-d  German text (common name of functional group)
-c  code (acronym-like code for functional group)
-b  bitstring (in decimal format) representing the presence of each group
-s  the ASCII representation of the above bitstring, i.e. 0s and 1s
-p  lists the position of each functional group (atom number of key atom)
-x  print molecular statistics (number of various atom types, bond types, ring sizes, etc.)
-X  same as above, listing all records (even if zero) as comma-separated list
-a  count charges in fingerprint
-m  write MDL molfile (with special encoding for aromatic atoms/bonds)
-h  hashed fingerprint mode with boolean output
-H  hashed fingerprint mode with decimal output
Options can be combined (like -vc); <filename> specifies any file in
the formats supported (MDL *.mol, Alchemy *.mol, Sybyl *.mol2); the
filename "-" (without quotes) specifies standard input.
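The ASCII bitstring produced by the -s option (a run of 0s and 1s) is easy to post-process in a script. The Python sketch below lists the positions of the set bits; counting positions from 0, left to right, is a choice made for this example, and which position stands for which functional group is not described in this section.

```python
def set_bit_positions(ascii_bits):
    """Return the 0-based positions of all '1' characters in a 0/1 string."""
    bits = ascii_bits.strip()
    if set(bits) - {"0", "1"}:
        raise ValueError("expected only the characters 0 and 1")
    return [pos for pos, char in enumerate(bits) if char == "1"]


# A short made-up bitstring, not real checkmol output.
positions = set_bit_positions("0100100001\n")
```

Two molecules can then be compared group by group, for example by intersecting their position lists.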
matchmol can be invoked with the following arguments
matchmol [options] <needle> <haystack>
where <needle> and <haystack> are the two molecules to compare
(supported formats: MDL *.mol, Alchemy *.mol, Sybyl *.mol2)
options can be:
-v  verbose output
-x  exact match
-s  strict comparison of atom and bond types
-r  force SSR (set of small rings) ring search mode
-m  write matching molecule as MDL molfile to standard output
-M  accept metal atoms as ring members
-n  additional output of atom numbers for matching atom pairs
-N  like -n, but only for the first matching substructure found
-g  check geometry of double bonds (E/Z)
-G  check geometry of chiral centers (R/S)
-a  check charges strictly
-i  check isotopes strictly
-d  check radicals strictly
-f  fingerprint mode (1 haystack, multiple needles) with boolean output
-F  fingerprint mode (1 haystack, multiple needles) with decimal output
Default output: record number + ":T" for hit or ":F" for miss, i.e., if
the haystack contains only one molecule, then the result will be "1:T"
or "1:F". The "haystack" can also be an MDL SD-file (containing
multiple molecules); if invoked with "-" as file argument, both
"needle" and "haystack" are read together as a single SD-file from
standard input, assuming the first entry in the SDF to be the "needle";
the output is: entry number + ":F" (false) or ":T" (true)
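The default output format described above (record number, colon, T or F) can be turned into a lookup table with a few lines of Python. This is a sketch assuming one result per line; the function name parse_matchmol_output is made up for this example.

```python
def parse_matchmol_output(text):
    """Map each record number to True (":T", hit) or False (":F", miss)."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record, _, flag = line.partition(":")
        if flag not in ("T", "F"):
            raise ValueError(f"unexpected line: {line!r}")
        results[int(record)] = (flag == "T")
    return results


# Made-up output for a haystack with three entries.
hits = parse_matchmol_output("1:F\n2:T\n3:F\n")
hit_numbers = [number for number, matched in hits.items() if matched]
```

A frontend script would typically use the resulting record numbers to pick the matching rows from its own database table.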
Features
At present, only smaller molecules are handled adequately, i.e. for
each molecule the maximum number of atoms is 1024, the maximum number
of bonds is 1024, the maximum ring size is 128 (i.e., rings larger than
128 members are treated as open-chain compounds), and the maximum
number of rings is 1024. Checkmol/matchmol collects the "set of all
rings" (SAR) instead of e.g. the "smallest set of smallest rings"
(SSSR).
Aromaticity is determined by application of the Hückel rule (4n + 2 pi
electrons) without any geometry checks, but with adequate treatment of
tautomeric/mesomeric structures where possible. For example,
1-methyl-2(1H)-pyridone is correctly recognized as aromatic, as well as
cyclopentadienyl anion, tropylium cation, fulvene, tropone, etc.
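The arithmetic behind the Hückel rule is simple enough to show directly. The Python sketch below only tests whether a given pi-electron count has the form 4n + 2; how checkmol arrives at that count for a ring (in particular for tautomeric or mesomeric structures) is not covered here. The electron counts are standard textbook values.

```python
def satisfies_huckel(pi_electrons):
    """True if the count can be written as 4n + 2 with n = 0, 1, 2, ..."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# pi-electron counts of a few textbook ring systems
examples = {
    "benzene": 6,
    "cyclopentadienyl anion": 6,
    "tropylium cation": 6,
    "cyclobutadiene": 4,
}
verdicts = {name: satisfies_huckel(count) for name, count in examples.items()}
```

The cyclopentadienyl anion and the tropylium cation both reach six pi electrons only through their charge, which is why charges have to enter the electron count.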
New in version 0.2: if a molecule contains more than 1024 rings, a
fallback mechanism changes the ring search mode from SAR to SSR (set of
small rings, which is defined as follows: ringsize <= 12 atoms, no ring
is completely contained in another one). For additional information,
please check the version history description in the source code.
Starting with versions 0.3d and 0.3f, matchmol supports stereospecific search operations,
either globally or on a per-atom or per-bond basis. Geometric isomers
of the E/Z type (aka cis/trans
isomers) are recognized as well as isomers with chiral centers (R/S
isomers). The latter type of isomer discrimination works with 3D
molfiles (using the XYZ coordinates) and with 2D molfiles (using "up"
and "down" bond notation) in any combination.
Starting with version 0.4, checkmol supports the generation of hash-based fingerprints
for efficient pre-selection in structure databases. The default values
are as follows: only linear fragments, minimum fragment length: 3
atoms, maximum fragment length: 8 atoms, 2 bits per fragment, total
bitstring length: 512 bits.
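How a hash-based fingerprint with these default values can work is sketched below in Python. The numbers (linear fragments of 3 to 8 atoms, 2 bits per fragment, 512 bits in total) are taken from the text above; everything else is an assumption made for this example, in particular the way a fragment is written as a string, the fact that bond orders are ignored, and the use of SHA-1 as hash function. It should not be read as checkmol's actual procedure.

```python
import hashlib

FP_BITS = 512             # total bitstring length
MIN_LEN, MAX_LEN = 3, 8   # fragment length in atoms
BITS_PER_FRAGMENT = 2


def linear_fragments(atoms, neighbors):
    """Enumerate all linear atom paths of MIN_LEN..MAX_LEN atoms as strings."""
    found = set()

    def walk(path):
        if len(path) >= MIN_LEN:
            symbols = [atoms[i] for i in path]
            # A path and its reverse describe the same fragment.
            found.add(min("-".join(symbols), "-".join(reversed(symbols))))
        if len(path) == MAX_LEN:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found


def hashed_fingerprint(atoms, neighbors):
    """Set BITS_PER_FRAGMENT bits per fragment in an FP_BITS-wide integer."""
    fingerprint = 0
    for fragment in linear_fragments(atoms, neighbors):
        for salt in range(BITS_PER_FRAGMENT):
            digest = hashlib.sha1(f"{salt}:{fragment}".encode()).digest()
            fingerprint |= 1 << (int.from_bytes(digest[:4], "big") % FP_BITS)
    return fingerprint


# Heavy-atom skeletons: propanol C-C-C-O as query, butanol C-C-C-C-O as entry.
propanol = (["C", "C", "C", "O"], {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
butanol = (["C", "C", "C", "C", "O"],
           {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]})
fp_query = hashed_fingerprint(*propanol)
fp_entry = hashed_fingerprint(*butanol)

# An entry can only contain the query if all query bits are also set in it.
passes_screen = (fp_query & fp_entry) == fp_query
```

Because different fragments may hash to the same bit, a passed screen is only a necessary condition; the full atom-by-atom match still has to confirm the hit.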
Starting with version 0.5, checkmol has an option (-p) to display all
occurrences of all detected functional groups in a molecule by listing
the corresponding "key atoms" (for a graphical representation of all
functional groups with their key atoms, see the document fgtable.pdf).
Windows DLL version
Although the program can be smoothly compiled with Free Pascal on the
Win32 platform as a console application, its encapsulation in a Windows
Dynamic Link Library (DLL) would have specific advantages, such as
seamless integration into database applications like MS Access (using
VBA as the link). Alessandro Barozza from PROCOS had the idea for this
DLL version and he also realized its implementation. Cited from
Alessandro's code header:
WHY?
I needed substructure matching capability.
I needed a dll for using with visual basic or VBA (MS-Access).
I needed to pass the mol file as string (from a memo field in a database
and not as a molfile on the disk)
so... I've modified the original matchmol
A more detailed description of the features of this DLL and how to use
it is given in the header of the source code (see download link
below). Alternatively, you can use the Barsoi DLL, a library based on a
C port of checkmol/matchmol which has been developed as a part of the
pgchem::tigress project by Ernst-Georg Schmid (see below).
Download:
Linux server version
cmmmsrv, a socket-based server program providing checkmol/matchmol
functionality, has been developed as a replacement for the
checkmol/matchmol command-line program in web-based molecular structure
databases and related applications. Communication of any frontend
program (e.g., a PHP script) with cmmmsrv takes place via sockets
instead of shell calls, thus saving a significant amount of time.
Download:
source code: cmmmsrv.pas (400 KB)
compiled Linux (i586) binary: cmmmsrv.gz
documentation: readme.txt
Examples for using cmmmsrv can be found in the MolDB5R package (e.g.,
in the script incss.php).
Links
- pgchem::tigress (http://pgfoundry.org/projects/pgchem/)
Pgchem::tigress is a cheminformatics extension to the PostgreSQL DBMS.
It enables PostgreSQL to store, retrieve and search molecules by pure
SQL statements. It uses checkmol/matchmol and OpenBabel and optionally
the Barsoi accelerator library.
Barsoi, which is based on a C port of checkmol/matchmol, can also be
used as a dynamically linked library to provide checkmol/matchmol
functionality to other programs, or to build checkmol/matchmol on
platforms where no Pascal, but an ANSI-C compiler is available.
Developer: Ernst-Georg Schmid (Bayer Business Services GmbH,
Leverkusen, Germany).
- chemtool (http://ruby.chemie.uni-freiburg.de/~martin/chemtool/chemtool.html)
Chemtool is a small program for drawing chemical structures on Linux
and Unix systems using the GTK toolkit under X11. Starting with
developer version 1.7, it adds the beginning of database support with
(sub)structure searches in SDF files or MySQL databases using the
checkmol/matchmol program. Developer: Martin Kroeker (University of
Freiburg, Germany).
- Small Molecule Interaction Database (SMID) (http://smid.blueprint.org/)
SMID is an expanding database of small molecule - domain
interactions determined from MMDB records. All information is stored in
SMID database records that are freely available through a web
interface. Among other classification criteria, a newly designed
chemical ontology organises compounds by their functional groups which
are automatically assigned by the checkmol program.
- CSEARCHlite (http://nmrpredict.orc.univie.ac.at/csearchlite/)
A web-based version of Wolfgang Robien's CSEARCH NMR spectral database
and prediction system uses checkmol/matchmol as the engine of its
structure/substructure/functional group search facility (approx.
140,000 structures).
- open enventory (http://www.open-enventory.de/)
A web-based integrated lab journal and chemicals inventory, developed
at the Technical University of Kaiserslautern/Germany (contact: F.
Rudolphi). This open-source package makes use of matchmol technology
for substructure searching.
- PiHKAL-info Search Page (https://isomerdesign.com/PiHKAL/search.php)
PiHKAL · info is a visual index and map of the book “PiHKAL: A Chemical
Love Story,” by Alexander & Ann Shulgin.
- SyBOrCh Chemicals Database
An open-source chemicals database written by John Braun at the Vrije
Universiteit Amsterdam. It runs on the PHP framework Laravel and allows
for substructure searches, using checkmol/matchmol and JSDraw.
Contact
Checkmol/matchmol was written by Norbert Haider, Department of
Pharmaceutical Chemistry, University of Vienna, Austria.
You can contact me by e-mail: norbert.haider@univie.ac.at
(no spam, no viruses, please).
N. Haider, 2003-12-01; last update: 2018-04-10