[Molecularmechanics] CML info on the Wiki
Peter Murray-Rust
molecularmechanics@tddft.org
Tue, 02 Dec 2003 18:39:41 +0000
At 11:51 02/12/2003 +0100, Konrad Hinsen wrote:
>On Monday 01 December 2003 19:18, Peter Murray-Rust wrote:
>
> > We have now autogenerated Wiki pages from *all* the
> > elements/attributes/dataTypes in CML and these are available at
> > http://wwmm.ch.cam.ac.uk/moin/CmlSchemaComponents.
[I should point out that this is read-only. If anyone wishes to join as a
developer, please mail me.]
>Thanks! This looks useful to have around.
>
> > There are about 200 schema components in all and although quite a few are
> > irrelevant to FSATOM this indicates the scale of the problem that is
> > involved. There are a number of additional components that still need to be
>
>Indeed, but it also indicates a problem that CML users have, in particular
>authors of programs that want to write CML files: in the absence of an
>inverse documentation (which is not easy to produce), they have to read all
>of this carefully to figure out what elements they can and should use for
>their particular needs.
This is a valid point. It applies to many systems - Java libraries, XML
protocols, etc. - which are a formalization of the real world. Thus creating
a date in Java using Calendar can involve you in the complexity of
different calendars, time zones, etc. CML shares part of that complexity -
for example we have spent much time recently updating CMLReact so that it
can support most of what people call "reactions".
CML is therefore designed to be used as a series of components rather than
a "one size fits all" chemical system. We are developing an approach to create
schemas tailored to specific applications. Thus - in principle - GROMACS
developers might select those components which the program supports or
interacts with. ABINIT would select a different set.
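
To make the idea of a tailored subset concrete, here is a minimal sketch in
Python. It is not part of any CML toolkit: the SUPPORTED set, the function
names and the sample document are all assumptions chosen for illustration,
and a real tailored schema would be derived from the CML Schema rather than
from a hand-written list of element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical component selection for one application. The names are
# illustrative and do not represent an official CML profile.
SUPPORTED = {"cml", "molecule", "atomArray", "atom", "bondArray", "bond"}


def local_name(tag):
    """Strip an XML namespace written as '{uri}name', if present."""
    return tag.rsplit("}", 1)[-1]


def unsupported_elements(xml_text, supported=SUPPORTED):
    """Return the sorted element names in xml_text outside the chosen subset."""
    root = ET.fromstring(xml_text)
    found = {local_name(el.tag) for el in root.iter()}
    return sorted(found - supported)


SAMPLE = """<cml>
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="O"/>
      <atom id="a2" elementType="H"/>
    </atomArray>
    <spectrum id="s1"/>
  </molecule>
</cml>"""

# Report every element the application has not chosen to support.
print(unsupported_elements(SAMPLE))
```

A program could run a check like this on its input and either reject or
ignore the elements it has not selected.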
There are a number of CML elements which relate to the macroscopic world
(amount, abundance, etc.) that won't be needed. The fooList elements are
also usually very simple - containers for one or more foos. But complex
programs (GAMESS, Gaussian, MOPAC) can calculate reactions and spectra and
so would be able to make use of these elements. They cover a lot of chemistry!
In general most users will only use a small part of the system and I am
keen to keep the number of elements down. Newcomers may be overawed by the
apparent complexity of the system, and we have to overcome this with
automatic documentation, examples, etc. It is also in our interests to keep
CML manageable - we have to write code for every element.
The approach depends on being able to make components independent and to
create an API automatically from the schema. So far this looks feasible.
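
As a rough illustration of what "an API from the schema" could mean, the
Python sketch below builds one class per component from a small table. The
SCHEMA dictionary is a hand-written stand-in, not the real CML Schema, and
make_component_class and to_xml are hypothetical names; our actual generator
is not shown here.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical, heavily simplified "schema": component name -> attribute names.
# A real generator would read the CML Schema itself.
SCHEMA = {
    "atom": ["id", "elementType", "formalCharge"],
    "bond": ["id", "atomRefs2", "order"],
}


def make_component_class(name, attributes):
    """Build a class that accepts one keyword argument per schema attribute."""

    def __init__(self, **values):
        unknown = set(values) - set(attributes)
        if unknown:
            raise TypeError(f"{name}: unknown attribute(s) {sorted(unknown)}")
        for attr in attributes:
            setattr(self, attr, values.get(attr))

    def to_xml(self):
        # Serialise only the attributes that were actually set.
        parts = [
            f"{a}={quoteattr(str(getattr(self, a)))}"
            for a in attributes
            if getattr(self, a) is not None
        ]
        return f"<{name} {' '.join(parts)}/>" if parts else f"<{name}/>"

    return type(name.capitalize(), (object,),
                {"__init__": __init__, "to_xml": to_xml})


# One generated class per schema component.
API = {name: make_component_class(name, attrs) for name, attrs in SCHEMA.items()}

atom = API["atom"](id="a1", elementType="C")
print(atom.to_xml())
```

Because each class depends only on its own entry in the table, components
stay independent of each other, which is the property the approach relies on.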
>How does this work in practice? Has anyone developed CML-enabled programs
>without the active help from the CML development team?
Yes. The CDK/Jmol/JChempaint (Java) groups have all based their reference
I/O on CML and written their own toolkits. JOElib (Java) has also created a
large CML library based solely on the CML Schema published earlier this
year. I have written a C library for OpenBabel and am converting this to
C++. All these are opensource and can be used as components in any system
(as long as GPL is acceptable). There is a *lot* of functionality in these
systems (legacy file reading, aromatic perception, substructure search,
tautomerism), and FSATOM may be able to make some use of it.
Our approach is to create a toolkit which supports each schema component
individually. Users will then pick the components they require and add
non-schema functionality from toolkits such as those above to create
applications. In general CML toolkits will be mainly used in program I/O,
scripting and glueware and not in compute-intensive regions. We expect that
applications will read from CML, convert to an internal data structure,
compute, reconvert to CML and output.
> > We have also developed a code generator which has been used to create the
> > Wiki. We are also hacking Java and C++ and would be interested to have a
> > look at python when this is needed.
>
>What would such code generators be used for?
The Wiki and the Java have already been automatically generated. We intend
to extend this to C++, F90 and Python. The code would provide a means to read
CML generically, with validation, and an API for accessing the data read
in. Output is the reverse. A typical sequence (in pseudocode) is:
doc1 = readCML(filename1)
if (!doc1.validate()) throw error()
mol = doc1.getFirstMolecule()
doc2 = readCML(filename2)
if (!doc2.validate()) throw error()
ff = doc2.getForceField()
//... application code goes here
// internalMol and internalff are program-specific data structures
internalMol = mol.toInternal()
internalff = ff.toInternal()
internalMol.optimise(internalff)
mol2 = internalMol.toCML()
// end app
// write the optimised molecule, not the one originally read in
mol2.writeXML(filename3)
The CML toolkit provides everything except the lines that mention "internal".
This is already how I use the CDK library to add functionality to CML.
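
For anyone who wants to try the read / convert / compute / write-back cycle
today, here is a small runnable sketch using only the Python standard
library. It is not the generated CML API: validate, to_internal, compute and
to_cml are hypothetical helpers, validation is reduced to a simple attribute
check, and the force-field step is replaced by a trivial placeholder
(recentring the coordinates) so that the example stays self-contained.

```python
import xml.etree.ElementTree as ET

INPUT = """<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.0" y3="0.0" z3="0.0"/>
    <atom id="a2" elementType="H" x3="0.96" y3="0.0" z3="0.0"/>
  </atomArray>
</molecule>"""


def validate(molecule):
    """Stand-in for schema validation: each atom needs id and elementType."""
    atoms = molecule.findall("./atomArray/atom")
    return bool(atoms) and all(
        a.get("id") and a.get("elementType") for a in atoms)


def to_internal(molecule):
    """CML -> program-specific structure (here simply a list of dicts)."""
    return [
        {"id": a.get("id"), "element": a.get("elementType"),
         "xyz": [float(a.get(k, "0")) for k in ("x3", "y3", "z3")]}
        for a in molecule.findall("./atomArray/atom")
    ]


def compute(atoms):
    """Placeholder for the application's work: move the centroid to the origin."""
    n = len(atoms)
    centre = [sum(a["xyz"][i] for a in atoms) / n for i in range(3)]
    return [dict(a, xyz=[c - m for c, m in zip(a["xyz"], centre)])
            for a in atoms]


def to_cml(atoms, mol_id):
    """Program-specific structure -> CML molecule element."""
    molecule = ET.Element("molecule", id=mol_id)
    atom_array = ET.SubElement(molecule, "atomArray")
    for a in atoms:
        x, y, z = (f"{v:.3f}" for v in a["xyz"])
        ET.SubElement(atom_array, "atom", id=a["id"],
                      elementType=a["element"], x3=x, y3=y, z3=z)
    return molecule


mol = ET.fromstring(INPUT)              # read
if not validate(mol):                   # validate
    raise ValueError("input molecule failed validation")
internal_mol = compute(to_internal(mol))  # convert and compute
mol2 = to_cml(internal_mol, "m2")       # reconvert
print(ET.tostring(mol2, encoding="unicode"))  # output
```

Only to_internal, compute and to_cml touch the program-specific structure;
everything else is the kind of work a CML toolkit would take over.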
Peter
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069