[Molecularmechanics] Molecular structure
Peter Murray-Rust
molecularmechanics@tddft.org
Fri, 24 Oct 2003 16:35:08 +0100
(0) How do you wish to work? Do you want people to reply to the list, which
is then edited, or to annotate the wiki?
(1) These are very early thoughts.
>A full description of a molecular system includes:
* a description of the molecules
I will give the CML approach.
Molecules can be composed of atoms or of other molecules, and optionally
bonds. The hierarchy can therefore hold, for example:
haemoglobin:
2*(alpha subunit+heme) + 2*(beta+heme)
Molecules, atoms and bonds can all have extensible properties,
some specified in the language, some added by the user.
Molecules can have symmetry operations pertaining to them, so they
can have different local frames.
Molecules can have other versions (either copies or different
views). This avoids repeating common information such as element types but
allows changes in coordinates.
Molecules and atoms can have velocities and displacements.
* the topology and geometry of the system (periodic boundary conditions etc.)
and possibly, depending on the type of simulation:
We have started to address this in CMLCM, though it is very early.
* the thermodynamic ensemble
What is included in this? The thermodynamic quantities?
* a description of the non-obvious dynamic variables (e.g. fluctuating
charges)
If these are charges on atoms, CML can hold them and their variation.
* constraints
CMLComp has limited support for this.
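To make the nested-molecule idea above concrete, here is a minimal sketch in
Python. It is not CML's actual data model: the class names, the repeat-count
convention and the tiny placeholder atom lists are all assumptions made for
illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Atom:
    id: str
    element: str
    # Extensible, user-defined properties (hypothetical representation).
    properties: Dict[str, float] = field(default_factory=dict)


@dataclass
class Molecule:
    id: str
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Tuple[str, str]] = field(default_factory=list)  # optional
    # Child molecules, each paired with a repeat count.
    children: List[Tuple[int, "Molecule"]] = field(default_factory=list)

    def atom_count(self) -> int:
        """Count own atoms plus those of all children, honouring repeats."""
        own = len(self.atoms)
        return own + sum(n * child.atom_count() for n, child in self.children)


# Placeholder contents: real subunits obviously hold far more atoms.
heme = Molecule("heme", atoms=[Atom("fe1", "Fe")])
alpha = Molecule("alpha", atoms=[Atom("a1", "N"), Atom("a2", "C")])
beta = Molecule("beta", atoms=[Atom("b1", "N"), Atom("b2", "C")])

alpha_heme = Molecule("alpha_heme", children=[(1, alpha), (1, heme)])
beta_heme = Molecule("beta_heme", children=[(1, beta), (1, heme)])

# haemoglobin: 2*(alpha subunit+heme) + 2*(beta+heme)
haemoglobin = Molecule("haemoglobin", children=[(2, alpha_heme), (2, beta_heme)])
```

Because the heme definition is shared rather than copied, common information
is stored once, which is the point of the "other versions" remark above.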
2 Molecules and related objects
A fundamental decision is how to deal with the hierarchical structure of
complex molecules. Here are some options, which represent extreme choices;
many intermediates are conceivable. All represent a peptide chain for
illustration.
Note of caution. This is well explored in both CML and mmCIF, etc. It would
be valuable to reuse their concepts where possible, as you may then get
extensive existing software. Remember that every element needs software to
drive it.
* Flat representation
<molecule>
  <atom id="1" element="C" name="CA" residue="ALA" resnumber="1"
        chainnumber="1"/>
  <atom id="2" element="N" name="N" residue="ALA" resnumber="1"
        chainnumber="1"/>
  ...
</molecule>
Advantage: simplicity
Disadvantage: structure difficult to deduce
Agreed. Similar to PDB. FWIW early CML supported this. Note that XML IDs
(if you wish to use them) must start with a letter or underscore, so id="1"
would not be valid. You can instead define id as CDATA (which is what I do).
Note also that it may help to structure ID names to avoid collisions when
merging files.
CML2 (and CML1) support a much condensed version:
<atomArray elementType="C N..." atomID="a1 a2..." x3="1.23 2.34..."/>
There are no line limits in XML, but in any case the attribute values can be
broken at whitespace.
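As a rough sketch of how software might consume the condensed form, the
following Python expands the whitespace-delimited attributes back into one
record per atom. The attribute names are those shown above; the three-atom
values and the function name expand_atom_array are illustrative assumptions,
not part of CML.

```python
import xml.etree.ElementTree as ET

# Illustrative values only; one attribute is deliberately broken across lines.
snippet = """<atomArray elementType="C N O"
                        atomID="a1 a2
                                a3"
                        x3="1.23 2.34 3.45"/>"""


def expand_atom_array(xml_text):
    """Turn whitespace-delimited attribute lists into one dict per atom."""
    array = ET.fromstring(xml_text)
    elements = array.get("elementType", "").split()
    ids = array.get("atomID", "").split()
    xs = [float(v) for v in array.get("x3", "").split()]
    # The lists are only meaningful if they line up one-to-one.
    if not (len(elements) == len(ids) == len(xs)):
        raise ValueError("attribute lists differ in length")
    return [{"id": i, "element": e, "x3": x} for i, e, x in zip(ids, elements, xs)]


atoms = expand_atom_array(snippet)
```

The length check is the kind of constraint that the flat per-atom form gets
for free and the condensed form has to enforce in software.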
* Specific tree representation
<peptidechain>
  <residue type="ALA" sequencenumber="1">
    <peptidegroup>
      <atom id="1" element="N" name="N"/>
      ...
    </peptidegroup>
    <sidechain>
      <atom id="2" element="C" name="CA"/>
      ...
    </sidechain>
  </residue>
  ...
</peptidechain>
Advantage: clear semantics
Disadvantage: rigid structure
Agreed. My main comment is that you are inventing a language specifically for
proteins. What happens for other groups?
* Generic tree representation
<group type="peptidechain">
  <group type="residue" name="ALA">
    <group name="peptidegroup">
      <atom id="1" element="N" name="N"/>
      ...
    </group>
    <group name="sidechain">
      <atom id="2" element="C" name="CA"/>
      ...
    </group>
  </group>
  ...
</group>
Advantage: very flexible
Disadvantage: analysis of structure takes some effort
These are a good illustration of the varieties of abstraction. If you want
hierarchy I would prefer the generic tree (option 3) to the specific tree
(option 2) if it can be worked out, but note that it requires more software
and constraint checking.
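To give a feel for the extra software the generic tree needs, here is a
minimal Python sketch that walks nested group elements and recovers, for each
atom, the chain of groups it sits in. The function name atoms_with_path is
hypothetical, the elided "..." parts are left out, and the fallback from name
to type is an assumption prompted by the example using both attributes.

```python
import xml.etree.ElementTree as ET

doc = """
<group type="peptidechain">
  <group type="residue" name="ALA">
    <group name="peptidegroup">
      <atom id="1" element="N" name="N"/>
    </group>
    <group name="sidechain">
      <atom id="2" element="C" name="CA"/>
    </group>
  </group>
</group>
"""


def atoms_with_path(node, path=()):
    """Yield (path, atom) pairs; path lists the labels of enclosing groups."""
    for child in node:
        if child.tag == "group":
            # Groups are labelled by name where present, otherwise by type.
            label = child.get("name") or child.get("type")
            yield from atoms_with_path(child, path + (label,))
        elif child.tag == "atom":
            yield path, child


root = ET.fromstring(doc)
found = [("/".join(p), a.get("name")) for p, a in atoms_with_path(root)]
```

Nothing here checks that a residue actually contains a peptidegroup and a
sidechain; with generic elements that rule has to be written as code rather
than expressed in the element names.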
Note that you can also have references:
<peptidechain>
  <residue idRef="ALA" sequencenumber="1"/>
  <residue idRef="GLU" sequencenumber="2"/>
  ...
</peptidechain>
and in a separate file:
<residue id="ALA" sequencenumber="1">
  <peptidegroup>
    <atom id="1" element="N" name="N"/>
    ...
  </peptidegroup>
  <sidechain>
    <atom id="2" element="C" name="CA"/>
    ...
  </sidechain>
</residue>
XML has only very limited support for id->idref, but it is a fundamental concept.
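Since the parser will not follow such references for you, the application has
to. A minimal Python sketch of one way to do it follows. The residues wrapper
element, the function name resolve and the use of ALA for both positions are
assumptions made so the example is self-contained; they are not taken from
CML or from the proposal.

```python
import copy
import xml.etree.ElementTree as ET

# Stand-in for the separate file of residue definitions.
library = ET.fromstring("""
<residues>
  <residue id="ALA">
    <peptidegroup><atom id="1" element="N" name="N"/></peptidegroup>
    <sidechain><atom id="2" element="C" name="CA"/></sidechain>
  </residue>
</residues>
""")

chain = ET.fromstring("""
<peptidechain>
  <residue idRef="ALA" sequencenumber="1"/>
  <residue idRef="ALA" sequencenumber="2"/>
</peptidechain>
""")


def resolve(chain, library):
    """Copy each referenced residue definition into the chain, in place."""
    definitions = {r.get("id"): r for r in library.iter("residue")}
    for ref in chain.findall("residue"):
        target = definitions.get(ref.get("idRef"))
        if target is None:
            raise KeyError("unresolved idRef: %r" % ref.get("idRef"))
        for child in target:
            # Deep copy so that each residue gets independent atoms.
            ref.append(copy.deepcopy(child))
    return chain


resolved = resolve(chain, library)
```

Note that after expansion both residues contain atoms with the same id
values, which is exactly the collision problem mentioned earlier; a real tool
would need to qualify them, for instance with the sequence number.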
P.