[Molecularmechanics] Molecular structure

Peter Murray-Rust molecularmechanics@tddft.org
Fri, 24 Oct 2003 16:35:08 +0100


(0) How do you wish to work? Do you want people to reply to the list, which 
is then edited, or to annotate the wiki?

(1) These are very early thoughts.

 >A full description of a molecular system includes:
* a description of the molecules

I will give the CML approach
         molecules can be composed of atoms or molecules, and optionally 
bonds. The hierarchy can therefore hold:
                 haemoglobin:
                         2*(alpha subunit + haem) + 2*(beta subunit + haem)
         molecules, atoms and bonds can all have extensible properties, 
some specified in the language, some extensible by user
         molecules can have symmetry operations pertaining to them so they 
can have different local frames
         molecules can have other versions (either copies or different 
views). This avoids repeating common information like element types but 
allows changes in coordinates
         molecules and atoms can have velocities and displacements

* the topology and geometry of the system (periodic boundary conditions etc.)
         we have started to address this in CMLCM though it is very early
and possibly, depending on the type of simulation:

* the thermodynamic ensemble
         what is included in this? the thermodynamic quantities?
* a description of the non-obvious dynamic variables (e.g. fluctuating 
charges)
         if these are charges on atoms CML can hold them and their variation
* constraints
         CMLComp has limited support for this

2 Molecules and related objects
A fundamental decision is how to deal with the hierarchical structure of 
complex molecules. Here are some options that represent extreme choices; 
many intermediates are conceivable. All represent a peptide chain for 
illustration.

Note of caution. This is well explored in CML, mmCIF, etc. It would be 
valuable to reuse their concepts where possible, as you may then be able to 
reuse extensive existing software. Remember that every element needs 
software to drive it.

* Flat representation
<molecule>
   <atom id="1" element="C" name="CA" residue="ALA" resnumber="1" chainnumber="1"/>
   <atom id="2" element="N" name="N" residue="ALA" resnumber="1" chainnumber="1"/>
...
</molecule>
Advantage: simplicity
Disadvantage: structure difficult to deduce

Agreed. Similar to PDB. FWIW early CML supported this. Note that XML IDs 
(if you wish to use them) must start with a letter or underscore, not a 
digit. You can define id as CDATA instead (which is what I do). Note also 
that it may help to structure ID names to avoid collisions when merging files.
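
As a minimal sketch of the ID point above, assuming Python's standard 
xml.etree.ElementTree and a made-up prefix scheme ("m1_a", "m2_a"; the 
helper prefix_ids and the <system> wrapper are hypothetical, not part of 
CML), numeric ids can be turned into letter-initial, collision-free ids 
before two files are merged:

```python
import re
import xml.etree.ElementTree as ET

# Simplified check for an XML name: it may not start with a digit.
# (ASCII only; the real XML name rule allows many more characters.)
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9._-]*$")

def prefix_ids(root, prefix):
    """Prefix every id attribute in the tree (hypothetical helper)."""
    for elem in root.iter():
        if "id" in elem.attrib:
            elem.set("id", prefix + elem.get("id"))
    return root

# Two files that both use the id "1", as in the flat example.
file_a = ET.fromstring(
    '<molecule><atom id="1" element="C"/><atom id="2" element="N"/></molecule>')
file_b = ET.fromstring('<molecule><atom id="1" element="O"/></molecule>')

# Merge under a made-up wrapper element, one prefix per source file.
merged = ET.Element("system")
merged.append(prefix_ids(file_a, "m1_a"))
merged.append(prefix_ids(file_b, "m2_a"))

ids = [atom.get("id") for atom in merged.iter("atom")]
```

After the merge, ids holds one distinct, letter-initial value per atom, so 
the same names would also be acceptable if id were later declared as a 
true XML ID.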

CML2 (and CML1) support a much more condensed version:
<atomArray elementType="C N..." atomID="a1 a2..."  x3="1.23 2.34..."/>
There are no line-length limits in XML, but in any case the attribute values 
can be broken at whitespace.
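
To illustrate how software might read the condensed form, here is a rough 
sketch using Python's standard xml.etree.ElementTree. The two-atom values 
and the helper name expand_atom_array are made up for illustration; only 
the attribute names come from the line above. Each attribute is split on 
whitespace (which also copes with values broken across lines) and the 
columns are zipped back into one record per atom:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-atom array; the x3 value is deliberately broken
# across lines to show that whitespace splitting still works.
xml_text = '''<atomArray elementType="C N"
                         atomID="a1 a2"
                         x3="1.23
                             2.34"/>'''

def expand_atom_array(elem):
    """Turn whitespace-separated attribute lists into per-atom dicts."""
    columns = {name: value.split() for name, value in elem.attrib.items()}
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("attribute lists differ in length")
    count = lengths.pop()
    # Values are kept as strings; a real reader would convert x3 to float.
    return [{name: columns[name][i] for name in columns} for i in range(count)]

atoms = expand_atom_array(ET.fromstring(xml_text))
```

The length check is the price of the compact form: nothing in the XML 
itself guarantees that the lists line up, so the software has to verify it.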

* Specific tree representation
<peptidechain>
   <residue type="ALA" sequencenumber="1">
     <peptidegroup>
       <atom id="1" element="N" name="N"/>
       ...
     </peptidegroup>
     <sidechain>
       <atom id="2" element="C" name="CA"/>
       ...
     </sidechain>
   </residue>
   ...
</peptidechain>
Advantage: clear semantics
Disadvantage: rigid structure

Agreed. The main comment is that you are inventing a language specifically 
for proteins. What happens for other kinds of group?

* Generic tree representation
<group type="peptidechain">
   <group type="residue" name="ALA">
     <group name="peptidegroup">
       <atom id="1" element="N" name="N"/>
       ...
     </group>
     <group name="sidechain">
       <atom id="2" element="C" name="CA"/>
       ...
     </group>
   </group>
   ...
</group>
Advantage: very flexible
Disadvantage: analysis of structure takes some effort

These are a good illustration of the varieties of abstraction. If you want 
hierarchy I would prefer the third option (generic tree) to the second 
(specific tree) if it can be worked out, but note that it requires more 
software and constraint checking.
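
A rough sketch of the extra software the generic tree needs, using 
Python's standard xml.etree.ElementTree. It assumes every group carries a 
type attribute (the quoted example mixes type and name), and the 
ALLOWED_PARENT table and both function names are hypothetical. One 
function recovers the group path of each atom, the other checks that 
groups are nested sensibly:

```python
import xml.etree.ElementTree as ET

xml_text = """
<group type="peptidechain">
  <group type="residue" name="ALA">
    <group type="peptidegroup">
      <atom id="a1" element="N" name="N"/>
    </group>
    <group type="sidechain">
      <atom id="a2" element="C" name="CA"/>
    </group>
  </group>
</group>
"""

# Hypothetical nesting rules: group type -> required parent type.
ALLOWED_PARENT = {
    "peptidechain": None,
    "residue": "peptidechain",
    "peptidegroup": "residue",
    "sidechain": "residue",
}

def atoms_with_path(elem, path=()):
    """Yield (atom id, tuple of enclosing group types), depth first."""
    if elem.tag == "group":
        path = path + (elem.get("type", "?"),)
    for child in elem:
        if child.tag == "atom":
            yield child.get("id"), path
        else:
            yield from atoms_with_path(child, path)

def nesting_errors(elem, parent_type=None):
    """Return messages for groups that sit under the wrong parent."""
    errors = []
    if elem.tag == "group":
        kind = elem.get("type")
        if kind not in ALLOWED_PARENT or ALLOWED_PARENT[kind] != parent_type:
            errors.append(f"{kind} not allowed under {parent_type}")
        parent_type = kind
    for child in elem:
        errors.extend(nesting_errors(child, parent_type))
    return errors

root = ET.fromstring(xml_text)
paths = list(atoms_with_path(root))
errors = nesting_errors(root)
```

With the specific tree, a schema or DTD could enforce the nesting; with the 
generic tree the rules live in a table like ALLOWED_PARENT and have to be 
checked by the application.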

Note that you can also have references:
<peptidechain>
   <residue idRef="ALA" sequencenumber="1"/>
   <residue idRef="GLU" sequencenumber="2"/>
   ...
</peptidechain>

and in a separate file:

   <residue id="ALA" sequencenumber="1">
     <peptidegroup>
       <atom id="1" element="N" name="N"/>
       ...
     </peptidegroup>
     <sidechain>
       <atom id="2" element="C" name="CA"/>
       ...
     </sidechain>
   </residue>

XML has only very limited support for id->idref (ID and IDREF attribute 
types declared in a DTD, checked within a single document), but it is a 
fundamental concept.
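
Because the parser will not follow such references across files, the 
application has to resolve them itself. A minimal sketch with Python's 
standard xml.etree.ElementTree follows; the <library> wrapper element, the 
reduced GLU definition and the function name resolve are assumptions made 
for illustration, and both documents are given as strings rather than 
read from files:

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical library of residue definitions (the "separate file").
library_text = """
<library>
  <residue id="ALA">
    <peptidegroup><atom id="a1" element="N" name="N"/></peptidegroup>
    <sidechain><atom id="a2" element="C" name="CA"/></sidechain>
  </residue>
  <residue id="GLU">
    <peptidegroup><atom id="a1" element="N" name="N"/></peptidegroup>
  </residue>
</library>
"""

chain_text = """
<peptidechain>
  <residue idRef="ALA" sequencenumber="1"/>
  <residue idRef="GLU" sequencenumber="2"/>
</peptidechain>
"""

def resolve(chain, library):
    """Copy the children of each referenced definition into the reference."""
    definitions = {r.get("id"): r for r in library.iter("residue")}
    # list() so that the tree is not modified while it is being iterated.
    for ref in list(chain.iter("residue")):
        target = ref.get("idRef")
        if target is None:
            continue
        if target not in definitions:
            raise KeyError(f"unresolved idRef: {target}")
        for child in definitions[target]:
            ref.append(copy.deepcopy(child))
    return chain

resolved = resolve(ET.fromstring(chain_text), ET.fromstring(library_text))
atom_count = len(list(resolved.iter("atom")))
```

Note that the copied atom ids now repeat across residues, so a real tool 
would also have to rewrite them, for example by combining them with the 
sequence number.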

P.