[Molecularmechanics] Re: Re: Some general remarks.

Peter Murray-Rust molecularmechanics@tddft.org
Mon, 24 Nov 2003 11:39:08 +0000


At 11:02 24/11/2003 +0100, Konrad Hinsen wrote:
>On Saturday 22 November 2003 04:00, Martin Field wrote:
>
> > At least for the specification and comparison of system composition, there
> > would seem to be a natural overlap between fsatom/MM and many areas
> > of chemi-informatics. In that case wouldn't it be more logical to use or
> > build upon some of these technologies? Examples that come to mind include
> > smiles, unique smiles and its extensions and IUPAC's IChi.
>
>All I know is smiles, which I think is not sufficient for our needs, and
>difficult to build on. Smiles represents only the elements and the bond
>structure in between them. There is no way to attach any other information,
>not even unique atom names to which one could refer from elsewhere. There
>isn't any support for intra- or supramolecular structure either. From what I
>could find about IChi, it seems to have similar aims as smiles, so I expect
>it to have similar shortcomings for specifying simulation systems.

SMILES is not powerful enough for FSATOM needs. A Unique SMILES doubles as 
an identifier and a connection table. It is a useful approach but there are 
several incompatible unique SMILES and they are all closed source.

IChI is designed as an Open approach to *unique chemical identifiers* and 
IMO it does this excellently for molecules for which a unique connection 
table can be written. I have just come back from the IChI/XML review 
meeting at NIST and am very excited about its applicability. IChI is 
expressed in XML syntax.

IChI's role for FSATOM would be precisely confined to the identification of 
molecules and perhaps their lookup in a database. It would do very well for 
molecules and fragments as it honours H-atom counts and charges. As an 
example, we have indexed the whole of the NCI database (250K molecules) with 
IChI and can retrieve a single precise molecule in 50 milliseconds. FSATOM 
could index fragments with IChI in the same way.

IChI does not support conformations but CML is designed to support multiple 
sets of properties for a molecule so (say) prolyl could be indexed by IChI 
and different conformations held in CML.
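
A minimal sketch of that split, in Python. Everything in it is hypothetical: 
the class name, the identifier string "ID-PROLYL" (an opaque placeholder, 
not a real IChI) and the conformation records (plain dicts standing in for 
CML). It only shows the shape of the idea: one identifier, many conformations.

```python
from collections import defaultdict


class FragmentIndex:
    """Maps an opaque identifier string to a list of conformation records."""

    def __init__(self):
        self._by_id = defaultdict(list)

    def add(self, identifier, conformation):
        # Several conformations may share one identifier.
        self._by_id[identifier].append(conformation)

    def lookup(self, identifier):
        # Return a copy so callers cannot modify the index by accident.
        return list(self._by_id.get(identifier, []))


index = FragmentIndex()
# "ID-PROLYL" is a placeholder, NOT a real IChI string.
index.add("ID-PROLYL", {"label": "conformation-1"})
index.add("ID-PROLYL", {"label": "conformation-2"})

for conf in index.lookup("ID-PROLYL"):
    print(conf["label"])
```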

>What I like about XML-based formats is their extensibility, it is always
>possible to add on more information in such a way that programs that don't
>need it can just ignore it. I think this is important in a field that is
>still rapidly evolving.
>
> > Fine. But all conventions can be abused or not adhered to. The problem
> > is to make it as easy as possible to use (e.g. automatic generation of
> > unknown fragments, of atom names, etc.).

XML provides strong support for validation - adhering to the rules. Thus 
incorrect syntax and incorrect object structures can be detected precisely. In the 
original design of XML we decided that "nearly correct" was not good enough 
- something was either valid or invalid (the "Draconian" approach).  Some 
validating XML parsers will fail on the first invalidity - e.g. MSXML as 
set up by default; others will list all errors.
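
To illustrate the all-or-nothing behaviour, here is a small Python sketch 
using the standard library parser. Note the assumption: ElementTree is a 
non-validating parser, so this only shows rejection of syntax errors 
(well-formedness), not validation against a DTD or Schema. The element and 
attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses, else (False, error message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser stops at the first error; there is no "nearly correct".
        return False, str(err)
    return True, None


good = "<atom id='a1'>CA</atom>"
bad = "<atom id='a1'>CA</atm>"  # end tag does not match start tag

print(is_well_formed(good))
print(is_well_formed(bad))
```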

>Right. We should not only produce a file format recommendation, but also
>reference implementations and tools for common transformations, to make all
>this as easy as possible.

This is essential IMO. We adopted it for XML and it turned out to have many 
benefits. Some specs were too hard to implement so they were revised. Some 
commercial suppliers (IBM, etc.) released their previously closed code as 
open.

-- Eugen* Leitl wrote:

"I think SMILES can't be included in XML without armoring. "

It is worth deciding on XML character data at an early stage. XML itself is 
defined over Unicode, but I suggest we restrict ourselves to the ASCII 
subset - essentially chars 32-126 and whitespace (this is a simplification 
but it is sufficient here). I appreciate there are many FSATOMers whose 
language uses characters outside this set but I would recommend against 
using other encodings at this stage as managing them is a significant 
overhead. In which case the rules are:

- by default all space in XML documents *is significant*.
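- whitespace in attribute values is always normalised, but how far depends 
on the declared type. For CDATA (the default) tabs and line ends become 
spaces and nothing is trimmed; for tokenized types (NMTOKEN, ID, etc.) 
leading and trailing space is also removed. So atom=" CA" and atom="CA" are 
identical only if atom is declared as a tokenized type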
- whitespace in content PCDATA is always significant. So <atom> CA</atom> 
and <atom>CA</atom> are different
- whitespace between tags is significant unless a DTD or Schema declares 
that the enclosing element has element-only content. So prettyprinting your 
XML actually adds significant spaces unless the schema forbids it (i.e. no 
mixed content). CML has no mixed content so space between tags is ignorable.
- XML names (element names or IDs) must start with a letter (or '_' or ':') 
and contain only alphanumerics and .:-_ Thus <1> and <C1'> are always illegal. If an 
id is declared as type ID in the schema/DTD then constructs like id="C1'" 
are invalid. (CML does not use type ID)
- any printable ASCII character can be used. '<' and '&' must always be 
escaped to &lt; and &amp;. The characters ", ' and > may, but usually need 
not, be escaped to &quot;, &apos; and &gt; (a quote must be escaped inside 
an attribute value delimited by that same quote)
- XML has only the 5 default entities above. So &nbsp; is not recognised 
(but &#160; is). It is probably a poor idea to use any of these characters 
as after XML processing they will be transformed to their raw 
representation. Thus two passes of processing will normally fail unless the 
XML serialization has an escaping mechanism.
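
Some of the rules above can be checked with a short Python sketch using the 
standard library parser. This assumes no DTD or Schema is present, so the 
parser cannot treat any whitespace as ignorable. The element names are just 
for illustration.

```python
import xml.etree.ElementTree as ET

# Whitespace in content (PCDATA) is kept exactly as written.
padded = ET.fromstring("<atom> CA</atom>")
plain = ET.fromstring("<atom>CA</atom>")
print(repr(padded.text), repr(plain.text))

# Prettyprinting adds text between tags when there is no DTD/Schema.
pretty = ET.fromstring("<molecule>\n  <atom>CA</atom>\n</molecule>")
compact = ET.fromstring("<molecule><atom>CA</atom></molecule>")
print(repr(pretty.text), repr(compact.text))

# Only the five predefined entities exist; numeric references always work.
nbsp = ET.fromstring("<atom>&#160;</atom>").text
try:
    ET.fromstring("<atom>&nbsp;</atom>")
    nbsp_entity_ok = True
except ET.ParseError:
    nbsp_entity_ok = False
print(repr(nbsp), nbsp_entity_ok)

# Parsing turns &lt; into a raw '<'; the serializer must escape it again.
elem = ET.fromstring("<s>a &lt; b</s>")
print(repr(elem.text), ET.tostring(elem, encoding="unicode"))
```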

For FSATOM I would suggest that as far as possible whitespace and text be 
normalised (as in XHTML) so that leading and trailing whitespace is removed 
and internal whitespace is normalised to a single space character. The only 
place where it might be important is the horrible PDB construct " CA" and 
"CA" and I hope this is not used as it is very fragile.

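The normalisation suggested above could be sketched in Python as follows. 
The function name is hypothetical, and the sketch assumes that "whitespace" 
means the four XML whitespace characters (space, tab, CR, LF) only.

```python
import re

# Runs of XML whitespace: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalise_ws(text):
    """Collapse internal whitespace to one space and trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


print(repr(normalise_ws(" CA")))
print(repr(normalise_ws("  N  CA\n C ")))
```

With this rule the PDB-style " CA" and "CA" become the same string, so any 
meaning carried by the leading space is lost.
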
So SMILES does not need "armoring", but some constructs in SMARTS and 
extensions may use characters which need escaping.
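
As a rough check of this point, a Python sketch with the standard library 
escaping helper. The SMILES string is benzene; the SMARTS string is only an 
illustrative pattern using the '&' operator, and the element name 
<pattern> is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

smiles = "C1=CC=CC=C1"  # benzene: no '<' or '&', so nothing to escape
smarts = "[c,n&H1]"     # '&' must be escaped before it goes into XML

escaped_smiles = escape(smiles)
escaped_smarts = escape(smarts)
print(escaped_smiles)
print(escaped_smarts)

# Round trip: the parser gives back the original SMARTS string.
doc = "<pattern>%s</pattern>" % escaped_smarts
recovered = ET.fromstring(doc).text
print(recovered == smarts)
```

Note that escape() also rewrites '>' as &gt;, which is permitted but not 
required.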

Peter