[Molecularmechanics] Re: Re: Some general remarks.
Peter Murray-Rust
molecularmechanics@tddft.org
Mon, 24 Nov 2003 11:39:08 +0000
At 11:02 24/11/2003 +0100, Konrad Hinsen wrote:
>On Saturday 22 November 2003 04:00, Martin Field wrote:
>
> > At least for the specification and comparison of system composition, there
> > would seem to be a natural overlap between fsatom/MM and many areas
> > of chemi-informatics. In that case wouldn't it be more logical to use or
> > build upon some of these technologies? Examples that come to mind include
> > smiles, unique smiles and its extensions and IUPAC's IChi.
>
>All I know is smiles, which I think is not sufficient for our needs, and
>difficult to build on. Smiles represents only the elements and the bond
>structure in between them. There is no way to attach any other information,
>not even unique atom names to which one could refer from elsewhere. There
>isn't any support for intra- or supramolecular structure either. From what I
>could find about IChi, it seems to have similar aims as smiles, so I expect
>it to have similar shortcomings for specifying simulation systems.
SMILES is not powerful enough for FSATOM needs. A Unique SMILES doubles as
an identifier and a connection table. It is a useful approach but there are
several incompatible unique SMILES and they are all closed source.
IChI is designed as an Open approach to *unique chemical identifiers* and
IMO it does this excellently for molecules for which a unique connection
table can be written. I have just come back from the IChI/XML review
meeting at NIST and am very excited about its applicability. IChI is
expressed in XML syntax.
IChI's role for FSATOM would be precisely confined to the identification of
molecules and perhaps their lookup in a database. It would do very well for
molecules and fragments as it honours H-atom counts and charges. As an
example, we have indexed the whole of the NCI database (250K molecules)
with IChI and can retrieve a single precise molecule in 50 milliseconds.
FSATOM could index fragments with IChI in the same way.
IChI does not support conformations but CML is designed to support multiple
sets of properties for a molecule so (say) prolyl could be indexed by IChI
and different conformations held in CML.
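A rough sketch of that split, in Python (my choice for illustration only):
the identifier is treated as an opaque lookup key and the conformations are
kept as CML strings beside it. The function names, the placeholder key and
the stand-in CML snippets are all invented here; none of them come from
IChI or CML.

```python
from collections import defaultdict

# Identifier string -> list of conformation records for that molecule/fragment.
# The identifier is used purely as an opaque lookup key.
index = defaultdict(list)

def add_conformation(identifier, label, cml_text):
    """File one conformation (held as a CML string) under its identifier."""
    index[identifier].append({"label": label, "cml": cml_text})

def conformations(identifier):
    """Return all conformations stored for an identifier (empty if unknown)."""
    return list(index.get(identifier, []))

# "PROLYL-IDENTIFIER" is a placeholder, not a real IChI string, and the
# CML snippets are stand-ins rather than real coordinates.
add_conformation("PROLYL-IDENTIFIER", "conformer-1", "<molecule id='c1'/>")
add_conformation("PROLYL-IDENTIFIER", "conformer-2", "<molecule id='c2'/>")

print([c["label"] for c in conformations("PROLYL-IDENTIFIER")])
print(conformations("UNKNOWN"))
```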
>What I like about XML-based formats is their extensibility, it is always
>possible to add on more information in such a way that programs that don't
>need it can just ignore it. I think this is important in a field that is
>still rapidly evolving.
>
> > Fine. But all conventions can be abused or not adhered too. The problem
> > is to make it as easy as possible to use (e.g. automatic generation of
> > unknown fragments, of atom names, etc.).
XML provides strong support for validation - adhering to the rules. Thus
incorrect syntax and incorrect object structures can be precisely detected. In the
original design of XML we decided that "nearly correct" was not good enough
- something was either valid or invalid (the "Draconian" approach). Some
validating XML parsers will fail on the first invalidity - e.g. MSXML as
set up by default; others will list all errors.
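A small illustration of the Draconian rule, using Python's standard-library
parser (chosen only for illustration). Note the assumption: this parser is
non-validating, so it checks well-formedness only, not validity against a
DTD or schema. The element names are made up.

```python
import xml.etree.ElementTree as ET

def check(xml_text):
    """Return None if the text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        # The parser stops at the first error; there is no "nearly correct".
        return str(err)

good = "<molecule><atom id='a1'/></molecule>"
bad = "<molecule><atom id='a1'></molecule>"   # <atom> is never closed

print(check(good))
print(check(bad))
```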
>Right. We should not only produce a file format recommendation, but also
>reference implementations and tools for common transformations, to make all
>this as easy as possible.
This is essential IMO. We adopted it for XML and it turned out to have many
benefits. Some specs were too hard to implement so they were revised. Some
commercial suppliers (IBM, etc.) released their previously closed code as
open.
-- Eugen* Leitl wrote:
"I think SMILES can't be included in XML without armoring. "
It is worth deciding on XML character data at an early stage. XML uses the
ASCII character set - essentially chars 32-127 and whitespace (this is a
simplification but it is sufficient here). I appreciate there are many
FSATOMers whose language uses characters outside this set but I would
recommend against using other encodings at this stage as managing them is
a significant overhead. In which case the rules are:
- by default all space in XML documents *is significant*.
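- whitespace in attribute values is normalised: tabs and line breaks
become spaces, and for any attribute type other than CDATA (e.g. NMTOKEN)
leading and trailing spaces are also removed. So atom=" CA" and atom="CA"
are identical only if atom is declared with such a type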
- whitespace in content PCDATA is always significant. So <atom> CA</atom>
and <atom>CA</atom> are different
- whitespace between tags is significant unless there is a DTD or Schema.
So prettyprinting your XML actually adds significant spaces unless the
schema forbids it (i.e. no mixed content) (CML has no mixed content so
space between tags is ignorable).
- XML names (element names or IDs) must start with a letter and contain
only alphanumerics and :-_. Thus <1> and <C1'> are always illegal. If an
id is declared as type ID in the schema/DTD then constructs like id="C1'"
are invalid. (CML does not use type ID)
- any printable ASCII character can be used. '<' and '&' must always be
escaped to &lt; and &amp;. "' and > may, but usually need not, be escaped
to &quot;, &apos; and &gt;
- XML has only the 5 default entities above. So &nbsp; is not recognised
(but &#160; is). It is probably a poor idea to use any of these characters
as after XML processing they will be transformed to their raw
representation. Thus two passes of processing will normally fail unless the
XML serialization has an escaping mechanism.
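Some of these rules can be checked directly. The sketch below uses Python's
standard library (my choice; any XML toolkit should behave the same way for
these cases) and made-up element names. It shows whitespace in content
being kept, whitespace between tags turning up as text when there is no
DTD or schema, and what goes wrong when escaped text is unescaped twice.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Whitespace in element content is preserved by the parser.
padded = ET.fromstring("<atom> CA</atom>").text
plain = ET.fromstring("<atom>CA</atom>").text
print(repr(padded), repr(plain), padded == plain)

# Whitespace between tags shows up as text when no DTD/schema says otherwise.
pretty = ET.fromstring("<molecule>\n  <atom>CA</atom>\n</molecule>")
print(repr(pretty.text))

# '<' and '&' must be escaped on output; one pass of unescaping restores them.
raw = "a < b & c"
once = escape(raw)
print(once)
print(unescape(once) == raw)

# Two passes: a string that really contains the characters "&lt;" is
# escaped once for storage, but unescaping it twice changes its meaning.
literal = "&lt;"
stored = escape(literal)
twice = unescape(unescape(stored))
print(repr(stored), repr(twice))
```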
For FSATOM I would suggest that as far as possible whitespace and text be
normalised (as in XHTML) so that leading and trailing whitespace is removed
and internal whitespace is normalised to a single space character. The only
place where it might be important is the horrible PDB construct " CA" and
"CA" and I hope this is not used as it is very fragile.
So SMILES does not need "armoring", but some constructs in SMARTS and
extensions may use characters which need escaping.
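As a sketch of both points (Python standard library; the helper names and
the element names smiles and smarts are invented for illustration, not
taken from CML or any FSATOM proposal): normalise the text first, then
escape it on output. The SMILES string passes through unchanged, while the
SMARTS string has its '&' escaped and still round-trips.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def normalise(text):
    """Drop leading/trailing whitespace; collapse internal runs to one space."""
    return " ".join(text.split())

def as_element(tag, value):
    """Serialise one string as element content, escaping '<', '&' and '>'."""
    return "<{0}>{1}</{0}>".format(tag, escape(normalise(value)))

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin; contains no '<' or '&'
smarts = "[c,n&H1]"                # SMARTS uses '&' as a logical AND

for tag, value in (("smiles", smiles), ("smarts", smarts)):
    xml_text = as_element(tag, value)
    # Round trip: parsing must give back exactly the original string.
    assert ET.fromstring(xml_text).text == value
    print(xml_text)

# The fragile PDB-style padding disappears under normalisation.
print(repr(normalise(" CA")))
```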
Peter