dwangschematiek: SMILES

SMILES

2023-10-27

Idea: understanding SMILES deeply by parsing it.

C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O

looking into it: starting with the spec edition

evening

I have a pretty okay grasp of the SMILES that I have encountered before. That will not really get me anywhere when it comes to parsing these expressions, though. So now I want to try to get a good, rigorous understanding of the SMILES format.

Luckily I found a specification.

Reading it while listening to Plume by Loscil. What follows is me taking notes to use while implementing, right from the document. That means that parts of this cannot be considered my original writing, but the work of the OpenSMILES authors.

Apparently, there is also a proprietary specification from Daylight Chemical Information Systems, but this one’s open, ya’know. This OpenSMILES spec looks very good, though. I really like how it starts out sketching the context.

To us here, a molecule is a chemical graph of atoms (nodes), possibly connected by bonds (edges), which can be single, double, triple (even quadruple, I think? see $?). The spec acknowledges the great shortcomings of this view, and how that affects what can be represented by SMILES. This is very good prose as far as specifications go.

grammar

Now onto the formal grammar. Obvious distinction between syntax and semantics is made.

atoms
                  atom ->   bracket_atom
                          | aliphatic_organic
                          | aromatic_organic
                          | '*'

organic subset atoms
     aliphatic_organic -> one from ( B C N O S P F Cl Br I )
      aromatic_organic -> one from ( b c n o s p )

bracket atoms
          bracket_atom -> '[' isotope? symbol chiral? hcount? charge? class? ']'
                symbol -> element_symbols | aromatic_symbols | '*'
               isotope -> NUMBER
       element_symbols -> one from ( well you know what goes here... )
      aromatic_symbols -> one from ( b c n o p s se as )

chirality
                chiral -> one from (
                            @ @@ @TH1 @TH2 @AL1 @AL2 @SP1 @SP2 @SP3
                            (@TB1 thru @TB20) (@OH1 thru @OH30)
                            (@TB DIGIT DIGIT) (@OH DIGIT DIGIT)
                          )

hydrogens
                hcount -> 'H' | 'H' DIGIT

charges
                charge ->   '-'
                          | '-' DIGIT? DIGIT
                          | '+'
                          | '+' DIGIT? DIGIT
                          | '--' deprecated | '++' deprecated

atom class
                 class -> ':' NUMBER

bonds and chains
                  bond -> '-' | '=' | '#' | '$' | ':' | '/' | '\'
              ringbond -> bond? DIGIT | bond? '%' DIGIT DIGIT
         branched_atom -> atom ringbond* branch*
                branch -> '(' chain ')' | '(' bond chain ')' | '(' dot chain ')'
                 chain ->   branched_atom
                          | chain branched_atom
                          | chain bond branched_atom
                          | chain dot branched_atom
                   dot -> '.'

smiles strings
                smiles -> terminator | chain terminator
            terminator ->   SPACE
                          | TAB
                          | LINEFEED
                          | CARRIAGE_RETURN
                          | END_OF_STRING

(this, if you couldn’t tell, is my own keysmashing-interpretable-as-grammar copied from the spec’s)
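To check my reading of the bracket_atom production, here is a minimal sketch that splits one bracket atom into its raw fields with a regular expression. The names BRACKET_ATOM and split_bracket_atom are my own, and the element symbol is only approximated as "uppercase letter plus optional lowercase letter", so nonsense elements and things like [HH1] slip through. It is a sketch under those assumptions, not a validating parser.

```python
import re

# Hypothetical pattern mirroring:
#   '[' isotope? symbol chiral? hcount? charge? class? ']'
# Element symbols are NOT checked against the periodic table (assumption).
BRACKET_ATOM = re.compile(
    r"""\[
        (?P<isotope>\d+)?
        (?P<symbol>se|as|[bcnops]|[A-Z][a-z]?|\*)
        (?P<chiral>@@|@(?:TH|AL|SP|TB|OH)\d{1,2}|@)?
        (?P<hcount>H\d?)?
        (?P<charge>\+\+|--|[+-]\d{0,2})?
        (?::(?P<cls>\d+))?
    \]""",
    re.VERBOSE,
)


def split_bracket_atom(text):
    """Split a bracket atom such as '[13CH4]' into its raw string fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bracket atom: {text!r}")
    # Fields that are absent come back as None.
    return match.groupdict()


print(split_bracket_atom("[C@@H]"))
print(split_bracket_atom("[CH4:2]"))
```

Keeping the fields as raw strings at this stage leaves the interpretation (what H with no digit means, what ++ means) to later steps.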

atoms

Hihihihi, you can date the spec by its reference to the “114 valid atomic symbols”.

What does the * mean in the atom and symbol definitions?

The symbol * is also accepted as a valid atomic symbol, and represents a “wildcard” or unknown atom.

Note that you may encounter this in the following form: [*].

hydrogens

So… hydrogens. They are mostly implied by omission. When they are written inside of brackets, they must be followed by a number to indicate how many are specified, or when they are not followed by a number, one is implied. Apparently, H0 is also valid, meaning no hydrogen, i.e., [C] and [CH0] specify an identical molecule.

When they are specified without anything beyond this, they have “undefined isotope, no chirality, no other bound hydrogen, neutral charge, and an undefined atom class.”

A hydrogen itself cannot have a hydrogen count ([HH1] would be illegal, for example), but other atoms can, of course. But how do we represent a bond between two hydrogens, then? Well, we can say [H][H] to specify molecular hydrogen (H₂), for example. So in that case we need to write them as explicit atoms in square brackets.

Woah, there’s questions in this spec?!

Question: are more than 9 hydrogens possible? Should they be supported?

Initially I interpreted this as a didactic exercise left to the reader. But I think it is more of a “you will never realistically see this, but we do consider it” kind of expression.

charge

Moving on to charge. Ah, so the parser should accept ++ and -- (meaning +2 and -2) charge symbols, but programs oughtn’t output these.
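The charge production is small enough to sketch directly. This is my own hypothetical helper (parse_charge is not a name from the spec): it accepts a bare sign, a sign with one or two digits, and the deprecated doubled forms, and returns a signed integer.

```python
def parse_charge(text):
    """Turn a charge token such as '-', '+2' or '--' into a signed integer."""
    # Deprecated doubled forms: accept on input, never write on output.
    if text == "++":
        return 2
    if text == "--":
        return -2
    if not text or text[0] not in "+-":
        raise ValueError(f"not a charge: {text!r}")
    sign = 1 if text[0] == "+" else -1
    digits = text[1:]
    if digits == "":
        return sign  # a bare sign means a charge of one
    if len(digits) > 2 or any(ch not in "0123456789" for ch in digits):
        raise ValueError(f"not a charge: {text!r}")
    return sign * int(digits)


print(parse_charge("-"), parse_charge("+2"), parse_charge("--"))
```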

isotopes

With isotopes, the number goes before the atomic symbol. Leading zeroes are allowed. But note that an isotope number of 0 in fact indicates a 0-isotope (whut), and is not equivalent to leaving the isotope out, which means the naturally-occurring ratio for that element. Non-realistic or totally bogus isotopes are fair game to the parser. We should accept isotope values of at least three digits, ranging from 0 to 999. At least, huh? In our implementation, we might as well parse the number up until the element symbol.

organic subset

Okay, now onto the organic subset. This is a set of atoms that can be written as only their atomic symbol, without the square brackets, their H-count, and more of that fluff.

Here’s the organic subset: B, C, N, O, P, S, F, Cl, Br, I, *.

An atom specified this way has the following properties:

“implicit hydrogens” are added such that valence of the atom is in the lowest normal state for that element

the atom’s charge is zero

the atom has no isotopic specification

the atom has no chiral specification

The implicit hydrogen count I is determined as follows: S = sum of the bond orders of the bonds connected to the atom. If S = a known valence for the element or S > any known valence, then I = 0. Else, I = (next highest known valence) - S. For example, a carbon with a single double bond has S = 2, so I = 4 - 2 = 2.

normal valence

But what is this ‘normal valence’, then?

Element    Valence
-------    -------
B          3
C          4
N          3 or 5
O          2
P          3 or 5
S          2, 4 or 6
halogens   1
*          unspecified

Well, okay. Makes sense. I know these by heart except for those of S, I guess.
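The valence table and the implicit hydrogen rule fit together in a few lines. This is a minimal sketch with names of my own choosing (NORMAL_VALENCES, implicit_hydrogens). It only covers aliphatic organic-subset atoms written outside brackets, and it assumes the bond order sum has already been computed as an integer. Aromatic atoms are deliberately left out.

```python
# Normal valences from the table above, each tuple in ascending order.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Implicit hydrogen count for an aliphatic organic-subset atom."""
    if symbol == "*":
        return 0  # the wildcard outside brackets gets no hydrogens
    for valence in NORMAL_VALENCES[symbol]:
        # The first valence that is not exceeded is the "next highest" one;
        # when the sum equals it, the difference is simply zero.
        if bond_order_sum <= valence:
            return valence - bond_order_sum
    return 0  # the sum exceeds every known valence


print(implicit_hydrogens("C", 0))  # lone C, i.e. methane
print(implicit_hydrogens("N", 4))  # pushed up to valence 5
```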

There’s some more atom properties (viz, chirality and ring-closures), but we’ll get to those later, apparently.

But that wildcard symbol (*). What’s up with that gal? It represents an atom of which we do not know the atomic number, or for which it is left unspecified. It can even have an isotope, chirality, hydrogen count, and charge—provided it is in square brackets. When it occurs outside of square brackets, we just assume no isotope, zero mass, unspecified chirality, zero hydrogen count, and zero charge. Most neutral thing around. This means that when it is outside of brackets, we can assign it a valence only based on its bonds (and that it’s satisfied, i.e., does not want any more hydrogens around it). If it is inside of brackets, we can assign the valence based on the bonds and on the hydrogen count and perhaps charge. What about the case that it is part of a potentially aromatic ring? We can consider the ring aromatic, if the * could be directly replaced by an atom that would enable the ring to be aromatic.

atom class

Atomic class struggle, now. The “atom class” is some integer that has no chemical meaning. The meaning assigned is application-specific. Multiple atoms can have the same atom class. You write it like this: [CH4:2], where the :2 assigns the class of 2 to the C atom (?unclear whether it is to the C, but it would make sense to me). I really did not know of this before. Never encountered it. Interesting! If the class is not set, it is zero, and it can have any number of digits, I think, and can have leading zeroes.

bonds

I’ve been going through this spec quite a stretch now. Really bonding with this document actually. Wow wait, bonds are actually the next topic!? Atoms that are right next to each other are assumed to be joined by a single or an aromatic bond. Double, triple, quadruple bonds are represented by =, #, $. Single bonds can be explicitly expressed as - but it’s rare. :, \, and / are also bonds (for aromaticity, and cis/trans positions, I believe), but we’ll get to those later.

rhenium/rhodium mistake nitpick

Hey I think I found a mistake in the spec. Here, in an example, they give the SMILES line for octachlorodirhenate, but use the element symbol Rh for rhenium, even though the appropriate symbol for rhenium would be Re. Rh refers to rhodium. Gosh such a nitpick.

[Rh-](Cl)(Cl)(Cl)(Cl)$[Rh-](Cl)(Cl)(Cl)Cl — octachlorodirhenate (III)

God the authors of this spec are probably not exactly waiting for some random to come and nitpick their work, so we’ll just leave this flowing in the wind, I guess. There to be found by other explorers of boring worlds. Also, octachlorodirhenate is v cool.

branching

Well, before I branch off too fa— wait, once again we have suddenly landed on the next topic. This time: branches. Using normal ( parentheses ) we can specify a branched part of the molecule. The branch is connected to the atom before it, and the part after the branch parenthetical is also connected to that same atom. Three-way branches (or n-way, for that matter), are specified by adding more parentheticals directly after the first one.

Found another nit in the spec in this branches section: in one of the example tables, there’s a “pic here” without said pic present. Guess we’ll have to dream up what 2-propyl-3-isopropyl-1-propanol looks like ourselves then… smh.

continued on 2023-10-28, afternoon

rings

~~Molly Ringwald.~~ No, I meant rings. This is interesting because we somehow need to cyclize our thus far acyclic graph. The first occurrence of a ring-closure number rnum (e.g., 1 in C1CCC...) creates an open bond for the atom that comes right before the rnum. When the same rnum is found later on, the bond is established between the two atoms.

Cool, but how do we represent a double bond as the ring-closing bond? In that case, the preferred form is to have the double bond marker = sit between the atom and the first rnum, while at the second rnum the atom is simply adjacent to the number, without bond marker. For example: C=1CCCCC1. It is also valid to mark both ends with a double bond (e.g., C=1CCCCC=1), or to mark only the closing end, as in C1CCCCC=1. But the two ends must agree: each either carries the same bond symbol or leaves it implied. No other bond order can enter the mix, for that creates ambiguity. The following case is invalid, for example: C-1CCCCC=1.

Obviously, the ring closures must be matched. You cannot declare an rnum somewhere and not close it. But once closed, the number becomes available again! In other words: C1CCCCC1C1CCCCC1 is a valid SMILES.

It is illegal to bind an atom to itself through something horrid like C11. Moreover, it is illegal to create two (in this case implied single) bonds between the same pair of atoms through two rnums, like C12CCCCC12. Doubling up on a bond that the chain already implies is also illegal: C12C2CCC1. Here, the first two carbons are already implied to be bonded. Adding the 2–2 bond using rnums violates da rules.
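To make sure I actually understand the ring-closure rules, here is a rough sketch that walks a bracket-free SMILES and pairs up the rnums. Everything here (TOKEN, BONDS, ring_closures) is hypothetical scaffolding of mine, not from the spec. It ignores bracket atoms entirely and does no chemistry; it only tracks which atom indices get bonded, so that the illegal cases above raise an error.

```python
import re

# Tokens for a bracket-free subset: organic atoms, ring digits, bonds, branches, dot.
TOKEN = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnosp]|\*|%\d\d|\d|[-=#$:/\\]|[().]")
BONDS = set("-=#$:/\\")


def ring_closures(smiles):
    """Return ring bonds as (atom_a, atom_b, bond symbol or None)."""
    open_rings = {}   # rnum -> (atom index, bond symbol or None)
    bonded = set()    # frozenset({a, b}) for every bond made so far
    closures = []
    stack = []        # atoms to return to when a branch closes
    prev = None       # atom that the next atom or ring digit attaches to
    pending = None    # bond symbol waiting for its atom or ring digit
    atom_count = 0
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unsupported input at position {pos}")
        tok, pos = m.group(), m.end()
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            prev = stack.pop()
        elif tok == ".":
            prev = None  # no bond to whatever comes next
        elif tok in BONDS:
            pending = tok
        elif tok[0] in "%0123456789":
            if prev is None:
                raise ValueError("ring-closure digit without an atom")
            rnum = int(tok.lstrip("%"))
            if rnum not in open_rings:
                open_rings[rnum] = (prev, pending)  # first occurrence: open
            else:
                other, other_bond = open_rings.pop(rnum)  # rnum is free again
                if other == prev:
                    raise ValueError("atom bonded to itself")
                if pending and other_bond and pending != other_bond:
                    raise ValueError("conflicting ring-bond symbols")
                pair = frozenset((other, prev))
                if pair in bonded:
                    raise ValueError("second bond between the same atoms")
                bonded.add(pair)
                closures.append((other, prev, pending or other_bond))
            pending = None
        else:  # an atom
            index = atom_count
            atom_count += 1
            if prev is not None:
                bonded.add(frozenset((prev, index)))
            prev = index
            pending = None
    if open_rings:
        raise ValueError("unclosed ring-closure number")
    return closures


print(ring_closures("C=1CCCCC1"))
print(ring_closures("C1CCCCC1C1CCCCC1"))
```

Popping the rnum from the dictionary on closure is what makes the number reusable, which is why the second example resolves to two separate rings.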

aromaticity

The concept as used in SMILES does not necessarily imply any physicochemical properties of the bonds. It is just information about the bond lengths that cannot be captured by the single-bond double-bond alternating pattern. There is a uniformity to these ‘aromatic’ bonds that is meant. The Kekule alternating single/double bond style can be shown with alternating (implied) singles and doubles. Lowercase letters can also be used to show aromatic bonds. No bond symbols are needed in that case.

A lowercase aromatic symbol is defined as an atom in the sp² configuration in an aromatic or anti-aromatic ring system.

To my surprise, arsenic (As) and selenium (Se) can also participate in aromatic systems. They are denoted as and se in the aromatic small letters set. Actually, their aromaticity is not that surprising at all, since they are in the same periodic group as nitrogen and oxygen, respectively. The more you know.

Kekule is always acceptable as input, but for output, the aromatic form is preferred.

Skipping the extended Hückel’s rule.

In an aromatic system, all of the aromatic atoms must be sp² hybridized, and the number of π electrons must meet Hückel’s 4n+2 criterion.

The parser must note the aromatic designation of each atom on input, and when parsing is complete, the program must verify that the electrons can be assigned without violating the valence rules as would be consistent with their sp² markings, their specified or implied hydrogens, external bonds, and charges.

The aromatic-bond symbol : can signify an aromatic bond between aromatic atoms, but it is never necessary. A bond between two aromatic atoms is assumed to be aromatic, but this can be overridden by an explicit single bond -. This is needed, for example, in biphenyl, as the bond between the two phenyl rings is not actually aromatic, but the cyclic bonds are.
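The default-bond rule for adjacent atoms is tiny, but writing it down helps. A minimal sketch, assuming atoms are passed in as their written symbols and that lowercase means aromatic; implied_bond and AROMATIC are names I made up.

```python
AROMATIC = {"b", "c", "n", "o", "p", "s", "se", "as"}


def implied_bond(left, right, explicit=None):
    """Bond symbol between two adjacent atoms, honouring an explicit one."""
    if explicit is not None:
        return explicit  # e.g. the '-' between the two rings of biphenyl
    if left in AROMATIC and right in AROMATIC:
        return ":"       # aromatic atoms next to each other
    return "-"           # everything else defaults to a single bond


print(implied_bond("c", "c"))       # inside a benzene ring
print(implied_bond("c", "c", "-"))  # c1ccccc1-c1ccccc1, the ring-ring bond
```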

hydrogen

Now some more on hydrogen. They can be represented in three ways: (1) as implicit hydrogens, where H-count is determined from the normal valence; (2) as an atom property, by placing them after an atom, where the H-count is specified for the heavy atom; (3) explicitly individually in square brackets, and as such, they are represented as normal atoms. The first case is only possible with the organic subset, and—the inverse—any atom in square brackets must have the hydrogens explicitly represented, as a hydrogen count ([CH4]), or as normal atoms ([H]C([H])([H])[H]). Combinations of explicit and H-count hydrogens are allowed, and in that case the atom’s total hydrogen count is the sum of the atomic h-count property and the explicitly attached hydrogens. Any of the following cases should be an explicitly represented hydrogen:

charge ([H+])
connection between hydrogens ([H][H])
hydrogens with more than one bond (weird but apparently called bridging hydrogens)
deuterium and tritium ([2H], [3H])

disconnected structures

Cool, so some SMILES actually represent disconnected structures, such as sodium chloride. These can be represented as [Na+].[Cl-]. The dot can be placed almost anywhere where a bond is also allowed. This leads to some cursed SMILES, but it’s all legal I guess. But a dot between an atom and a ring-closure digit is not allowed.

Moving on to more cursed shit, ethane can be represented as CC, but also as C1.C1. But this is pretty nice, though, for creating molecules from fragment libraries through simple string concatenation operations.

stereochemistry

So, looking at the stereochemistry section… this looks like a lot of details that I may be able to neglect for a first implementation. I think I’m going to quickly skim over this.

SMILES apparently cannot represent some kinds of stereochemistry: gross left or right handedness (like helices), mechanical interferences, gross conformational stereochemistry such as protein folding. Makes sense.

tetrahedral

To see with which order a tetrahedral chiral center is built, we look along the bond from the preceding atom to the chiral atom. If that chiral atom has one trailing @, we read the order as anticlockwise. With @@, we read it as clockwise. (See the direction of the little ‘tail’ of the @ symbol.) That means that these two SMILES are equivalent: N[C@](Br)(O)C and N[C@@](Br)(C)O
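Why are those two strings equivalent? Swapping two neighbours flips the handedness, and so does swapping @ for @@, so the two changes cancel. A minimal sketch of that bookkeeping, assuming each neighbour has a unique label and ignoring implicit hydrogens and ring-closure ordering (both function names are mine):

```python
def permutation_parity(reference, other):
    """Return 0 if `other` is an even permutation of `reference`, 1 if odd."""
    if sorted(reference) != sorted(other) or len(set(reference)) != len(reference):
        raise ValueError("need the same unique labels in both orders")
    position = {label: i for i, label in enumerate(reference)}
    perm = [position[label] for label in other]
    seen = [False] * len(perm)
    parity = 0
    for start in range(len(perm)):
        length = 0
        j = start
        while not seen[j]:  # walk one cycle of the permutation
            seen[j] = True
            j = perm[j]
            length += 1
        if length:
            parity ^= (length - 1) & 1  # a cycle of length k is k-1 swaps
    return parity


def same_tetrahedral(neighbours_a, mark_a, neighbours_b, mark_b):
    """True if two neighbour orders with @/@@ marks describe the same centre."""
    odd = permutation_parity(neighbours_a, neighbours_b) == 1
    return (mark_a == mark_b) != odd


# N[C@](Br)(O)C versus N[C@@](Br)(C)O
print(same_tetrahedral(["N", "Br", "O", "C"], "@", ["N", "Br", "C", "O"], "@@"))
```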

cis/trans

We can denote the specific cis/trans configuration around double bonds with / and \. From these symbols, we can derive the following visual understanding: they can be seen as bonds that point above or below the alkene bond. So, F/C=C/F is trans-difluoroethene, and F\C=C/F is cis-difluoroethene.

Note that these are equivalent, which may be surprising: F/C=C/F and C(\F)=C/F both represent trans-difluoroethene. The interpretation of the up or down direction is with respect to the related carbon atom, not to the position of the double bond.

Only one of the two groups attached to the double bonded atom needs to be marked, since the upness or downness of the unmarked group can be inferred from the marked group. Also, this system applies to odd-numbered long boi allenes such as F/C=C=C=C/F (trans-difluorobutatriene).
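The "relative to the carbon" reading can be written down as a small rule: a/b puts b above a, so a substituent written before its carbon ends up on the opposite side from what the slash suggests at first glance. A minimal sketch, assuming one marked substituent on each end of the double bond (substituent_side and is_cis are hypothetical names):

```python
def substituent_side(symbol, substituent_first):
    """+1 if the substituent sits above its double-bond atom, -1 if below.

    `symbol` is '/' or '\\'. `substituent_first` is True when the substituent
    is written before the double-bond atom (as in F/C=) and False when it is
    written after it (as in =C/F or C(\\F)=).
    """
    second_is_above = symbol == "/"  # in a/b, b is above a
    if substituent_first:
        # The substituent is `a`, the carbon is `b`: flip the direction.
        return -1 if second_is_above else 1
    return 1 if second_is_above else -1


def is_cis(left_symbol, left_first, right_symbol, right_first):
    """True if the two marked substituents are on the same side."""
    return (substituent_side(left_symbol, left_first)
            == substituent_side(right_symbol, right_first))


print(is_cis("/", True, "/", False))    # F/C=C/F
print(is_cis("\\", True, "/", False))   # F\C=C/F
print(is_cis("\\", False, "/", False))  # C(\F)=C/F
```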

tetrahedral interpretation of even-numbered allenes

For even-numbered allenes, the tetrahedral rules apply, but the ‘neighbours’ are the groups at the two ends of the allene chain. To determine which the right (anti)clockwise specification is, we conceptually collapse the allene to a single tetrahedral center.

ethane/ethene: another typo nitpick

By the way, found another bug in the spec: these last two molecules are called ‘{trans,cis}-difluoroethane’ there, which is incorrect: difluoroethane cannot be cis or trans, since it has a single bond. What they meant was difluoroethene, which has a rigid double bond, and can therefore display cis/trans configurations.

skipping a bunch of grim stereochemistry stuff

For now, I’m going to skip over the square planar centers (@SP), trigonal bipyramidal centers (@TB), and octahedral centers (@OH). I like a challenge, but I’m just not ready for that one right now. Hey, apparently “[v]ery few SMILES systems actually implement the rules for SP, TB or OH chirality.” Cool so at least I’m not alone in this I guess.

partial stereochemistry

Partial stereochemistry can be denoted by specifying the stereochemistry of some atoms, but not of others.

termination

Ending the parsing spec on the topic of termination: a SMILES string is terminated by the end of the string or by any whitespace. Other data may follow the SMILES string after the whitespace, and parsers should ignore this data.
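In practice this means a line from a data file can be cut at the first terminator before parsing starts. A minimal sketch (split_smiles is my own name; it treats only space, tab, linefeed and carriage return as terminators, matching the grammar above):

```python
def split_smiles(line):
    """Split a line into (smiles, trailing data) at the first terminator."""
    for i, ch in enumerate(line):
        if ch in " \t\n\r":
            return line[:i], line[i + 1:]
    return line, ""  # end of string is a terminator too


print(split_smiles("CCO ethanol"))
print(split_smiles("C1CCCCC1"))
```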

done reading for now

Owwkay so now I probably know enough about the format itself to parse it. I actually gotta start at some point, don’t I?

examples masterdoc

here I may list examples that I encounter in some standard form that will allow me to eventually write unit tests based on these? for the start at least to check my understanding and the basic steps of the parser.

I found that on the OpenSMILES GitHub repo there’s some files with example SMILES data. May turn out to be a nice source.

open questions

What are the ‘se’ and ‘as’ aromatic symbols?
What is atom class?
What’s up with that weird chirality 1 … 30 stuff?
What atom does the atom class refer to in something like [CH4:2]?

Also, note to self: probably mark the Element enum as non-exhaustive, since more elements may be discovered. (Not that there’s much chemistry to do with them, in all fairness.)