Jmol 3D-SEARCH bioSMILES/bioSMARTS

Robert M. Hanson
Department of Chemistry
St. Olaf College
5/19/2010

This document describes a specification for an extension of SMILES and SMARTS for use in 3D molecular atom search and selection as well as biomolecular sequence and cross-link searching. This specification is implemented in Jmol 12.0. It is really a set of two specifications, Jmol SMILES/bioSMILES and Jmol 3D-SMARTS/bioSMARTS, each detailed below.

The org.jmol.smiles package provides extensive functionality for selecting atoms within a three-dimensional model based on SMILES and SMARTS strings. This package may be used independently of Jmol -- see JmolSmilesApplet.java and JmeToJmol.htm.

Besides a presentation of general considerations, a detailed specification for syntax is given, and the term "aromatic" is defined.

General Considerations

format

bioSMILES/bioSMARTS aromaticity

Comparison to Daylight SMILES

All single-component aspects of Daylight SMILES are implemented, including aromaticity and atom- and bond-based stereochemistry ("chirality").

Comparison to Daylight SMARTS

primitives Jmol atom selection implicit hydrogen count

Detailed Jmol SMILES/bioSMILES Specification

 
      # note: prior to parsing, all white space is removed
       
   [smilesDef] == [preface] [smiles]
   [preface] == { [flagDefs] | NULL } 
   [flagDefs] == { [flagDef] | [flagDef] [flagDefs] }
   [flagDef] == "/" [processingFlags] "/"
   [processingFlags] == { [processingFlag] | [processingFlag] [processingFlags] }
   [processingFlag] == { "noAromatic" | "noStereo" } (case-insensitive)
      # note: the noAromatic flag indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo flag turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [smiles] == { [entity] | [entity] "." [entity] }
   [entity] == { [bioSequence] | [molecularSequence] }
   [molecularSequence] == [node] [connections]
   [node] == { [atomExpression] | [connectionPointer] }

   [atomExpression] == { [unbracketedAtomType] 
                              | "[" [bracketedExpression] "]" }
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      
   [atomType] == { [validElementSymbol] | [aromaticType] }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == [atomPrimitives]
   
   [atomPrimitives] == { [atom] | [atom] [atomModifiers] }
   [atom] == { [isotope] [atomType] | [atomType] } 
   [isotope] == [digits]
       # note -- isotope mass must come before the element symbol. 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [atomModifiers] == { [atomModifier] | [atomModifier] [atomModifiers] }
   [atomModifier] == { [charge] | [stereochemistry] | [H_Prop] }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
   
   [connectionPointer] == { "%" [digit][digit] | [digit] | "%(" [digits] ")"}
      # note: all connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bond
      #       for the second occurrence
      # note: Jmol bioSMARTS extends the possible number of rings to > 100 by 
      #       allowing %(n)

   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bond] [node] } [connections]
   [branch] == { "(" { [smiles] | [bond] [smiles] } ")" | "()" }
      # note: empty parentheses "()" are ignored in SMILES and bioSMILES
   [bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | NULL }
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]. However, "." can be used to clarify a
      #       structure that has "ring" bond notation:
      #       CC1CCC.C1CC   is a valid structure.
      # note: bioSEQUENCE uses ":" to indicate "cross-linked", which is the default for branches

   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "#" [atomicNumber] "]" 
                 | [bioResidueCode] } 
   [atomicNumber] == [digits]
   [bioResidueName] == { "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == {"C" | "CA" | "N" ... } (case-insensitive)
   [bioResidueCode] == { "A" | "R" | "G" ... } (case-sensitive)
      # note: In a BioSEQUENCE, residues are designated using standard 1-letter-code group names
      #       or bracketed residues [xxx] with optional atoms specified: [ARG], [CYS.SG]. 
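
The preface rules above (white space removal followed by zero or more "/.../" flag definitions) can be sketched in a few lines of Python. This is only an illustration of the grammar, not Jmol's parser; the function name split_preface and the lower-cased flag names it returns are assumptions made for this example.

```python
KNOWN_FLAGS = ("noaromatic", "nostereo")  # flags are case-insensitive


def split_preface(text):
    """Return (flags, remainder) for a string with an optional flag preface.

    Hypothetical helper: it follows the [preface] grammar above only.
    """
    s = "".join(text.split())  # all white space is removed prior to parsing
    flags = set()
    while s.startswith("/"):
        end = s.find("/", 1)
        if end < 0:
            raise ValueError("flag definition is missing its closing '/'")
        body = s[1:end].lower()
        # One flagDef may hold several flags written back to back.
        while body:
            for name in KNOWN_FLAGS:
                if body.startswith(name):
                    flags.add(name)
                    body = body[len(name):]
                    break
            else:
                raise ValueError("unknown processing flag in " + repr(s[:end + 1]))
        s = s[end + 1:]
    return flags, s
```

Because white space is stripped first, "/noAromatic noStereo/" and "/noAromatic//noStereo/" yield the same pair of flags, as the note in the grammar states.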

Detailed Jmol 3D-SMARTS/bioSMARTS Specification

 

 ######## GENERAL ########

      # note: prior to parsing, all white space is removed

   [smartDef] == [preface] [smartsSet]
   [preface] == { [flagDefs] [variableDefs] | [variableDefs] | NULL } 
   [flagDefs] == { [flagDef] | [flagDef] [flagDefs] }
   [flagDef] == "/" [processingFlags] "/"
   [processingFlags] == { [processingFlag] | [processingFlag] [processingFlags] }
   [processingFlag] == { "noAromatic" | "noStereo" } (case-insensitive)
      # note: the noAromatic flag indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo flag turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [variableDefs] == { [variableDef] | [variableDef] [variableDefs] }
   [variableDef] ==  "$" [label] "=" "\"" [smarts] "\"" [comments] ";"
   [label] == [any characters other than "=" and "$", and not starting with "("]
   [comments] == [any characters other than ";"]
      # note: Variable definitions must be parsed first. 
      #       After that, all variable references [$XXXX] are replaced by their definitions.
      
   [smartsSet] == { [smarts] | [smarts] "||" [smartsSet] }
      # note: Jmol adds the "or" operation "||", for example: "C=O || C=N"
      #       which, in this case, could also be written as "C=[O,N]"
      #       Jmol preprocesses these sets, evaluates them independently, and then
      #       combines them.
      
   [smarts] == { [node3D] [connections] | [bioSequence] } 
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bondExpression] [node3D] } [connections]
   [branch] == { "(" { [smarts] | [bondExpression] [smarts] } ")" | "()" }
      # note: Default bonding for a branch is single for SMARTS or cross-linked (:) for bioSEQUENCE
      # note: "()" is ignored in SMARTS and indicates "not cross-linked" in bioSEQUENCE
   
 ######## ATOMS ########
    
   [node3D] == { [atomExpression] | [atomExpression] "(." [measure] ")" | [connectionPointer] }
   [atomExpression] == { [unbracketedAtomType]
                              | [bracketedExpression] 
                              | [multipleExpression]
                              | [nestedExpression] }
   
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      # note: in a bioSEQUENCE, all atom types are 1-letter code group names
      
   [atomType] == { [validElementSymbol] | "A" | [aromaticType] | "*" }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { "a" | [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == "[" { [atomOrSet] | [atomOrSet] ";" [atomAndSet] } "]" 
   
   [atomOrSet] == { [atomAndSet] | [atomAndSet] "," [atomAndSet] }
   [atomAndSet] == { [atomPrimitives] | [atomPrimitives] "&" [atomAndSet]
                              | "!" [atomPrimitive] 
                              | "!" [atomPrimitive] "&" [atomAndSet] }
                              
 ######## ATOM PRIMITIVES ########

   [atomPrimitives] == { [atomPrimitive] | [atomPrimitive] [atomPrimitives] }
       # note -- if & is not used, certain combinations of primitive descriptors
       #         are not allowed. Specifically, combinations that together
       #         form the symbol for an element will be read as the element (Ar, Rh, etc.)
       #         when NOT followed by a digit and no element has already been defined 
       #         So, for example, [Ar] is argon, [Ar3] is [A&r3], [ORh] is [O&R&h],  
       #         but [Ard2] is [Ar&d2] -- "argon with two non-hydrogen connections"
       #         Also, "!" may not be used with implied "&". 
       #         Thus, [!a], [!a&!h2], and [h2&!a] are all valid, but [!ah2] is invalid.             
   [atomPrimitive] == { [isotope] | [atomType] | [charge] | [stereochemistry]
                              | [a_Prop] | [A_Prop] | [D_Prop] | [d_Prop] | [H_Prop] | [h_Prop] 
                              | [R_Prop] | [r_Prop] | [v_Prop] | [X_Prop]
                              | [x_Prop] | [nestedExpression] }
   [isotope] == [digits] | [digits] "?"
       # note -- isotope mass may come before or after element symbol, 
       #         EXCEPT "H1" which must be parsed as "an atom with a single H" 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
       # note -- "?" here (unspecified) is not relevant in 3D-SEARCH 
   
   [A_Prop] == "#" [digits]           # elemental atomic number
   [a_Prop] == "=" [digits]           # atom index (starts with 0)
   [D_Prop] == { "D" [digits] | "D" } # degree -- total number of connections 
                                      #   excludes implicit H atoms; default 1
   [d_Prop] == { "d" [digits] | "d" } # degree -- non-hydrogen connections
                                      #   default 1 
   [H_Prop] == { "H" [digits] | "H" } # exact hydrogen count 
                                      #   excludes implicit H atoms
   [h_Prop] == { "h" [digits] | "h" } # implicit hydrogens -- "h" indicates "at least one"
                                      #   (see note below)
   [R_Prop] == { "R" [digits] | "R" } # ring membership; e.g. "R2" indicates "in two rings"
                                      #   "R" indicates "in a ring" 
                                      #   "!R" or "R0" indicates "not in any ring"
   [r_Prop] == { "r" [digits] | "r" } # in ring of size [digits]; "r" indicates "in a ring"
   [v_Prop] == { "v" [digits] | "v" } # valence -- total bond order (a double bond counts as 2, for example)
   [X_Prop] == { "X" [digits] | "X" } # connectivity -- total number of connections
                                      #   includes implicit H atoms
   [x_Prop] == { "x" [digits] | "x" } # ring connectivity -- total ring connections
   
 ######## Nested and Multiple Expressions ########
 
   [nestedExpression] == "$(" [atomExpression] ")"
      # note: nestedExpressions return only the first atom as a match, 
      #       not all atoms in the expression.

   [multipleExpression] == { "[$(" [orExpression] ")" [nTimes] "]" 
                             | "[$(" [orExpression] ")" [nMinimum] "-" [nMaximum] "]" 
                             | "[$(" [orExpression] "|" [orExpression] ")" "]" 
                             | "[$(" [orExpression] "||" [orExpression] ")" "]" }
   [orExpression] == { [atomExpression] 
                       | [atomExpression] "|" [orExpression] 
                       | [atomExpression] "||" [orExpression] }
      # note: "|" and "||" are synonymous in this inner context; "|" is preferred simply
      #       for readability (whereas "||" is required for the [smartsSet] context). 
      # note: This syntax is carefully written to exclude [$(xxx)] by itself, which
      #       is a nestedExpression, not a multipleExpression. The difference is that
      #       the nestedExpression only returns the first atom, while the multipleExpression
      #       returns all atoms. To return only the first atom within this context 
      #       it is necessary to use a nested expression within the multiple expression.
      #       For example: "CC[$( $(C=O) | $(C=N) )2]"
      #       is the same as "CC$(C=[O,N])$(C=[O,N])", although Jmol preprocesses it as
      #          "CC$(C=O)$(C=O)||CC$(C=O)$(C=N)||CC$(C=N)$(C=O)||CC$(C=N)$(C=N)"
      
   [nTimes] == [digits]
   [nMinimum] == [digits]
   [nMaximum] == [digits]
      # note: multipleExpressions allow for searching a given number of expressions or 
      #       a variable number of expressions (including 0, perhaps)
      #       Jmol pre-processes these expressions and turns them into a set:
      #       pattern1 || pattern2 || pattern3....
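
The preprocessing described in the notes above, where a multipleExpression becomes a set of patterns joined by "||", can be illustrated with a short Python sketch. It is not Jmol's implementation: the function name expand_multiple is hypothetical, the alternatives are passed in already split (no parsing of "[$(...)]" is attempted), and the ordering of the generated patterns is simply chosen to agree with the example given in the notes.

```python
from itertools import product


def expand_multiple(prefix, alternatives, n_min, n_max=None, suffix=""):
    """List every pattern obtained by repeating one of `alternatives`
    between n_min and n_max times (n_max defaults to n_min).

    Hypothetical helper for illustration only.
    """
    if n_max is None:
        n_max = n_min
    patterns = []
    for count in range(n_min, n_max + 1):
        # product(..., repeat=count) enumerates every ordered choice.
        for combo in product(alternatives, repeat=count):
            patterns.append(prefix + "".join(combo) + suffix)
    return patterns


# "CC[$( $(C=O) | $(C=N) )2]" with white space removed:
expanded = "||".join(expand_multiple("CC", ["$(C=O)", "$(C=N)"], 2))
```

With these inputs, expanded is the four-pattern string quoted in the note above. A range such as 0-2 adds the shorter repeats, including the bare prefix for a count of zero.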

 ######## BioSEQUENCE ########

   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] [A_Prop] "]" 
                 | [bioResidueCode] } 
   [bioResidueName] == { "*" | "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == { "*" | "0" | "C" | "CA" | "N" ... } (case-insensitive)
      # note: "0" indicates the "lead atom":
      #   nucleic: P if present, or H5T if present, or O5'/O5*
      #   protein: CA
      #   carbohydrate: the first atom of the group listed in the model file
   [bioResidueCode] == { "*" | "A" | "R" | "G" ... } (case-sensitive)
      # note: wildcard or standard group 1-letter-code
      #       or, in the case of RNA or DNA:
      #         "N" (any residue; same as "*"), 
      #         "R" (any purine -- A or G)
      #         "Y" (any pyrimidine -- C or T or U)
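
As a small illustration of the RNA/DNA wildcard codes listed above, the following Python sketch tests whether a 1-letter residue code in a pattern accepts a given residue. The table and the function name nucleic_code_matches are assumptions for this example only; in a protein sequence "R" and "N" are ordinary residue codes, so this applies to nucleic sequences alone.

```python
# Wildcards for RNA/DNA patterns; None means "any residue".
NUCLEIC_WILDCARDS = {
    "*": None,
    "N": None,
    "R": {"A", "G"},        # any purine
    "Y": {"C", "T", "U"},   # any pyrimidine
}


def nucleic_code_matches(code, residue):
    """Return True if pattern code `code` accepts the 1-letter `residue`.

    Codes are case-sensitive, as in the grammar above.
    """
    if code in NUCLEIC_WILDCARDS:
        allowed = NUCLEIC_WILDCARDS[code]
        return allowed is None or residue in allowed
    return code == residue
```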

 ######## CONNECTIONS (aka "rings") ########

   [connectionPointer] == { [digit] | "%" [digit][digit] | "%(" [digits] ")" }
      # note: All connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bondExpression
      #       for the second occurrence. The matching connectionPointers may be
      #       in different "components" (separated by "."), in which case they
      #       represent general connections and not necessarily rings.
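
The three connectionPointer forms and the pairing rule can be sketched as follows. This is an illustrative reading of the grammar, not Jmol's code: the function names are hypothetical, digits inside square brackets are skipped because they are isotopes or primitives rather than pointers, and patterns containing "(." measures are assumed to be absent.

```python
DIGITS = "0123456789"


def read_connection_pointer(s, i):
    """Return (number, next_index) if a connectionPointer starts at s[i], else None."""
    ch = s[i]
    if ch in DIGITS:                      # [digit]
        return int(ch), i + 1
    if ch != "%":
        return None
    if s.startswith("%(", i):             # "%(" [digits] ")"
        end = s.find(")", i)
        body = s[i + 2:end] if end > 0 else ""
        if not body or any(c not in DIGITS for c in body):
            raise ValueError("malformed %(n) connection pointer")
        return int(body), end + 1
    two = s[i + 1:i + 3]                  # "%" [digit][digit]
    if len(two) == 2 and all(c in DIGITS for c in two):
        return int(two), i + 3
    raise ValueError("malformed % connection pointer")


def unmatched_pointers(s):
    """Return the set of pointer numbers that were opened but never closed."""
    open_ids = set()
    depth = 0                             # nesting level of [ ... ]
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            hit = read_connection_pointer(s, i)
            if hit is not None:
                number, i = hit
                open_ids ^= {number}      # first use opens, second use closes
                continue
        i += 1
    return open_ids
```

An empty result means every pointer found its partner, including pairs that span components separated by ".".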

 ######## BONDS ########

   [bondExpression] == { [bondOrSet] | [bondOrSet] ";" [bondAndSet] } 
   
   [bondOrSet] == { [bondAndSet] | [bondAndSet] "," [bondAndSet] }
   [bondAndSet] == { [bondPrimitives] | [bondPrimitives] "&" [bondAndSet]
                              | "!" [bondPrimitive] 
                              | "!" [bondPrimitive] "&" [bondAndSet] }
                                              
 ######## BOND PRIMITIVES ########
                              
   [bondPrimitives] == { [bondPrimitive] | [bondPrimitive] [bondPrimitives] }       
   [bondPrimitive] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | "~" | "@" | "+" | "^" | NULL }
      # note: Not all bondExpressions are valid. Stereochemistry should not 
      #       be mixed with the others, as it represents a single bond always.
      #       In addition, "." ("no bond") cannot be mixed with any bond type.
      #       Nothing would be retrieved by "-&=", as a bond cannot be both single
      #       and double. However, "-@" is potentially very useful -- "ring single-bonds"
      #       or "=&!@" -- "doubly-bonded atoms where the double bond is not in a ring"
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]
      # note: "+" indicates "adjacent biomolecular groups in a chain"
      # note: a bioSEQUENCE ends with "." or the end of the string. A new bioSEQUENCE
      #       can continue with "~" immediately following this "." 
      # note: For a SMARTS search, "." indicates the start of a new subset, not necessarily a
      #       new component.
      # note: "^" indicates atropisomer bond with positive dihedral angle
      
 ######## MEASURES ########
   
   [measure] == { [measureId] | [measureId] ":" [range] | [measureId] ":!" [range] }
   [measureId] == { [measureCode] | [measureCode] [digits] }
   [measureCode] == { "d" | "a" | "t" }
   [range] == [minimumValue] { "," | "-" } [maximumValue]
   [minimumValue] == [decimalNumber]
   [maximumValue] == [decimalNumber]
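
The measure syntax above is compact enough to capture in one regular expression. The Python sketch below is an assumption-laden illustration rather than Jmol's parser: [decimalNumber] is not defined in this document, so unsigned decimals are assumed here, and the dictionary keys (including "negated" for the ":!" form) are names invented for this example.

```python
import re

# Assumed form of [decimalNumber]: unsigned, e.g. "1", "1.5", ".5"
_NUMBER = r"(?:\d+(?:\.\d*)?|\.\d+)"
_MEASURE = re.compile(
    r"([dat])(\d*)"                                          # [measureCode] [digits]
    r"(?::(!?)(" + _NUMBER + r")[,-](" + _NUMBER + r"))?"    # optional range
)


def parse_measure(text):
    """Parse the text found between "(." and ")" after an atomExpression."""
    m = _MEASURE.fullmatch(text)
    if m is None:
        raise ValueError("not a valid measure: " + repr(text))
    code, index, bang, low, high = m.groups()
    return {
        "code": code,
        "index": int(index) if index else None,
        "negated": bang == "!",
        "range": (float(low), float(high)) if low is not None else None,
    }
```

For example, parse_measure("d2:1.3-1.5") reports code "d", index 2, and the range (1.3, 1.5).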

Jmol 3D-SEARCH Definition of "aromatic"

We define "aromatic" here strictly in terms of geometry - a flat ring with trigonal planar geometry for all atoms in the ring. No consideration of bond order is used, because for the sorts of models that can be loaded into Jmol, many do not assume a bonding scheme (PDB, GAUSSIAN, etc.).

Given a ring of N atoms...

                  1
                /   \
               2     6 -- 6a
               |     |
         5a -- 5     4
                \   /
                  3  
with arbitrary order and up to N substituents...
  1. Check to see if all ring atoms have no more than 3 connections. Note: An alternative definition might include "and no substituent is explicitly double-bonded to its ring atom," as in quinone. Here we opt to allow the atoms of quinone to be called "aromatic."
  2. Select a cutoff value close to zero. We use 0.01 here.
  3. Generate a set of normals as follows:
    1. For each ring atom, construct the normal associated with the plane formed by that ring atom and its two nearest ring-atom neighbors.
    2. For each ring atom with a substituent, construct a normal associated with the plane formed by its connecting substituent atom and the two nearest ring-atom neighbors.
    3. If this is the first normal, assign it to vMean.
    4. If this is not the first normal, check vNorm.dot.vMean. If this value is less than zero, scale vNorm by -1.
    5. Add vNorm to vMean.
  4. Calculate the standard deviation of the dot products of the individual vNorms with the normalized vMean.
  5. The ring is deemed flat if this standard deviation is less than the selected cutoff value.
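
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Jmol's implementation: each normal is scaled to unit length before use, the "two nearest ring-atom neighbors" are taken to be the atoms before and after in ring order, the sample standard deviation is used (the text does not say which form is intended), and the name is_flat_ring and its arguments are hypothetical.

```python
import math
import statistics


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(v):
    length = math.sqrt(_dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def is_flat_ring(ring, substituents=None, connections=None, cutoff=0.01):
    """ring: (x, y, z) tuples in ring order; substituents: {ring index: (x, y, z)};
    connections: optional connection count per ring atom."""
    n = len(ring)
    # Step 1: no ring atom may have more than 3 connections.
    if connections is not None and any(c > 3 for c in connections):
        return False
    substituents = substituents or {}
    # Steps 3.1 and 3.2: one plane per ring atom, then one per substituent.
    apexes = [(i, ring[i]) for i in range(n)]
    apexes += [(i, substituents[i]) for i in sorted(substituents)]
    normals = []
    v_mean = None
    for i, apex in apexes:
        before, after = ring[i - 1], ring[(i + 1) % n]
        v_norm = _unit(_cross(_sub(before, apex), _sub(after, apex)))
        if v_mean is None:
            v_mean = v_norm                       # step 3.3
        else:
            if _dot(v_norm, v_mean) < 0:          # step 3.4
                v_norm = (-v_norm[0], -v_norm[1], -v_norm[2])
            v_mean = (v_mean[0] + v_norm[0],      # step 3.5
                      v_mean[1] + v_norm[1],
                      v_mean[2] + v_norm[2])
        normals.append(v_norm)
    # Steps 4 and 5: spread of the dot products with the normalized mean.
    v_mean = _unit(v_mean)
    dots = [_dot(v, v_mean) for v in normals]
    return statistics.stdev(dots) < cutoff


# A planar hexagon with one in-plane substituent on every ring atom.
hexagon = [(math.cos(math.radians(60 * k)), math.sin(math.radians(60 * k)), 0.0)
           for k in range(6)]
radial = {k: (2 * x, 2 * y, 0.0) for k, (x, y, _) in enumerate(hexagon)}
```

Calling is_flat_ring(hexagon, radial, [3] * 6) accepts the planar ring, while lifting a single ring atom out of the plane by half a bond length is enough to push the deviation past the 0.01 cutoff.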

-- Bob Hanson last updated 6/12/2010