Jmol 3D-SEARCH bioSMILES/bioSMARTS

Robert M. Hanson
Department of Chemistry
St. Olaf College
5/19/2010

This document describes a specification for an extension of SMILES and SMARTS for use in 3D molecular atom search and selection as well as biomolecular sequence and cross-link searching. This specification is implemented in Jmol 12.0. It is really a set of specifications:

bioSMILES An extension of SMILES that incorporates biomolecular sequence and cross-linking information along with more standard molecular or ionic components that use an only slightly extended SMILES atom coding.
bioSEQUENCE A component subset of bioSMILES that starts with tilde, "~". The bioSEQUENCE coding allows for extensive searching of biomolecular frameworks. The coding basically substitutes residues for SMILES atoms and cross-linking and base pairing for SMILES "ring" connections. Subsets of bioSEQUENCE include:
- ~p~ protein-only sequence
- ~n~ nucleic-only sequence
- ~r~ rna-only sequence
- ~d~ dna-only sequence
bioSMARTS An extension of SMARTS substructure searching that allows searching of both bioSEQUENCE information and standard SMARTS substructures within SMILES strings, bioSMILES strings, and 3D molecular models.
3D-SMARTS A subset of bioSMARTS that allows searching of molecular distance, angle, and torsion measurements.

The org.jmol.smiles package provides extensive functionality for selecting atoms within a three-dimensional model based on SMILES and SMARTS strings. This package may be used independently of Jmol -- see JmolSmilesApplet.java and JmeToJmol.htm.

Besides a presentation of general considerations, a detailed specification for syntax is given, and the term "aromatic" is defined.

General Considerations

format

Allows for searching "pattern1 or pattern2" and for a variable number of occurrences of a pattern within a pattern using [$(...)n] or [$(...)min-max].
Allows any amount of white space -- spaces, tabs, new lines. Prior to parsing, all white space is removed.

Comments in the form //*....*// are allowed anywhere within the string. The following example illustrates the use of comments and white space for the bioSMILES representations of several models:

$ load 1crn.pdb; print {*}.find("SMILES",true)
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
~p~TTC:1C:2PSIVARSNFNVC:3RLPGTPEAIC:3ATYTGC:2IIIPGATC:1PGDYAN

$ load 1blu.pdb; print {*}.find("SMILES",true)
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
~p~ALMITDECINCDVCEPECPNGAISQGDETYVIEPSLCTECVGHYETSQCVEVCPVDCIIKDPSHEETEDELRAK
  YERITG.
//* FS4 *//[Fe@@]123[S]4[Fe@@]56[S]7[Fe@]84[S]3[Fe@@]97[S]26.
  [CYS.SG#16//* 8 *//]8.[CYS.SG#16//* 53 *//]1.[CYS.SG#16//* 14 *//]9.[CYS.SG#16//* 11 *//]5.
//* FS4 *//[Fe@@]%10%11%12[S]%13[Fe@@]%14%15[S]%16[Fe@]%17%13[S]%12[Fe@@]%18%16[S]%11%15.
  [CYS.SG#16//* 37 *//]%17.[CYS.SG#16//* 18 *//]%10.[CYS.SG#16//* 49 *//]%18.[CYS.SG#16//* 40 *//]%14.
//* HOH *//[O]

$ load 1d66.pdb;print {*}.find("SMILES", true)
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
//* chain D dna *// ~d~CCGGAGGACAGTCCTCCGG.
//* chain E dna *// ~d~CCGGAGGACTGTCCTCCGG.
//* chain A protein *// ~p~EQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEV
  ESRLERL.
//* chain B protein *// ~p~EQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEV
  ESRLERL.
//* CD *//[Cd]1234[Cd]567[CYS.SG#16//* 28:B *//]4.[CYS.SG#16//* 11:B *//]15.
  [CYS.SG#16//* 31:B *//]2.[CYS.SG#16//* 21:B *//]7.[CYS.SG#16//* 14:B *//]6.[CYS.SG#16//* 38:B *//]3.
//* CD *//[Cd]89%10%11[Cd]%12%13%14[CYS.SG#16//* 28:A *//]%10.[CYS.SG#16//* 14:A *//]%13.
  [CYS.SG#16//* 31:A *//]9.[CYS.SG#16//* 21:A *//]%14.[CYS.SG#16//* 11:A *//]8%12.[CYS.SG#16//* 38:A *//]%11.
//* HOH *//[O]

$ load 1d66.pdb;calculate hbonds;print {*}.find("SMILES", true)
108 hydrogen bonds
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
//* chain D dna *// ~d~C:1C:2G:3G:4A:5G:6G:7A:8C:9A:%10G:%11T:%12C:%13C:%14T:%15
  C:%16C:%17G:%18G:%19.
//* chain E dna *// ~d~C:%19C:%18G:%17G:%16A:%15G:%14G:%13A:%12C:%11T:%10G:9T:8
  C:7C:6T:5C:4C:3G:2G:1.
//* chain A protein *// ~p~E:%20QA:%20C:%21D:%22:%23I:%24C:%25:%26R:%21:%27L:%23
  K:%24K:%27:%25L:%26KCSK:%28EK:%28PKC:%29A:%30K:%31C:%32:%33L:%29:%34K:%30
  N:%31N:%34:%32W:%33E:%35CR:%35:%22YSPKTKRSP:%36LT:%36:%37R:%38A:%39H:%40L:%37:%41
  T:%38:%42E:%39:%43:%44V:%40:%45E:%41:%46S:%43:%42R:%44L:%45:%47:%48E:%46R:%47
  L:%48.
//* chain B protein *// ~p~EQAC:%49D:%50:%51I:%52C:%53:%54R:%49:%55L:%51K:%52
  K:%55:%53L:%54KCSK:%56EK:%56PKC:%57:%58A:%59K:%60C:%57:%61:%62L:%58:%63K:%59
  N:%60N:%61:%63W:%62ECR:%50YSPKTKRSP:%64LT:%64:%65R:%66A:%67H:%68L:%65:%69
  T:%66:%70E:%67:%71V:%68:%72E:%69:%73S:%70:%74R:%71L:%72:%75E:%73R:%74L:%75.
//* CD *//[Cd]%76%77%78%79[Cd]%80%81%82[CYS.SG#16//* 28:B *//]%79.[CYS.SG#16//* 11:B *//]%76%80.
  [CYS.SG#16//* 31:B *//]%77.[CYS.SG#16//* 21:B *//]%82.[CYS.SG#16//* 14:B *//]%81.[CYS.SG#16//* 38:B *//]%78.
//* CD *//[Cd]%83%84%85%86[Cd]%87%88%89[CYS.SG#16//* 28:A *//]%85.[CYS.SG#16//* 14:A *//]%88.
  [CYS.SG#16//* 31:A *//]%84.[CYS.SG#16//* 21:A *//]%89.[CYS.SG#16//* 11:A *//]%83%87.[CYS.SG#16//* 38:A *//]%86.
//* HOH *//[O]
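
The white-space and comment rules above can be sketched in a few lines of Python. This is only an illustration, not Jmol's implementation: the function name `normalize` is hypothetical, and it assumes that comments are not nested and that they are dropped before white space is removed.

```python
import re

# Comments are written as //* ... *// and may span lines (assumed non-nested).
COMMENT = re.compile(r"//\*.*?\*//", re.DOTALL)

def normalize(pattern: str) -> str:
    """Drop //*...*// comments, then remove every white-space character."""
    without_comments = COMMENT.sub("", pattern)
    return "".join(without_comments.split())

raw = """//* chain D dna *// ~d~CCGGAGGACAGTCCTCCGG.
//* HOH *//[O]"""
print(normalize(raw))
```

Applied to the annotated output shown above, this leaves only the characters that carry SMILES meaning, so a residue comment such as `[CYS.SG#16//* 8 *//]8.` reduces to `[CYS.SG#16]8.`.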

Jmol recognizes "/..../" at the beginning of a pattern as processing flags. The two flags noAromatic and noStereo are defined. "/noAromatic/" turns off all checks for aromaticity, speeding processing when that is not important or no distinction between aromatic and nonaromatic atoms is desired. "/noAromatic,noStereo/" indicates that in addition, no stereochemical checking should be done.
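
As a rough sketch of how leading processing flags might be separated from the pattern itself, the following Python function peels off "/.../" groups from the front of a string. The name `split_flags` is hypothetical, and the substring test for flag names is an assumption chosen so that both "/noAromatic,noStereo/" and "/noAromatic//noStereo/" are accepted; Jmol's own parser may differ.

```python
import re

KNOWN_FLAGS = ("noaromatic", "nostereo")  # flag names compared case-insensitively

def split_flags(pattern: str):
    """Return (set of flags, remaining pattern) for leading /.../ groups."""
    flags = set()
    rest = pattern
    while True:
        m = re.match(r"/([^/]*)/", rest)
        if not m:
            break
        body = m.group(1).lower()
        found = {f for f in KNOWN_FLAGS if f in body}
        if not found:
            break  # not a flag group; leave the text for the pattern parser
        flags |= found
        rest = rest[m.end():]
    return flags, rest

print(split_flags("/noAromatic,noStereo/CC=O"))
```

Stopping at the first group that contains no known flag keeps an ordinary pattern untouched when it happens to contain slashes.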

bioSMILES/bioSMARTS

Jmol 12.0 extends SMILES to biomolecular description and extends SMARTS to the searching of both SMILES strings and three-dimensional models, including searches of sequence and cross-linking information as well as molecular characteristics such as distance, angle, and torsion ranges. The extension involves only a few simple additions:
1. [residueName.atomName] For selecting specific residue atoms. Wild cards are optional: [ALA.*], [*.*], [*.CA]. The special designation "0" for an atom name, as in [GLY.0], indicates the "lead atom" -- the alpha carbon for proteins or the phosphorus atom in nucleic acids.
2. [residueName.atomName#atomicNumber] The residue/atom specifier may be extended with atomic number information. This allows searching a bioSMILES string using SMARTS patterns that only involve standard atom types. In the above example, notice that the atoms within the non-bioSEQUENCE components that connect to protein chains indicate those connections using this extended notation. Thus, both the actual 3D model and the bioSMILES string for 1d66 can be searched using the SMARTS search "CdS" as well as the more specific search "Cd[*.SG]".
3. "~" Tilde indicates a bioSEQUENCE string. Subsets include "~p~" protein-only sequence, "~n~" nucleic-only sequence, "~d~" DNA-only sequence, and "~r~" RNA-only sequence. Generally, the string will be standard single-character group symbols. For example: select search("~d~AG") or select search("~p~A[G,P]"). This search returns only the "lead atom" -- the 1-letter code is a stand-in for [XXX.0]. Groups may also be indicated in their standard Jmol notation, [XXX], or bioSMILES notation [XXX.YYY]. These may be mixed. So, for example, ~p~G[ALA]C[ALA.N]. The bioSEQUENCE tilde must be repeated for each new component in the search (separated by ".").
4. "+" and ":" Jmol's extension adds the + and : bond types for sequences. + indicates "connected groups", and ":" indicates "cross-linked groups". Recognized cross-linking includes hydrogen bonding between the purine N1 and pyrimidine N3 in nucleic acids and cysteine-cysteine disulfide bonds in proteins. Cross-linking can be "one-to-one" or "one-to-many". (In the future this could be expanded to carbohydrates.) The standard SMILES branching notation can also be used to represent cross-linking: In normally hybridized DNA, ~C+C+C:G+G+G would be three CG base-pairs (because the two strands are going in opposite directions). The default for bioSEQUENCE connections is "+", so this search could also be represented simply as ~CCC:GGG. "+" is only necessary when the search is not a bioSEQUENCE. For example, {[CYS.CA]+[PRO.N]} allows selection of the alpha carbon on one residue and the backbone nitrogen atom on the next.
5. branches and rings Jmol's extension still allows for standard SMILES branching notation using parentheses. Within bioSEQUENCES, branching indicates cross-linking. Thus, the above CG stretch could be indicated as C(G)C(G)C(G). An empty branch, C(), indicates "not cross-linked" -- in this case a cysteine without a disulfide bond or a cytidine that is not base-paired. For bioSMILES, which shows all cross-linking explicitly, the empty parentheses may be omitted. Ring notation can also be used: C:1CC(GGG:1) ensures that the hybridization is as a true set of CG pairs. In order to allow for more than 99 branches (or base pairs), we extend the ring % notation by enclosing the number in parentheses. For example, a section of the large ribosomal unit for PDB code 1FFK reads:
```
//* chain 9 rna *// ~r~UUAG:%(792)G:%(793)C:%(794)G:%(795)G:%(796)C:%(797)CAC
  AG:%(798)C:%(799)G:%(800)G:%(801)U:%(802)G:%(803)G:%(804)G:%(805)GUUGCCUC:%(806)
  C:%(807)C:%(808)G:%(809)U:%(810)ACCC:%(811)AUCCCG:%(811)AACA:%(810)C:%(809)
  G:%(808)G:%(807)AAG:%(806)AU:%(812)AA:%(812)GC:%(805)C:%(804)C:%(803)A:%(802)
  C:%(801)C:%(800)AG:%(799)C:%(798)GUUC:%(813)C:%(814)G:%(815)G:%(816)G:%(817)
  GAGUAC:%(818)U:%(819)G:%(820)G:%(821)A:%(822)G:%(823)UG:%(824)C:%(825)GCG
  AG:%(825)C:%(824)C:%(823)U:%(822)C:%(821)U:%(820)G:%(819)G:%(818)GAAAC:%(817)
  C:%(816)C:%(815)G:%(814)G:%(813)UUCG:%(797)C:%(796)C:%(795)G:%(794)C:%(793)
  C:%(792)ACC.
```
6. measurements An additional extension, consisting of a measurement type (d - distance, a - angle, or t - torsion) allows SMARTS searches to target specific ranges of values.
7. processing flags Jmol recognizes "/..../" at the beginning of a pattern as processing flags. Currently the only flags supported are "noAromatic" and "noStereo". "noAromatic" turns off all aromaticity checks. This may be desirable when no distinction between aromatic and nonaromatic atoms is needed. For large biomolecules /noAromatic/ can dramatically improve processing speed when no check for aromaticity is necessary. All atoms are then considered NOT aromatic. "noStereo" turns off all stereochemical checking.
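
To make the bioSEQUENCE notation in items 3 to 5 concrete, here is a minimal Python sketch that reads a single bioSEQUENCE component in the linear form Jmol prints (one-letter codes or bracketed groups, each optionally followed by ":" cross-link pointers written as a digit, %nn, or %(n)) and groups residue positions by pointer number. The function names are hypothetical, branch notation with parentheses is deliberately not handled, and the string is assumed to be already free of comments and white space.

```python
import re

# One residue: a bracketed group or a single letter, then zero or more ":" pointers.
TOKEN = re.compile(r"(\[[^\]]*\]|[A-Za-z])((?::(?:%\(\d+\)|%\d\d|\d))*)")
POINTER = re.compile(r":(?:%\((\d+)\)|%(\d\d)|(\d))")

def parse_biosequence(seq: str):
    """Return (bioType or None, [(residue, [pointer numbers]), ...])."""
    head = re.match(r"~(?:([pnrd])~)?", seq)
    if not head:
        raise ValueError("a bioSEQUENCE component must start with '~'")
    body = seq[head.end():].rstrip(".")
    residues, pos = [], 0
    while pos < len(body):
        tok = TOKEN.match(body, pos)
        if not tok:
            raise ValueError(f"unsupported syntax at position {pos}")
        ids = [int(a or b or c) for a, b, c in POINTER.findall(tok.group(2))]
        residues.append((tok.group(1), ids))
        pos = tok.end()
    return head.group(1), residues

def cross_links(residues):
    """Map each pointer number to the 0-based residue positions that carry it."""
    links = {}
    for index, (_, ids) in enumerate(residues):
        for n in ids:
            links.setdefault(n, []).append(index)
    return links

kind, res = parse_biosequence(
    "~p~TTC:1C:2PSIVARSNFNVC:3RLPGTPEAIC:3ATYTGC:2IIIPGATC:1PGDYAN")
print(kind, len(res), cross_links(res))
```

For the 1crn string shown earlier, the three pointer numbers pair up the six cysteines, which is how the disulfide bonds are encoded in the sequence.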

aromaticity

Jmol 3D-SEARCH defines "aromatic" unambiguously and strictly geometrically (see below).
Note that "aromatic" is not restricted to any specific subset of elements.
For large biomolecule searches, the search for aromatic rings can be time consuming and unnecessary. Adding the flag "/noAromatic/" at the beginning of the search pattern will turn off all checks for aromaticity and may dramatically increase processing speed.

Comparison to Daylight SMILES

All single-component aspects of Daylight SMILES are implemented, including aromaticity and atom- and bond-based stereochemistry ("chirality").

Comparison to Daylight SMARTS

[H1] interpreted as [*H1] -- "an atom with one connected H atom".
Allows definition of [$XXX] variables:
```
      Var x = '$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]' // select aromatic atoms attached to CH3 or NH2  
      select within(SMARTS,@x)
```
Note that these variables are any string whatsoever, not just atom sets. The syntax is simply:
- Each variable definition takes the form $ [name] =" [definition] " [comments] ;
- [name] can be any characters except '$', '=', and ']' and must not start with '('. It is recommended they be restricted to the set A-Z, a-z, and 0-9.
- [definition] can be any valid SMARTS characters.
- [comments] can be any characters other than ';'.
- The actual pattern starts after the last variable definition.
- Nested variables are allowed, but note that this may require using the recursion syntax, $(...):
```
      Var x = '$R1="[CH3,NH2]";$R2="[$($R1),OH]"; {a}[$R2]' // select aromatic atoms attached to CH3, NH2, or OH  
      select within(SMARTS,@x)
```
- For $xxx="yyyy", all occurrences of the string "[$xxx]" are replaced within the pattern prior to parsing.
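
The substitution rules in the list above amount to a textual preprocessing step, which can be sketched in Python as follows. This is an illustration under stated assumptions, not Jmol's code: `expand_variables` is a hypothetical name, white space is removed first (as the format section describes), and replacement is simply repeated until nothing changes so that definitions which themselves contain "[$name]" references are resolved.

```python
import re

# $name="definition" comments;  -- name may not contain $, = or ] nor start with (
VAR_DEF = re.compile(r'\$([^$=\](][^$=\]]*)="([^"]*)"([^;]*);')

def expand_variables(pattern: str) -> str:
    """Strip white space, read leading variable definitions, expand [$name]."""
    pattern = "".join(pattern.split())
    definitions, pos = {}, 0
    while True:
        m = VAR_DEF.match(pattern, pos)
        if not m:
            break
        definitions[m.group(1)] = m.group(2)
        pos = m.end()
    body = pattern[pos:]  # the actual pattern starts after the last definition
    for _ in range(len(definitions) + 1):  # bounded number of passes
        expanded = body
        for name, definition in definitions.items():
            expanded = expanded.replace("[$" + name + "]", definition)
        if expanded == body:
            break
        body = expanded
    return body

print(expand_variables('$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]'))
```

Because the replacement is purely textual, a variable can stand for any string, which matches the note above that variables are not limited to atom sets.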

Implements nested ("recursive") SMARTS:

 
      Var x = '$R1="[CH3,NH2]";$R2="[OH]";  {a}[$([$R1]),$([$R2])]' // aromatic attached to CH3, NH2, or OH
      select within(SMARTS,@x)

Note that $(...) need not be within [...], and wherever it is, it always means "just the first atom".

primitives

All Daylight SMARTS primitives are implemented. These include:

[Element]	capitalized - standard notation Na, Si, etc. -- specific non-aromatic atom
[element]	uncapitalized - specific aromatic atom (as for standard notation, no limitations)
*	any atom
A	any non-aromatic atom
a	any aromatic atom
#	atomic number
(integer)	mass number -- Note, however, that [H1] is [*H1], "any atom with one attached hydrogen", not unlabeled hydrogen, [1H].
D	degree - total number of connections
H	exact hydrogen count
h	"implicit" hydrogen count (atoms are not in structure)
R	in the specified number of rings
r	in ring of a given size
v	valence (total bond order)
X	calculated connectivity, including implicit hydrogens
x	number of ring bonds
@	stereochemistry

In addition, Jmol 3D-SMARTS adds the following primitives and options:

d	non-hydrogen degree -- number of non-hydrogen connections
=	Jmol atom index, for example: [=23]
[number]?	mass number or undefined (so, for example, [C12?] means any carbon that isn't explicitly C13 or C14)
[$(pattern)n]	A specific number of occurrences of pattern. For example, C[$(C=C)3]C is synonymous with CC=CC=CC=CC.
[$(pattern)min-max]	A variable number of occurrences of pattern. For example: A[$(C:G)0-2]A is synonymous with AA or AC(:G)A or AC(:G)C(:G)A.
pattern1 || pattern2	"||" indicates "or" and allows searching for multiple patterns, which may overlap. For example: select search("c{O} || c{C}"). Note that the "||" syntax is an alternative to using "[,]", in this case being equivalent to (and slightly slower than) select search("c{[O,C]}").
(.measure)	The extension capitalizes on the fact that in a standard SMARTS string, period "." cannot ever appear immediately following an open parenthesis "(". Using this fact, the format involves the following: "(." [single character type - "d" (distance), "a" (angle), or "t" (torsion)] [optional numeric identifier] ":" [optional "!" (not)] [minimum value] { "," | "-" } [maximum value] ")" This extension must appear immediately following an element symbol or a bracketed atom expression. The separators "," or "-" between minimum and maximum values are equivalent.
For example, the following will find all aliphatic carbon-carbon bonds that are between 1.5 and 1.6 angstroms long: select search("C(.d:1.5-1.6)C")
The following will select for all trans-diaxial methyl groups on a cyclohexane ring, finding all torsions that are outside the range -160 to 160 degrees: select search("[CH3](.t:!-160,160)CC[CH3]")
The default in terms of specifying which atoms are involved is simply "the next N-1 atoms," where N is 2, 3, or 4. For more complicated patterns, one can designate the specific atoms in the measurement using a numeric identifier after the measurement type. The following will target the bond angle across the carbonyl group in the backbone of a peptide: select search("[.CA](.a1:105-110)C(.a1)(=O)N(.a1)")
Designations can overlap; one simply adds whatever (.xn) designation is wanted after the desired atoms: select search("C(.a1:105,108)C(.a1)(.a2:110,130)C(.a1)(.a2)C(.a2)")
In Jmol, this capability is extended to the measure* command for easy access to SMARTS-based measurements: select * measure search("C(.a1:110,130)C(.a1)(=O)C(.a1)")
Note that the atoms in no way have to be connected. The only restriction is that the three markers for an angle or the four markers for a torsion will be identified in order from left to right within the SMARTS string.
The following, for example, will find all carbonyl oxygen atoms that are within 5 angstroms of each other: select search("{O}(.d1:0,5)=C.{O}(.d1)=C") The "." here indicating "not bonded." {O} specifies that although we want to find the entire set, we only want to select the oxygen atoms. The close of the selection brace may appear before or after the (.x) designation.
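
The (.measure) format described above is regular enough to pick out with a single regular expression. The Python sketch below is only an illustration: the names `find_measures` and `satisfies` are hypothetical, numeric values are assumed to be plain decimal numbers, and treating the range bounds as inclusive is an assumption rather than something the specification states.

```python
import re

# "(." type [id] [ ":" ["!"] min ("," | "-") max ] ")"
MEASURE = re.compile(
    r"\(\.(?P<kind>[dat])(?P<id>\d*)"
    r"(?::(?P<neg>!?)(?P<lo>-?\d+(?:\.\d+)?)[,-](?P<hi>-?\d+(?:\.\d+)?))?\)"
)

def find_measures(smarts: str):
    """List every (.x...) marker in order of appearance."""
    found = []
    for m in MEASURE.finditer(smarts):
        has_range = m.group("lo") is not None
        found.append({
            "kind": m.group("kind"),                            # d, a, or t
            "id": int(m.group("id")) if m.group("id") else None,
            "negate": m.group("neg") == "!",
            "range": (float(m.group("lo")), float(m.group("hi"))) if has_range else None,
        })
    return found

def satisfies(value: float, spec: dict) -> bool:
    """Test a measured value against a marker that carries a range."""
    lo, hi = spec["range"]
    inside = lo <= value <= hi  # inclusive bounds are an assumption
    return not inside if spec["negate"] else inside

torsion = find_measures("[CH3](.t:!-160,160)CC[CH3]")[0]
print(torsion, satisfies(175.0, torsion))
```

Markers without a range, such as the second and third (.a1) in the peptide example, come back with "range" set to None and serve only to tag the atoms that share the numbered measurement.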

Jmol bioSMILES adds the following primitives and options:

[residue.atomName#atomicNumber]	residue and atom name, with optional atomic number, for example [CYS.SG#16] or [ALA.CA]. "0" for atomName indicates the "lead" atom -- for nucleic acids the phosphorus atom (or in some cases a terminal oxygen or hydrogen atom), and for proteins the alpha carbon.
~...~...	bioSEQUENCE using single-letter or [RES] codes.
%(n)	ring branching where n may be larger than 99.

Jmol bioSMARTS adds the following primitives and options:

[.ATOMNAME], [RESIDUE.], [.]	Wildcards for residues and atom names
[RES.ATOMNAME]+[RES.ATOMNAME]	atoms in adjacent residues, for example [ALA.CA]+[GLY.N]
[RES.ATOMNAME]:[RES.ATOMNAME]	atoms in cross-linked residues, for example [CYS.CA]:[CYS.CA]
~...~...	bioSEQUENCE notation using single-letter or [RES] codes, including logic: select search("~A:[C,T]")

All primitives that are not element names, *, A, or a must be enclosed in brackets. In addition, the following elements must be enclosed in brackets because their two-letter combination Xy implies the non-aromatic element X attached to the aromatic element y: Ac, Ba, Ca, Na, Pa, and Sc.
Allows any order of bracketed primitives: [H2C13] same as [13CH2].
All atom and bond logic implemented: [X,!X,X&X,X&X&X;X&X]-,=X
"&" is optional: [13CH2] same as [13&C&H2] except in cases of ambiguity with element symbols: [Rh] is rhodium, not [R&h].
Jmol 3D-SEARCH does NOT implement:
- "zero-level parentheses", since the match is always only within a given model (but note that you can still use "." to indicate that the two search sections are not connected).
- "?" in atom stereochemistry ("chirality") because 3D structures are always defined stereochemically.
- "?" for bond stereochemistry, as 3D structures are always defined stereochemically.

Jmol atom selection

The general way within Jmol to select atoms based on SMARTS searches is to use select search("..."). To assign variables to the results of a search, use the find() function.
To select one or more atoms within the found pattern, simply enclose the desired atoms in { }: select search("{C}C=O"), for example, returns all alpha carbons, and select search("~d~G{C}A") returns all DNA cytidines that are in GCA sequences. For SMILES searches, all hydrogen atoms -- as in HCCC or [CH2] -- are selected. This includes all hydrogens needed to complete the "normal" valence of an unbracketed atom such that "CCC" is the same as "[CH3][CH2][CH3]".
For SMARTS searches, no valence calculation is done to add any additional hydrogens to unbracketed atoms. "CCC" is the same as "[C][C][C]". Only unbracketed or bracketed hydrogen atoms such as H[C]C or [H] or [2H] are selected; connected hydrogen atoms as in [CH3] are not selected.
For bioSMARTS searches, bioSEQUENCE single-letter codes match the lead atom only for each residue, thus giving a count of the groups found. If it is desired to select all atoms in the selected groups, use select within("group",search("...")).

implicit hydrogen count

The primitives h (implicit hydrogen count) and X (total connections, including implicit hydrogens) require analysis of bonding around a model atom to determine the number of missing ("implicit") hydrogen atoms based on a "target valence." Models that specify only "aromatic" or "partial" bonding may produce ambiguous results, and for that reason, primitives X and h are not recommended for use. Other primitives, such as D, d, and v should be more useful. The analysis Jmol uses here is the same as for how Jmol calculates the number of hydrogens to add for the calculate hydrogens command and includes:
1. Assign the target valence TV as follows:
  - For C and Si, TV = 4.
  - For B, N, and P, TV = 3.
  - For O and S, TV = 2.
  - For F, Cl, Br, and I, TV = 1.
  - For all other atoms, TV = 0.
2. Obtain the formal charge on the atom, C.
3. Group IV elements such as carbon are unique, in that their cations are valence-poor, not valence-rich. So for carbon and silicon, subtract the ABSOLUTE VALUE of C from the target valence. In all other cases, let TV = TV + C.
4. Determine the overall valence of the atom, OV. This is calculated by adding up all the bond orders to the atom.
5. Subtract OV from TV to get the number of implicit hydrogen atoms. If this number is less than zero, assign zero.
Thus, the implicit hydrogen count is:
- 0 for all atoms other than {B,C,N,O,P,Si,S,F,Cl,Br,I}
- 0 for BR3
- 0 for CR4, 1 for CR3, 2 for CR2, 3 for CR
- 0 for CR3(+), 0 for CR3(-)
- 0 for R=CR2, 1 for R=CR, 2 for R=C, 1 for C#R (triple bond)
- 0 for NR3, 1 for NR2, 2 for NR
- 0 for RN=R, 1 for R=N
- 1 for NR3(+), 1 for R=NR(+), 1 for RN(-)
- 0 for OR2, 0 for O=R, 1 for OR
- 0 for RO(-), 2 for RO(+)
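
The five steps above translate directly into a short function. The Python sketch below is an illustration only; the name `implicit_hydrogens` and its argument layout are hypothetical, bond orders are passed in as a plain list, and returning 0 immediately for elements with no target valence is an assumption made to agree with the first entry of the summary list.

```python
TARGET_VALENCE = {
    "C": 4, "Si": 4,
    "B": 3, "N": 3, "P": 3,
    "O": 2, "S": 2,
    "F": 1, "Cl": 1, "Br": 1, "I": 1,
}

def implicit_hydrogens(element: str, formal_charge: int, bond_orders) -> int:
    """Implicit hydrogen count from target valence, charge, and bond orders."""
    tv = TARGET_VALENCE.get(element, 0)          # step 1
    if tv == 0:
        return 0                                 # assumption: no target valence, no implicit H
    if element in ("C", "Si"):
        tv -= abs(formal_charge)                 # step 3, Group IV case
    else:
        tv += formal_charge                      # step 3, all other cases
    ov = sum(bond_orders)                        # step 4: overall valence
    return max(tv - ov, 0)                       # step 5: never negative

# A few entries from the summary list: CR3, CR3(+), NR3(+), RO(-), RO(+)
for args in [("C", 0, [1, 1, 1]), ("C", 1, [1, 1, 1]), ("N", 1, [1, 1, 1]),
             ("O", -1, [1]), ("O", 1, [1])]:
    print(args, implicit_hydrogens(*args))
```

With only "aromatic" or "partial" bond orders in a model, the sum in step 4 is not well defined, which is the ambiguity noted above for the h and X primitives.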

Detailed Jmol SMILES/bioSMILES Specification

 
      # note: prior to parsing, all white space is removed
       
   [smilesDef] == [preface] [smiles]
   [preface] == { [flagDefs] | NULL } 
   [flagDefs] == { [flagDef] | [flagDef] [flagDefs] }
   [flagDef] == "/" [processingFlags] "/"
   [processingFlags] == { [processingFlag] | [processingFlag] [processingFlags] }
   [processingFlag] == { "noAromatic" | "noStereo" } (case-insensitive)
      # note: the noAromatic flag indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo flag turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [smiles] == { [entity] | [entity] "." [smiles] }
   [entity] == { [bioSequence] | [molecularSequence] }
   [molecularSequence] == [node][connections] 
   [node] == { [atomExpression] | [connectionPointer] }

   [atomExpression] == { [unbracketedAtomType] 
                             | [bracketedExpression] }
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      
   [atomType] == { [validElementSymbol] | [aromaticType] }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == { "[" [atomPrimitives] "]" } 
   
   [atomPrimitives] == { [atom] | [atom] [atomModifiers] }
   [atom] == { [isotope] [atomType] | [atomType] } 
   [isotope] == [digits]
       # note -- isotope mass must come before the element symbol. 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [atomModifiers] == { [atomModifier] | [atomModifier] [atomModifiers] }
   [atomModifier] == { [charge] | [stereochemistry] | [H_Prop] }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
   
   [connectionPointer] == { "%" [digit][digit] | [digit] | "%(" [digits] ")"}
      # note: all connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bond
      #       for the second occurrence
      # note: Jmol bioSMARTS extends the possible number of rings to > 100 by 
      #       allowing %(n)

   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bond] [node] } [connections]
   [branch] == { "(" { [smiles] | [bond] [smiles] } ")" | "()" }
      # note: empty parentheses "()" are ignored in SMILES and bioSMILES
   [bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | NULL }
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]. However, "." can be used to clarify a
      #       structure that has "ring" bond notation:
      #       CC1CCC.C1CC   is a valid structure.
      # note: bioSEQUENCE uses ":" to indicate "cross-linked", which is the default for branches

   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "#" [atomicNumber] "]" 
                 | [bioResidueCode] } 
   [atomicNumber] == [digits]
   [bioResidueName] == { "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == {"C" | "CA" | "N" ... } (case-insensitive)
   [bioResidueCode] == { "A" | "R" | "G" ... } (case-sensitive)
      # note: In a BioSEQUENCE, residues are designated using standard 1-letter-code group names
      #       or bracketed residues [xxx] with optional atoms specified: [ARG], [CYS.SG].

Detailed Jmol 3D-SMARTS/bioSMARTS Specification

 

 ######## GENERAL ########

      # note: prior to parsing, all white space is removed

   [smartDef] == [preface] [smartsSet]
   [preface] == { [flagDefs] [variableDefs] | [variableDefs] | NULL } 
   [flagDefs] == { [flagDef] | [flagDef] [flagDefs] }
   [flagDef] == "/" [processingFlags] "/"
   [processingFlags] == { [processingFlag] | [processingFlag] [processingFlags] }
   [processingFlag] == { "noAromatic" | "noStereo" } (case-insensitive)
      # note: the noAromatic flag indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo flag turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [variableDefs] == { [variableDef] | [variableDef] [variableDefs] }
   [variableDef] ==  "$" [label] "=" "\"" [smarts] "\"" [comments] ";"
   [label] == [any characters other than "=" and "$", and not starting with "("]
   [comments] == [any characters other than ";"]
      # note: Variable definitions must be parsed first. 
      #       After that, all variable references [$XXXX] are replaced
      
   [smartsSet] == { [smarts] | [smarts] "||" [smartsSet] }
      # note: Jmol adds the "or" operation "||", for example: "C=O || C=N"
      #       which, in this case, could also be written as "C=[O,N]"
      #       Jmol preprocesses these sets, evaluates them independently, and then
      #       combines them.
      
   [smarts] == { [node3D] [connections] | [bioSequence] } 
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bondExpression] [node3D] } [connections]
   [branch] == { "(" { [smarts] | [bondExpression] [smarts] } ")" | "()" }
      # note: Default bonding for a branch is single for SMARTS or cross-linked (:) for bioSEQUENCE
      # note: "()" is ignored in SMARTS and indicates "not cross-linked" in bioSEQUENCE
   
 ######## ATOMS ########
    
   [node3D] == { [atomExpression] | [atomExpression] "(." [measure] ")" | [connectionPointer] }
   [atomExpression] == { [unbracketedAtomType]
                             | [bracketedExpression] 
                             | [multipleExpression]
                             | [nestedExpression] }
   
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      # note: in a bioSEQUENCE, all atom types are 1-letter code group names
      
   [atomType] == { [validElementSymbol] | "A" | [aromaticType] | "*" }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { "a" | [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == "[" { [atomOrSet] | [atomOrSet] ";" [atomAndSet] } "]" 
   
   [atomOrSet] == { [atomAndSet] | [atomAndSet] "," [atomAndSet] }
   [atomAndSet] == { [atomPrimitives] | [atomPrimitives] "&" [atomAndSet]
                              | "!" [atomPrimitive] 
                              | "!" [atomPrimitive] "&" [atomAndSet] }
                              
 ######## ATOM PRIMITIVES ########

   [atomPrimitives] == { [atomPrimitive] | [atomPrimitive] [atomPrimitives] }
       # note -- if & is not used, certain combinations of primitive descriptors
       #         are not allowed. Specifically, combinations that together
       #         form the symbol for an element will be read as the element (Ar, Rh, etc.)
       #         when NOT followed by a digit and no element has already been defined 
       #         So, for example, [Ar] is argon, [Ar3] is [A&r3], [ORh] is [O&R&h],  
       #         but [Ard2] is [Ar&d2] -- "argon with two non-hydrogen connections"
       #         Also, "!" may not be used with implied "&". 
       #         Thus, [!a], [!a&!h2], and [h2&!a] are all valid, but [!ah2] is invalid.             
   [atomPrimitive] == { [isotope] | [atomType] | [charge] | [stereochemistry]
                              | [a_Prop] | [A_Prop] | [D_Prop] | [d_Prop] | [H_Prop] | [h_Prop] 
                              | [R_Prop] | [r_Prop] | [v_Prop] | [X_Prop]
                              | [x_Prop] | [nestedExpression] }
   [isotope] == [digits] | [digits] "?"
       # note -- isotope mass may come before or after element symbol, 
       #         EXCEPT "H1" which must be parsed as "an atom with a single H" 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
       # note -- "?" here (unspecified) is not relevant in 3D-SEARCH 
   
   [A_Prop] == "#" [digits]           # elemental atomic number
   [a_Prop] == "=" [digits]           # atom index (starts with 0)
   [D_Prop] == { "D" [digits] | "D" } # degree -- total number of connections 
                                      #   excludes implicit H atoms; default 1
   [d_Prop] == { "d" [digits] | "d" } # degree -- non-hydrogen connections
                                      #   default 1 
   [H_Prop] == { "H" [digits] | "H" } # exact hydrogen count 
                                      #   excludes implicit H atoms
   [h_Prop] == { "h" [digits] | "h" } # implicit hydrogens -- "h" indicates "at least one"
                                      #   (see note below)
   [R_Prop] == { "R" [digits] | "R" } # ring membership; e.g. "R2" indicates "in two rings"
                                      #   "R" indicates "in a ring" 
                                      #   "!R" or "R0" indicates "not in any ring"
   [r_Prop] == { "r" [digits] | "r" } # in ring of size [digits]; "r" indicates "in a ring"
   [v_Prop] == { "v" [digits] | "v" } # valence -- total bond order (counting double as 2, e.g.)
   [X_Prop] == { "X" [digits] | "X" } # connectivity -- total number of connections
                                      #   includes implicit H atoms
   [x_Prop] == { "x" [digits] | "x" } # ring connectivity -- total ring connections
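
The [charge], [plusSet], and [minusSet] productions above can be read as a small parser. The following Python sketch is illustrative only and is not taken from org.jmol.smiles; the function name `parse_charge` is hypothetical, and it assumes the usual SMILES convention that a run of signs such as "++" means a charge of +2.

```python
import re

# Either a sign followed by digits ("+2", "-3") or a run of identical signs ("++", "---").
_CHARGE = re.compile(r"([+-])(\d+)|(\++)|(-+)")

def parse_charge(text):
    """Return the integer charge encoded by a [charge] token (hypothetical helper)."""
    m = _CHARGE.fullmatch(text)
    if m is None:
        raise ValueError("not a valid charge token: %r" % text)
    sign, digits, plus_run, minus_run = m.groups()
    if digits is not None:
        # "+" [digits] or "-" [digits]
        return int(digits) if sign == "+" else -int(digits)
    if plus_run is not None:
        # [plusSet]: one unit of positive charge per "+"
        return len(plus_run)
    # [minusSet]: one unit of negative charge per "-"
    return -len(minus_run)
```

Mixed runs such as "+-" are rejected, since no production in the grammar generates them.
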
   
 ######## Nested and Multiple Expressions ########
 
   [nestedExpression] == "$(" [atomExpression] ")"
      # note: nestedExpressions return only the first atom as a match, 
      #       not all atoms in the expression.

   [multipleExpression] == { "[$(" [orExpression] ")" [nTimes] "]" 
                             | "[$(" [orExpression] ")" [nMinimum] "-" [nMaximum] "]" 
                             | "[$(" [orExpression] "|" [orExpression] ")" "]" 
                             | "[$(" [orExpression] "||" [orExpression] ")" "]" }
   [orExpression] == { [atomExpression] 
                       | [atomExpression] "|" [orExpression] 
                       | [atomExpression] "||" [orExpression] }
      # note: "|" and "||" are synonymous in this inner context; "|" is preferred simply
      #       for readability (whereas "||" is required for the [smartsSet] context). 
      # note: This syntax is carefully written to exclude [$(xxx)] by itself, which
      #       is a nestedExpression, not a multipleExpression. The difference is that
      #       the nestedExpression only returns the first atom, while the multipleExpression
      #       returns all atoms. To return only the first atom within this context 
      #       it is necessary to use a nested expression within the multiple expression.
      #       For example: "CC[$( $(C=O) | $(C=N) )2]"
      #       is the same as "CC$(C=[O,N])$(C=[O,N])", although Jmol preprocesses it as
      #          "CC$(C=O)$(C=O)||CC$(C=O)$(C=N)||CC$(C=N)$(C=O)||CC$(C=N)$(C=N)"
      
   [nTimes] == [digits]
   [nMinimum] == [digits]
   [nMaximum] == [digits]
      # note: multipleExpressions allow for searching a given number of expressions or 
      #       a variable number of expressions (including 0, perhaps)
      #       Jmol pre-processes these expressions and turns them into a set:
      #       pattern1 || pattern2 || pattern3....
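
The pre-processing described in the notes above, which turns a multipleExpression into a "||"-separated set, can be sketched in Python as follows. This is not Jmol's code: the function name `expand_multiple` and its parameters are hypothetical, and the ordering of patterns for a min-max range is an assumption. For a fixed count of 2 it reproduces the expansion quoted above for "CC[$( $(C=O) | $(C=N) )2]".

```python
from itertools import product

def expand_multiple(prefix, alternatives, n_min, n_max, suffix=""):
    """Expand prefix + [$(alt1|alt2|...)n_min-n_max] + suffix into a '||'-joined set.

    alternatives is the list of patterns separated by "|" in the orExpression;
    n_min == n_max corresponds to the [nTimes] form.
    """
    patterns = []
    for n in range(n_min, n_max + 1):
        # Every ordered choice of n alternatives becomes one explicit pattern.
        for combo in product(alternatives, repeat=n):
            patterns.append(prefix + "".join(combo) + suffix)
    return "||".join(patterns)

print(expand_multiple("CC", ["$(C=O)", "$(C=N)"], 2, 2))
```

The number of generated patterns grows as the number of alternatives raised to the repeat count, so large ranges produce long sets.
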

 ######## BioSEQUENCE ########

   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] [A_Prop] "]" 
                 | [bioResidueCode] } 
   [bioResidueName] == { "*" | "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == { "*" | "0" | "C" | "CA" | "N" ... } (case-insensitive)
      # note: "0" indicates the "lead atom":
      #   nucleic: P if present, or H5T if present, or O5'/O5*
      #   protein: CA
      #   carbohydrate: the first atom of the group listed in the model file
   [bioResidueCode] == { "*" | "A" | "R" | "G" ... } (case-sensitive)
      # note: wildcard or standard group 1-letter-code
      #       or, in the case of RNA or DNA:
      #         "N" (any residue; same as "*"), 
      #         "R" (any purine -- A or G)
      #         "Y" (any pyrimidine -- C or T or U)

 ######## CONNECTIONS (aka "rings") ########

   [connectionPointer] == { [digit] | "%" [digit][digit] | "%(" [digits] ")" }
      # note: All connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bondExpression
      #       for the second occurrence. The matching connectionPointers may be
      #       in different "components" (separated by "."), in which case they
      #       represent general connections and not necessarily rings.
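
As an illustration of the three [connectionPointer] forms and of the pairing rule in the note above, here is a minimal Python sketch. The names `read_connection_pointer` and `unmatched_pointers` are hypothetical and are not part of org.jmol.smiles; the sketch only tokenizes a pointer at a given position and reports pointer numbers that were never closed.

```python
import re

# "%(" [digits] ")"  |  "%" [digit][digit]  |  [digit]
_POINTER = re.compile(r"%\((\d+)\)|%(\d\d)|(\d)")

def read_connection_pointer(text, pos=0):
    """Return (pointer_number, next_pos), or None if no pointer starts at pos."""
    m = _POINTER.match(text, pos)
    if m is None:
        return None
    # Exactly one of the three groups is set, depending on the form matched.
    number = int(next(g for g in m.groups() if g is not None))
    return number, m.end()

def unmatched_pointers(numbers):
    """Return the set of pointer numbers left open after reading them in order."""
    open_pointers = set()
    for n in numbers:
        # First occurrence opens the connection, second occurrence closes it.
        open_pointers ^= {n}
    return open_pointers
```

A pattern is well formed in this respect only if `unmatched_pointers` returns an empty set for the pointers collected across all of its components.
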

 ######## BONDS ########

   [bondExpression] == { [bondOrSet] | [bondOrSet] ";" [bondAndSet] } 
   
   [bondOrSet] == { [bondAndSet] | [bondAndSet] "," [bondAndSet] }
   [bondAndSet] == { [bondPrimitives] | [bondPrimitives] "&" [bondAndSet]
                              | "!" [bondPrimitive] 
                              | "!" [bondPrimitive] "&" [bondAndSet] }
                                              
 ######## BOND PRIMITIVES ########
                              
   [bondPrimitives] == { [bondPrimitive] | [bondPrimitive] [bondPrimitives] }       
   [bondPrimitive] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | "~" | "@" | "+" | "^" | NULL }
      # note: Not all bondExpressions are valid. Stereochemistry should not 
      #       be mixed with the others, as it represents a single bond always.
      #       In addition, "." ("no bond") cannot be mixed with any bond type.
      #       Nothing would be retrieved by "-&=", as a bond cannot be both single
      #       and double. However, "-@" is potentially very useful -- "ring single-bonds"
      #       or "=&!@" -- "doubly-bonded atoms where the double bond is not in a ring"
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]
      # note: "+" indicates "adjacent biomolecular groups in a chain"
      # note: a bioSEQUENCE ends with "." or the end of the string. A new bioSEQUENCE
      #       can continue with "~" immediately following this "." 
      # note: For a SMARTS search, "." indicates the start of a new subset, not necessarily a
      #       new component.
      # note: "^" indicates atropisomer bond with positive dihedral angle
      
 ######## MEASURES ########
   
   [measure] == { [measureId] | [measureId] ":" [range] | [measureId] ":!" [range] }
   [measureId] == { [measureCode] | [measureCode] [digits] }
   [measureCode] == { "d" | "a" | "t" }
   [range] == [minimumValue] { "," | "-" } [maximumValue]
   [minimumValue] == [decimalNumber]
   [maximumValue] == [decimalNumber]
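
The [measure] production can be illustrated with a small Python parser. This is a sketch under stated assumptions, not Jmol's implementation: the names `Measure` and `parse_measure` are hypothetical, and because [decimalNumber] is not defined in this section, the sketch assumes an optionally signed decimal such as "1.5", ".5", or "-60".

```python
import re
from typing import NamedTuple, Optional

class Measure(NamedTuple):
    code: str                 # "d" (distance), "a" (angle), or "t" (torsion)
    index: Optional[int]      # optional [digits] following the code
    negate: bool              # True for the ":!" form
    minimum: Optional[float]  # None when no range is given
    maximum: Optional[float]

# Assumed shape of [decimalNumber]; not specified in the grammar above.
_NUM = r"-?\d*\.?\d+"
_MEASURE = re.compile(rf"([dat])(\d*)(?::(!?)({_NUM})[,-]({_NUM}))?")

def parse_measure(text):
    """Parse a [measure] token into its parts (hypothetical helper)."""
    m = _MEASURE.fullmatch(text)
    if m is None:
        raise ValueError("not a valid measure: %r" % text)
    code, digits, bang, lo, hi = m.groups()
    return Measure(
        code,
        int(digits) if digits else None,
        bang == "!",
        float(lo) if lo is not None else None,
        float(hi) if hi is not None else None,
    )
```

For example, `parse_measure("a2:!100,120")` yields an angle measure with index 2 whose range 100 to 120 is negated.
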

Jmol 3D-SEARCH Definition of "aromatic"

We define "aromatic" here strictly in terms of geometry - a flat ring with trigonal planar geometry for all atoms in the ring. No consideration of bond order is used, because for the sorts of models that can be loaded into Jmol, many do not assume a bonding scheme (PDB, GAUSSIAN, etc.).

Given a ring of N atoms...

                  1
                /   \
               2     6 -- 6a
               |     |
         5a -- 5     4
                \   /
                  3

with arbitrary order and up to N substituents...

Check to see if all ring atoms have no more than 3 connections. Note: An alternative definition might include "and no substituent is explicitly double-bonded to its ring atom," as in quinone. Here we opt to allow the atoms of quinone to be called "aromatic."
Select a cutoff value close to zero. We use 0.01 here.
Generate a set of normals as follows:
1. For each ring atom, construct the normal associated with the plane formed by that ring atom and its two nearest ring-atom neighbors.
2. For each ring atom with a substituent, construct a normal associated with the plane formed by its connecting substituent atom and the two nearest ring-atom neighbors.
3. If this is the first normal, assign vMean to it.
4. If this is not the first normal, check vNorm.dot.vMean. If this value is less than zero, scale vNorm by -1.
5. Add vNorm to vMean.
Calculate the standard deviation of the dot products of the individual vNorms with the normalized vMean.
The ring is deemed flat if this standard deviation is less than the selected cutoff value.
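
The flatness test described above can be sketched in Python as follows. This is an illustration, not Jmol's code: the function name `is_flat_ring` and its arguments are hypothetical, the ring atoms are assumed to be supplied in ring order so that each atom's two nearest ring neighbors are its predecessor and successor, and the population standard deviation is assumed. The check that no ring atom has more than 3 connections is not included, and collinear points are not handled.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    length = math.sqrt(_dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)

def _plane_normal(p, q, r):
    """Unit normal of the plane through points p, q, r."""
    return _unit(_cross(_sub(q, p), _sub(r, p)))

def is_flat_ring(ring, substituents=None, cutoff=0.01):
    """ring: list of (x, y, z) in ring order.
    substituents: dict mapping ring index -> (x, y, z) of the attached atom.
    """
    substituents = substituents or {}
    n = len(ring)
    normals = []
    v_mean = None
    for i, atom in enumerate(ring):
        prev_atom, next_atom = ring[i - 1], ring[(i + 1) % n]
        # Step 1 uses the ring atom itself; step 2 uses its substituent, if any.
        centers = [atom]
        if i in substituents:
            centers.append(substituents[i])
        for center in centers:
            v_norm = _plane_normal(prev_atom, center, next_atom)
            if v_mean is None:
                v_mean = v_norm                       # step 3
            else:
                if _dot(v_norm, v_mean) < 0:          # step 4
                    v_norm = (-v_norm[0], -v_norm[1], -v_norm[2])
                v_mean = (v_mean[0] + v_norm[0],      # step 5
                          v_mean[1] + v_norm[1],
                          v_mean[2] + v_norm[2])
            normals.append(v_norm)
    v_mean = _unit(v_mean)
    dots = [_dot(v, v_mean) for v in normals]
    mean = sum(dots) / len(dots)
    std = math.sqrt(sum((d - mean) ** 2 for d in dots) / len(dots))
    return std < cutoff
```

A regular planar hexagon passes this test, while lifting a single ring atom or a single substituent well out of the plane makes it fail.
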

-- Bob Hanson last updated 6/12/2010