Harnessing Python for Computational Chemistry: From SMILES to 3D
Chapter 1: Understanding SMILES in Computational Chemistry
When chemists mention "SMILES", they are not referring to the joyful expressions captured in photographs but to a notation that is central to computational chemistry, a field that uses computational tools for intricate chemical analyses such as drug design. With advances like AlphaFold and quantum computing, computational chemistry is experiencing a resurgence.
SMILES, short for Simplified Molecular Input Line Entry System, is a character-based notation that describes the structure of a chemical compound as a concise ASCII string. Many molecular editors can interpret SMILES strings and convert them into two-dimensional drawings, even though real molecules are three-dimensional. Because a molecule becomes a string, machine learning techniques built on string manipulation and natural language processing can be applied to it directly. Two examples of SMILES strings are CC(=O)OC1=CC=CC=C1C(=O)O (aspirin) and CNC[C@@H](C1=CC(=C(C=C1)O)O)O (epinephrine), where the @@ marker encodes the stereochemistry of a chiral center.
As mentioned, most editors, such as ChemDraw, render SMILES notation as 2D drawings. However, a SMILES string encodes the atoms, their connectivity, and their stereochemistry, which is enough for software to generate a plausible 3D model. Building that 3D model is the primary focus of this article.
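To make the string-processing angle concrete, here is a minimal sketch of a SMILES tokenizer using only the Python standard library. The regular expression and the function name `tokenize_smiles` are assumptions made for illustration; the pattern covers common SMILES symbols but is not a complete grammar.

```python
import re
from typing import List

# Assumed token pattern (illustrative, not a full SMILES grammar):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# ring-closure labels, then bond, branch, and miscellaneous symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-+()/\\.:~*$@]"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens suitable for NLP-style models."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If any character was skipped, the tokens will not rebuild the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
    print(tokenize_smiles(aspirin)[:6])  # ['C', 'C', '(', '=', 'O', ')']
```

Keeping bracket atoms such as `[C@@H]` as single tokens preserves the stereochemistry marker, which a character-by-character split would break apart.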
Section 1.1: Converting SMILES to .mol Files
Two prevalent 3D file formats are .pdb (Protein Data Bank) and the simpler .mol. Using the Python library RDKit alongside the visualization tool PyMOL, we will explore the .mol file format. First, we will convert SMILES notation into a .mol file, then examine the contents of a .mol file, and finally visualize the result with PyMOL.
The following code snippet imports the required modules from RDKit, builds the molecule from its SMILES string, generates 3D coordinates, saves a 2D depiction as an image, and finally writes the molecule to a .mol file.
The crucial function is "Chem.MolFromSmiles()", which parses a SMILES string into an RDKit molecule object.
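from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole  # optional: inline rendering in Jupyter notebooks

# Parse the SMILES string (MolFromSmiles returns None if the string is invalid)
my_mol = Chem.MolFromSmiles('NC(=N)N1CCC[C@H]1Cc2onc(n2)c3ccc(Nc4nc(cs4)c5ccc(Br)cc5)cc3')
if my_mol is None:
    raise ValueError('RDKit could not parse the SMILES string')

# Add explicit hydrogens, generate 3D coordinates, and relax them with the MMFF force field
my_mol_with_H = Chem.AddHs(my_mol)
if AllChem.EmbedMolecule(my_mol_with_H) == -1:
    raise RuntimeError('3D embedding failed')
AllChem.MMFFOptimizeMolecule(my_mol_with_H)
my_embedded_mol = Chem.RemoveHs(my_mol_with_H)

# Save the 2D depiction of the molecule as an image
Draw.MolToFile(my_mol, 'molecule.png')

# Save the 3D structure as a single-molecule .mol file
# (SDWriter would append the SDF record separator '$$$$' to the file)
Chem.MolToMolFile(my_embedded_mol, './charged_test.mol')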
Because RDKit is a Python library, it can be combined directly with machine learning, natural language processing, and other AI frameworks written in Python.
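Since the conversion is plain Python, it can be wrapped in a small function and reused inside a larger workflow. The sketch below is one possible way to do that, not a definitive implementation: the function name `smiles_to_molblock` and the fixed `randomSeed=42` (chosen so that repeated runs give the same conformer) are assumptions made for illustration, and the code requires RDKit to be installed.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_molblock(smiles: str, seed: int = 42) -> str:
    """Return a 3D .mol block (as text) for a SMILES string.

    Hypothetical helper: the name and the default seed are illustrative.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit signals a parse failure by returning None
        raise ValueError(f"Invalid SMILES: {smiles!r}")

    mol_h = Chem.AddHs(mol)  # explicit hydrogens help the 3D embedding
    if AllChem.EmbedMolecule(mol_h, randomSeed=seed) == -1:
        raise RuntimeError(f"3D embedding failed for {smiles!r}")
    AllChem.MMFFOptimizeMolecule(mol_h)  # force-field relaxation

    # Drop the hydrogens again and serialize to the .mol text format.
    return Chem.MolToMolBlock(Chem.RemoveHs(mol_h))


if __name__ == "__main__":
    print(smiles_to_molblock("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
```

Returning the .mol block as a string, instead of writing a file, keeps the function easy to call from a data pipeline; the text can still be saved to disk afterwards.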
Subsection 1.1.1: Anatomy of a .mol File
A .mol file is an ASCII text file whose contents are arranged in whitespace-padded, fixed-width columns. It includes:
- A list of atoms, each defined by its elemental identity.
- A list of bonds indicating which atoms are connected and the bond type (single, double, triple).
- 2D or 3D spatial coordinates for every atom.
- The total number of atoms and bonds in the molecule.
- Attributes associated with individual atoms or bonds.
- Attributes relevant to the entire structure.
These elements are organized in blocks of columns. A counts line gives the number of atoms and bonds. In the atom block that follows, the first three columns hold the x, y, and z coordinates of each atom, followed by the atomic symbol and attribute fields. The bond block then lists, for each bond, the indices of the two connected atoms and the bond type. For further details on this structure, refer to "Anatomy of a MOL file" by Robert Belford.
The resulting .mol file, generated through RDKit, can then be opened and visualized in PyMOL.
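The layout described above can be sketched as a small parser written with only the standard library. This is a simplified illustration that assumes the common V2000 flavor of the format and reads only the counts line, the atom block, and the bond block. The sample block is hand-written for this example (its coordinates are illustrative, not RDKit output), and the names `Atom` and `parse_molblock` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Atom(NamedTuple):
    symbol: str
    x: float
    y: float
    z: float


# Hand-written V2000 block for ethanol's heavy atoms (illustrative values).
SAMPLE_MOLBLOCK = "\n".join([
    "ethanol (illustrative)",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "   -0.8883    0.1670   -0.0273 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.4658   -0.5116   -0.0368 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4311    0.3229    0.5867 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])


def parse_molblock(text: str) -> Tuple[List[Atom], List[Tuple[int, int, int]]]:
    """Parse atoms and bonds from a V2000 .mol block using fixed-width fields."""
    lines = text.splitlines()
    counts = lines[3]  # three header lines come first, then the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = [
        Atom(line[31:34].strip(), float(line[0:10]), float(line[10:20]), float(line[20:30]))
        for line in lines[4:4 + n_atoms]
    ]
    # Each bond line: first atom index, second atom index, bond type (1-based indices).
    bonds = [
        (int(line[0:3]), int(line[3:6]), int(line[6:9]))
        for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]
    ]
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = parse_molblock(SAMPLE_MOLBLOCK)
    print([a.symbol for a in atoms])  # ['C', 'C', 'O']
    print(bonds)                      # [(1, 2, 1), (2, 3, 1)]
```

Slicing by column position, instead of splitting on whitespace, matters because adjacent fields can run together when a value fills its full width.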
Chapter 2: Practical Applications in Molecular Visualization
This video titled "Python for Computational Chemistry - Beginners Tutorials - Introduction" offers a fundamental overview for those new to using Python in computational chemistry.
The next video, "Quantum Chemistry Calculations with Python: S2 - DFT Basics," delves deeper into quantum chemistry calculations, providing a solid foundation for understanding DFT (Density Functional Theory).
In conclusion, this article has shown a straightforward way to generate a 3D molecular structure from SMILES notation using a Python script and a visualization tool. This demonstrates the ongoing relevance of SMILES notation, even though it is not easy for humans to read. Graphical representations of drugs can also be used in machine learning, although orienting the molecule consistently in three-dimensional space is a challenge. SMILES strings, by contrast, can take direct advantage of advances in natural language processing, as shown in recent studies (for example, Zhang et al., 2020).
References
- PyMOL (pymol.org)
- RDKit
- ChemDraw
- Anatomy of a MOL file
- Zhang et al., Frontiers in Chemistry 2020, "SPVec: A Word2vec-Inspired Feature Representation Method for Drug-Target Interaction Prediction"
- Anatomy of PDB file
Connect with me on:
- Twitter: @Dr_Alex_Crimi
- Facebook: Alessandro Crimi