rdcanon is a package designed for canonicalizing SMARTS and Reaction SMARTS templates. It reorders SMARTS to optimize querying speed. This optimization is invariant of atom mapping.
- Ensure you have rdkit installed (version > 2023.9.2).
- The following packages will be installed: 'rdkit > 2023.09.1', 'matplotlib', 'lark', 'numpy', 'networkx', 'scikit-learn', (optional, for kde generation) 'ipykernel', 'pandas', 'openpyxl'
- Create or activate a virtual environment.
- Clone the repository.
- Install the package with the command:
pip install -e rdcanon
To sanitize individual SMARTS:
from rdcanon import canon_smarts
test_smarts = [
"[$([NX3H,NX4H2 ]),$([NX3](C)(C)(C))]1[CX4H]([CH2][CH2][CH2]1)[CX3](=[OX1])[OX2H,OX1-,N]",
"[$([NX3H2,NX4H3 ]),$([NX3H](C)(C))][CX4H]([*])[CX3](=[OX1])[OX2H,OX1-,N]",
"[CX3](=O)[OX1H0-,OX2H1]",
"[CX3](=[OX1])[OX2][CX3](=[OX1])",
"[N&H2& 0:4]-[C&H1& 0:2](-[C&H2& 0:8])-[O&H1& 0:3]"
]
# The second parameter is optional and flags whether atom mapping should be returned (defaults to False)
for smarts in test_smarts:
print(smarts, canon_smarts(smarts), canon_smarts(smarts, True))
For sanitizing reaction SMARTS:
from rdcanon import canon_reaction_smarts
To run all unit tests:
python rdcanon_tests.py
note: currently, "TestRecursive.test_validate_recursive_against_database" is expected to fail at: "[#1]C&X3=[O&X1] [#1,#6][C&X3;H1,H2]=[O&X1]". This is due to a strange edge case with merging query Hs, explicit hydrogens, and explicit/implicit connections.
No consolidation or expansion of atomic queries is performed automatically, but a mechanism is provided to allow the user to systematically replace canonicalized atomic queries with an input dictionary (e.g., {"[O;H1]": "[O;H1; 0]"} would replace the canonicalized variant of [O;H1] with the canonicalized variant of [O;H1; 0]).
Replacement dictionaries should be processed first using
canon_repl_dict = gen_canon_repl_dict(repl_dict)
before passing as an argument into canon_smarts.
Chirality or directionality beyond tetrahedral centers and cis/trans isomerism is not currently supported.
All data can be found in the manuscripts/data directory.
To create the bar charts of Figure 1, use the notebook within the manuscript directory named "prim_frequencies.ipynb".
To run the subgraph isomorphism experiments of Figure 3, use the notebook within the manuscript directory named "generate_plots_substruct_match.ipynb".
To run the template application experiments of Figure 4, use the notebook within the manuscript directory named "gen_plots_run_reactants_20240104.ipynb".
To run the retrosynthetic analysis experiments of Figure 5, use the notebook within the manuscript directory named "gen_retrosim_plots_20240105.ipynb".
The main workflow consists of two files, main.py and token_parser.py. The main.py file calls token_parser.py to parse and score atomic queries.
The files askcos_prims.py, drugbank_prims_with_nots.py, np_prims.py, and pubchem_prims.py are 4 query primitive frequency dictonaries, which are used for embedding leaf nodes in query trees.
The rdcanon_tests.py file contains all of the test cases using the abseil interface.
Finally, utils.py contains some helper functions for testing and plotting.