Commit 366c39e1 authored Jun 24, 2021 by Niklas Hölter, committed by GitHub Jun 24, 2021

Fix: Tokenizer is not able to encode triple bonds

Hi everyone,

I found another minor bug in DeepChem's SMILES tokenizer. While tokenizing my dataset, I observed that the triple bond token `#` was not tokenized and was instead silently dropped by the SmilesTokenizer as well as the BasicSmilesTokenizer. I think this error occurred because of the regex pattern, where a line break is placed directly in front of the `#`: since the pattern is a triple-quoted raw string, the line break becomes part of that alternative, which then only matches a newline followed by `#`. Removing the line break fixed it for me in a local copy of DeepChem.
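The effect can be reproduced outside DeepChem with a minimal sketch. It assumes the tokenizer extracts tokens by applying the pattern with `re.findall`; the names `BROKEN`, `FIXED`, and `tokenize` are hypothetical and only used for this illustration.

```python
import re

# Pattern before the fix: the triple-quoted raw string keeps the line break,
# so the alternative is "\n#" (newline followed by '#') rather than "#".
BROKEN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|
#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""

# Pattern after the fix: everything on one line, so "#" is its own alternative.
FIXED = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""


def tokenize(pattern, smiles):
    # Hypothetical helper (assumption): collect every non-overlapping match.
    # Characters that match no alternative are skipped without an error.
    return re.findall(pattern, smiles)


smiles = "CC#N"  # acetonitrile, contains a triple bond
broken_tokens = tokenize(BROKEN, smiles)
fixed_tokens = tokenize(FIXED, smiles)

print(broken_tokens)  # ['C', 'C', 'N']
print(fixed_tokens)   # ['C', 'C', '#', 'N']
```

Because `re.findall` skips unmatched characters without raising, the missing `#` goes unnoticed unless the joined tokens are compared against the input string.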

parent eab6ee63

deepchem/feat/smiles_tokenizer.py

+1 −2

		@@ -22,8 +22,7 @@ References
		1572-1583 DOI: 10.1021/acscentsci.9b00576
		"""

		-SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|
		-#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""
		+SMI_REGEX_PATTERN = r"""(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"""

		# add vocab_file dict
		VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
