Automatic identification of NBOMe illicit psychoactive substances based on combined molecular descriptors

During the last decade, law enforcement agencies have noticed a growing prevalence of new psychoactive substances (NPS). Although NPS have no medical use due to their very high toxicity, they are often sold on the black market. NBOMe designates a group of toxic amphetamines whose parent compound is 25I-NBOMe, a synthetic derivative of 2C-I (2,5-dimethoxy-4-iodophenethylamine). In this paper, we present a series of Artificial Neural Networks (ANNs) designed to identify NBOMe class membership based on a mixture of topological and 3D-MoRSE descriptors. For this purpose, the molecular structures of 160 compounds representing NBOMe compounds, narcotics, sympathomimetic amines, potent analgesics, as well as their main precursors were first optimized. A molecular database was then formed by computing a large number of topological and 3D-MoRSE descriptors that characterize these structures. This database was used as input for building an ANN system designed to recognize NBOMes. The relevance of the input variables to its classification performance was assessed, and new systems were built by using different combinations of selected topological and 3D-MoRSE descriptors. The best performing system was identified by comparing various classification efficiency criteria.


Introduction
In recent years, a class of potent synthetic psychoactive substances, referred to as dimethoxyphenyl-N-[(2-methoxyphenyl)methyl]ethanamine derivatives (NBOMes), has been increasingly detected by law enforcement institutions in seizures of controlled substances. These amphetamines, abused for recreational purposes, were initially synthesized for mapping the serotonin 2A receptor in the brain. They are very toxic and hence have no medical use [1].
The most frequently seized NBOMe compound is 25I-NBOMe (2-(4-iodo-2,5-dimethoxyphenyl)-N-[(2-methoxyphenyl)methyl]ethanamine) (see Figure 1). This compound is a derivative of 2C-I (2,5-dimethoxy-4-iodophenethylamine) and is sold by vendors of designer drugs under a variety of names, such as N-bomb, N-bome, Smiles, 25I, Bom-25 or Cimbi-5. 25I-NBOMe is very potent, being active in sub-milligram doses; hence, the risk of overdose is very high. Nevertheless, due to its potency and much lower cost than classical psychedelics, it is becoming increasingly popular at rave parties [2]. Clinical toxicology studies indicate that consumption of 25I-NBOMe usually generates confusion, panic and anxiety, visual and auditory hallucinations, increased heart rate and blood pressure, thought loops, vasoconstriction, nausea, acute kidney injury or oxygen desaturation [3][4][5][6][7].
Artificial neural networks (ANNs) are statistical learning algorithms considered very useful machine learning tools in bioinformatics [8][9]. They simulate the structure and functionality of biological neural networks and have been successfully used for identifying patterns and for classification purposes. ANNs are trained by using algorithms based on optimization theory and statistical estimation. The derivative of a cost function with respect to the network parameters (the gradient) is computed with the backpropagation method; a gradient descent method then modifies the parameters in the direction opposite to the gradient, so that the cost decreases step by step until the best-performing network is obtained [10][11][12]. ANNs are therefore well suited to analyzing databases that are very large or contain incomplete information [13][14][15].
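The training principle described above can be sketched in a few lines. The example below is a minimal illustration, not the software used in this study: it assumes a three-layer sigmoid network without bias terms, a squared-error cost, a fixed learning rate and synthetic data, all of which are hypothetical choices made only for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, Y, W1, W2, lr=0.5):
    """One backpropagation + gradient-descent update (assumed: no biases)."""
    H = sigmoid(X @ W1)                      # hidden-layer activations
    O = sigmoid(H @ W2)                      # output-layer activations
    err = O - Y
    cost = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    # Backpropagation: derivative of the cost w.r.t. each layer's pre-activation
    dO = err * O * (1.0 - O) / len(X)
    dH = (dO @ W2.T) * H * (1.0 - H)
    # Gradient descent: move the weights against the gradient
    W2_new = W2 - lr * (H.T @ dO)
    W1_new = W1 - lr * (X.T @ dH)
    return W1_new, W2_new, cost

# Synthetic two-class data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
positive = (X[:, :1] > 0).astype(float)
Y = np.hstack([positive, 1.0 - positive])    # two output nodes

W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(3, 2))
costs = []
for _ in range(200):
    W1, W2, cost = gradient_step(X, Y, W1, W2)
    costs.append(cost)
```

The list `costs` records the cost before each update, so comparing its first and last entries shows how far the descent has reduced the error.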
In this paper, we present and compare the efficiency of a series of Artificial Neural Networks (ANNs) designed to recognize NBOMe drugs of abuse based on a mixture of topological and 3D-MoRSE descriptors. These molecular descriptors are the result of a mathematical procedure based on graph theory, which converts a symbolic (2- or 3-dimensional Euclidean) representation of a molecule into a numeric value, i.e. a theoretical descriptor. As these molecular descriptors contain important information about the associated molecular structure, they are very appropriate for describing and classifying substances, as well as for determining correlations between molecular structures and physicochemical or biological properties.
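As an illustration of how a graph representation of a molecule is converted into a single number, the sketch below computes the Wiener index (the sum of topological distances over all pairs of non-hydrogen atoms). The Wiener index is used here only as a familiar example of a topological descriptor; this paper does not state which individual topological descriptors were selected, and the adjacency lists below are hand-written for the illustration.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered pairs of atoms."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives topological distances from `start`
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2                         # each pair was counted twice

# Hydrogen-suppressed graphs of two C4H10 isomers
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))   # 10 9
```

The two isomers have the same atoms but different branching, and the descriptor value reflects that difference.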

Database and methods
The input database was formed with molecular descriptors calculated for 160 controlled substances, which were divided into a class of positives (referred to as NBOMe) and a class of negatives (non-NBOMe). The class of positives contains 15 NBOMe hallucinogens and the class of negatives includes 145 compounds of forensic interest, such as narcotics, sympathomimetic amines, analgesics and their main precursors.
The 3D representation of the molecular structures was obtained with the HyperChem 8.03 software [16] for all 160 compounds forming the input database. The AM1 semi-empirical quantum method was applied for the full optimization of their geometries. The geometry was adjusted with the Polak-Ribière conjugate gradient algorithm until the parameters corresponding to the minimum energy of the molecular system were found.
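For readers unfamiliar with the Polak-Ribière scheme, the sketch below shows the idea on a toy two-variable energy surface. It is not the HyperChem implementation and does not involve AM1: the energy function, the backtracking line search and all parameter values are assumptions made for the illustration.

```python
import numpy as np

def polak_ribiere_minimize(energy, gradient, x0, tol=1e-6, max_iter=500):
    """Nonlinear conjugate gradient with the Polak-Ribiere (PR+) update."""
    x = np.asarray(x0, dtype=float)
    g = gradient(x)
    d = -g                                    # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if g @ d >= 0:                        # restart if d is not a descent direction
            d = -g
        # Backtracking (Armijo) line search along d
        step, e0, slope = 1.0, energy(x), g @ d
        while energy(x + step * d) > e0 + 1e-4 * step * slope and step > 1e-12:
            step *= 0.5
        x_new = x + step * d
        g_new = gradient(x_new)
        # Polak-Ribiere coefficient, clipped at zero (PR+ variant)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        x, g = x_new, g_new
    return x

# Toy "energy surface" with its minimum at (1, -2); hypothetical, not a molecule
def toy_energy(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def toy_gradient(x):
    return np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

x_min = polak_ribiere_minimize(toy_energy, toy_gradient, [0.0, 0.0])
```

In a real geometry optimization the variables would be the atomic coordinates and the energy and gradient would come from the quantum chemical method.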
For each of the 160 compounds, 96 topological descriptors and 80 3D-MoRSE descriptors were computed by using the Dragon 5.5 software [18]. They have been presented in detail in a previously published article [19]. The definitions, mathematical formulas and chemical significance of the topological descriptors and 3D-MoRSE (3D-Molecule Representation of Structures based on Electron diffraction) descriptors are detailed in Todeschini et al. [17].
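A 3D-MoRSE signal is commonly written as a sum over atom pairs of w_i w_j sin(s r_ij) / (s r_ij), where r_ij is the interatomic distance, s is the scattering parameter and w is an atomic weighting property (1 for the unweighted descriptors). The sketch below implements that general formula only; it is not Dragon's code, the coordinates are made up, and any mapping between a signal number and the value of s (for example, signal k taken as s = k - 1 Å⁻¹) should be treated as an assumption to be checked against [17].

```python
import numpy as np

def morse_signal(coords, weights, s):
    """3D-MoRSE-style signal: sum over pairs of w_i * w_j * sin(s*r)/(s*r)."""
    coords = np.asarray(coords, dtype=float)
    weights = np.asarray(weights, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)          # all atom pairs i < j
    r = np.linalg.norm(coords[i] - coords[j], axis=1)
    # np.sinc(x) = sin(pi*x)/(pi*x), so sin(s*r)/(s*r) = np.sinc(s*r/pi);
    # this also gives the correct limit of 1 when s = 0.
    return float(np.sum(weights[i] * weights[j] * np.sinc(s * r / np.pi)))

# Hypothetical three-atom geometry (angstroms), unweighted scheme
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]]
unit_weights = [1.0, 1.0, 1.0]

signal_s0 = morse_signal(coords, unit_weights, s=0.0)   # number of atom pairs
signal_s5 = morse_signal(coords, unit_weights, s=5.0)
```

Replacing `unit_weights` with atomic polarizabilities or van der Waals volumes would give the weighted variants of the signal.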
A first ANN, named 176_topo+3D_ANN, was built with all 176 molecular descriptors (96 topological descriptors and 80 3D-MoRSE descriptors) by using the Easy NN plus software. The system has three layers (the input, hidden and output layers) and uses the sigmoid function as transfer function. The training set consists of 8 NBOMe amphetamines and 17 non-NBOMe compounds, while the validation set is formed by the remaining 135 substances. The output layer has two output nodes, i.e. NBOMe (positives) and non-NBOMe (negatives). The system was trained by using the backpropagation algorithm, and the training process was considered converged when the average training error fell below the target error of 0.01. Full cross-validation was performed based on the leave-one-out method. The resulting architecture of 176_topo+3D_ANN consists of 13 hidden nodes and 2314 weight connections.
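The reported number of weight connections can be checked with simple arithmetic. The sketch below assumes a fully connected three-layer network in which bias terms are not counted as connections; the function name is hypothetical.

```python
def weight_connections(n_inputs, n_hidden, n_outputs=2):
    """Input-to-hidden plus hidden-to-output weights (biases not counted)."""
    return n_inputs * n_hidden + n_hidden * n_outputs

# 176 descriptors, 13 hidden nodes, 2 output nodes
print(weight_connections(176, 13))   # 2314
```

Under the same assumption, the function also reproduces the 672 and 585 connections reported for the two reduced networks (54 inputs with 12 hidden nodes, and 43 inputs with 13 hidden nodes).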
This initial system was used to evaluate the relative importance of each input descriptor, i.e. its influence on the next layer in the network. A new network, named 54_topo+3D_imp_ANN, was then built by including in the database only the 54 most important descriptors, while using the same sets of compounds and mathematical procedures as for 176_topo+3D_ANN. After optimization, this system has 12 hidden nodes and 672 weight connections.
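One common way to quantify the influence of an input on the next layer is to sum the absolute values of its weights to all hidden nodes. The sketch below follows that definition as an assumption; the exact formula used by the software is not given in this paper, and the weight matrix and descriptor names are made up.

```python
import numpy as np

def relative_importance(w_input_hidden):
    """Sum of |weights| from each input to the hidden layer, scaled to max = 1."""
    importance = np.abs(w_input_hidden).sum(axis=1)
    return importance / importance.max()

def top_k(names, scores, k):
    """Names of the k inputs with the highest scores, in descending order."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]

# Hypothetical 3-input, 2-hidden-node weight matrix (rows = inputs)
weights = np.array([[0.1, -0.2],
                    [1.0, -1.0],
                    [0.5,  0.0]])
names = ["d1", "d2", "d3"]                    # placeholder descriptor names

scores = relative_importance(weights)
selected = top_k(names, scores, k=2)          # ['d2', 'd3']
```

Applied to a trained 176-input network with k = 54, the same ranking step would yield the reduced descriptor set used to build the second network.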
176_topo+3D_ANN was also used to evaluate the sensitivity of the input descriptors, a parameter that indicates how the outputs change when the inputs are modified. A third system, 43_topo+3D_senz_ANN, was then built with the 43 most sensitive descriptors. After optimization, this system has 13 hidden nodes and 585 weight connections.
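A sensitivity analysis of this kind can be sketched as a one-at-a-time sweep: all inputs are held at their median values, one input is varied across its observed range, and the resulting change in the output is recorded. This procedure is an assumption made for the illustration, as are the toy model and the data; the paper does not give the exact formula used.

```python
import numpy as np

def relative_sensitivity(predict, X, n_steps=11):
    """Output range when each input is swept from its min to its max,
    with the other inputs fixed at their medians (scaled to max = 1)."""
    X = np.asarray(X, dtype=float)
    baseline = np.median(X, axis=0)
    raw = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        probe = np.tile(baseline, (n_steps, 1))
        probe[:, j] = np.linspace(X[:, j].min(), X[:, j].max(), n_steps)
        out = predict(probe)                  # one output value per probe row
        raw[j] = out.max() - out.min()
    return raw / raw.max()                    # assumes at least one input matters

# Toy stand-in for a trained network: output depends strongly on input 0
def toy_model(X):
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] + 0.1 * X[:, 1])))

rng = np.random.default_rng(1)
X_demo = rng.uniform(-1.0, 1.0, size=(50, 2))
sens = relative_sensitivity(toy_model, X_demo)
```

In this toy case the first input receives a relative sensitivity of 1 and the second a much smaller value, which is the kind of ranking used to select the most sensitive descriptors.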

Results and discussion
The normalized and relative errors obtained during the training of 176_topo+3D_ANN are presented in Figure 2. The training process ended after 8 cycles. The 30 most important descriptors, as determined with 176_topo+3D_ANN, are listed in Figure 3, in descending order. They include mostly 3D-MoRSE descriptors, the most important being Mor18p (signal 18 / weighted by polarizability), Mor10u (signal 10 / unweighted) and Mor18v (signal 18 / weighted by van der Waals volume). In fact, the 11 most important variables are all 3D-MoRSE descriptors. A very positive aspect, however, is that the relative importance decreases very slowly. The relative sensitivity (Fig. 4) decreases much faster than the relative importance. Hence, a smaller number of 3D-MoRSE and topological descriptors are worth selecting for building a new ANN system with the most sensitive descriptors.
Fig. 4. The first 30 molecular descriptors found to have the highest relative sensitivity by analyzing the 176_topo+3D_ANN system.
In the case of the 54_topo+3D_imp_ANN network (which includes 13 topological descriptors and 31 3D-MoRSE descriptors), 18 learning cycles were needed to end the training process, while 43_topo+3D_senz_ANN (built with 13 topological descriptors and 30 3D-MoRSE descriptors) reached convergence after only 5 learning cycles (see Figure 5 and Figure 6). The figures of merit of the three networks are presented in Table 1. They indicate that all the ANNs are very efficient. In particular, all the ANNs presented in this study are exceptionally sensitive, as they detect the NBOMe psychotropic drugs without exception (TPR = 100%, FNR = 0%). The systems are also very specific: the proportion of actual negatives recognized as such is very good for all networks (TNR ≥ 92.36%). The best results are obtained with 54_topo+3D_imp_ANN (TNR = 96.55%).
Only very few negatives are misclassified as NBOMes (FPR ≤ 7.64%). This figure of merit shows most clearly that the selection of variables based on their sensitivity, and especially on their importance, is a very useful step for optimizing the efficiency of these screening systems. The FPR obtained for 54_topo+3D_imp_ANN, the system built with the most important descriptors, is less than half of the FPR obtained for the 176_topo+3D_ANN system.
In addition, the systems built with selected variables have a better capacity to assign a class identity to the analyte than 176_topo+3D_ANN. Both 54_topo+3D_imp_ANN and 43_topo+3D_senz_ANN were able to classify all the samples, while 176_topo+3D_ANN failed to classify one sample (CR = 99.38%). Out of the classified samples, the systems built with selected descriptors correctly classify the same percentage of samples, which is 3% higher than the CCR of 176_topo+3D_ANN.
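The figures of merit discussed above follow directly from the confusion counts. In the sketch below, CR is taken as the fraction of samples that received a class assignment and CCR as the fraction of classified samples that were assigned correctly, as the text suggests. The counts themselves are hypothetical: they were chosen only because they are consistent with the reported TPR, TNR, FPR and CR values for 15 positives and 145 negatives, and they are not taken from Table 1.

```python
def figures_of_merit(tp, fn, tn, fp, unclassified=0):
    """Rates in percent from confusion counts; unclassified samples lower CR only."""
    classified = tp + fn + tn + fp
    total = classified + unclassified
    return {
        "TPR": 100.0 * tp / (tp + fn),        # true positive rate (sensitivity)
        "FNR": 100.0 * fn / (tp + fn),        # false negative rate
        "TNR": 100.0 * tn / (tn + fp),        # true negative rate (specificity)
        "FPR": 100.0 * fp / (tn + fp),        # false positive rate
        "CR": 100.0 * classified / total,     # classification rate
        "CCR": 100.0 * (tp + tn) / classified,  # correct classification rate
    }

# Hypothetical counts (assumption: the unclassified sample is a negative)
full_net = figures_of_merit(tp=15, fn=0, tn=133, fp=11, unclassified=1)
imp_net = figures_of_merit(tp=15, fn=0, tn=140, fp=5)

for name, fom in (("176 descriptors", full_net), ("54 descriptors", imp_net)):
    print(name, {key: round(value, 2) for key, value in fom.items()})
```

With these assumed counts the false positive rate of the reduced network is less than half that of the full network, which matches the comparison made in the text.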

Conclusions
The detection of NBOMe designer drugs is extremely important in forensic practice. The most important characteristic of any system screening for these drugs of abuse is its capacity to recognize the positive samples, which should not be missed under any circumstance. The results of this study show that 54_topo+3D_imp_ANN is the most efficient system screening for NBOMes. This artificial intelligence application can be used to predict and estimate the toxicity of novel compounds having a molecular structure similar to that of NBOMe psychotropic drugs of abuse. In this way, it may save the high costs of analytical and toxicological studies.
In addition, we should underline the benefits brought by mixing different types of molecular descriptors, in our case topological descriptors and 3D-MoRSE descriptors. This approach has led to selections of variables having higher importance / sensitivity than if the selections had been made with the same number of most important descriptors (54) or most sensitive ones (43) of a single type (only topological or only 3D-MoRSE descriptors). This is certainly one of the main reasons for the significant improvement of the