Construction of Prediction Models for Classification of New Psychoactive Substances Based on EI-MS Data and Machine Learning

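  • Summary: The structures of new psychoactive substances change rapidly, which challenges screening and identification methods that rely on reference standards and mass spectral databases. This study uses machine learning to provide a new strategy for the structural identification of unknown new psychoactive substances. k-nearest neighbor, support vector machine, random forest, and artificial neural network models were built on a dataset of 871 mass spectra to predict the structural class of new psychoactive substances. Model hyperparameters were optimized by grid search with 5-fold cross-validation, and the four classification models were evaluated using the confusion matrix, accuracy, precision, recall, and F-score. The random forest model showed the best predictive ability, with an overall accuracy of 89.27%; it predicts the structural class of unknown compounds well and thus provides a basis for their structural identification.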

     

    Abstract: New psychoactive substances (NPS) have become a global health and social problem. Their structures are variable and can be easily modified to produce new compounds. Traditional analytical techniques mostly rely on reference standards and mass spectrometry databases. As the structural diversity of NPS increases, these databases cannot cover the mass spectra of all possible NPS, which makes the structural identification of completely unknown compounds difficult. Advances in machine learning offer a potential solution to this dilemma. In this study, k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and artificial neural network (ANN) models were constructed from a dataset of mass spectra of 871 compounds and used to predict the structural class of NPS. The dataset was divided into training and test sets at a ratio of 7:3; the fit method was invoked on the training set to construct each model and train its parameters, and the generalization ability of each model was evaluated on the test set. A grid search with 5-fold cross-validation was used to optimize the hyperparameters of the models. The performance of the four classification models was evaluated on the 261 test-set samples using the confusion matrix, accuracy, precision, recall, and F-score. Overall, the RF model gave the best classification of the seven NPS classes and the negative samples, with an overall accuracy of 89.27%, compared with 79.31%, 83.14%, and 83.52% for the KNN, SVM, and ANN models, respectively. The RF model was also highly accurate for specific NPS classes: the accuracies for synthetic cathinones, fentanyl, synthetic cannabinoids, and benzodiazepines were 100%, 93%, 95%, and 100%, respectively, indicating that it can reliably predict the structural class of unknown compounds. In conclusion, this study develops a strategy for the rapid analysis of NPS that applies machine learning algorithms to mass spectral datasets and predicts the structural class of unknown compounds, thus providing a basis for the structural identification of unknown psychoactive compounds.
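The workflow described in the abstract (7:3 split, grid search with 5-fold cross-validation, then confusion matrix, accuracy, precision, recall, and F-score on the held-out set) can be illustrated with a minimal sketch. This is not the study's code: the abstract does not give the feature encoding, the hyperparameter grids, or the software used. The sketch assumes spectra are represented as fixed-length intensity vectors, uses synthetic data, and substitutes a small pure-Python KNN classifier for the four models. All function names, the grid of k values, and the three toy classes are hypothetical.

```python
import math
import random
from collections import Counter

def make_toy_spectra(n_per_class=30, n_bins=20, n_classes=3, seed=0):
    """Synthetic stand-in for binned EI-MS intensity vectors (NOT real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        peaks = rng.sample(range(n_bins), 3)  # hypothetical characteristic bins
        for _ in range(n_per_class):
            spectrum = [rng.random() * 0.2 for _ in range(n_bins)]  # baseline noise
            for p in peaks:
                spectrum[p] += 0.5 + rng.random() * 0.5
            X.append(spectrum)
            y.append(label)
    return X, y

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(X, y, k, folds=5):
    """Mean validation accuracy over `folds` interleaved folds."""
    scores = []
    for f in range(folds):
        val = list(range(f, len(X), folds))
        val_set = set(val)
        tr = [i for i in range(len(X)) if i not in val_set]
        tr_X, tr_y = [X[i] for i in tr], [y[i] for i in tr]
        hits = sum(knn_predict(tr_X, tr_y, X[i], k) == y[i] for i in val)
        scores.append(hits / len(val))
    return sum(scores) / folds

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_scores(cm):
    """Return {class: (precision, recall, f_score)} from a confusion matrix."""
    scores = {}
    for c in range(len(cm)):
        tp = cm[c][c]
        predicted = sum(row[c] for row in cm)  # column sum
        actual = sum(cm[c])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores[c] = (precision, recall, 2 * precision * recall / denom if denom else 0.0)
    return scores

# 1. Shuffle and split 7:3 into training and test sets.
X, y = make_toy_spectra()
order = list(range(len(X)))
random.Random(1).shuffle(order)
cut = round(0.7 * len(order))
X_train, y_train = [X[i] for i in order[:cut]], [y[i] for i in order[:cut]]
X_test, y_test = [X[i] for i in order[cut:]], [y[i] for i in order[cut:]]

# 2. Grid search: pick the k with the best 5-fold cross-validated accuracy
#    on the training set only (the grid itself is an assumption).
grid = [1, 3, 5, 7]
best_k = max(grid, key=lambda k: cross_val_accuracy(X_train, y_train, k))

# 3. Evaluate the selected model once on the held-out test set.
y_pred = [knn_predict(X_train, y_train, x, best_k) for x in X_test]
cm = confusion_matrix(y_test, y_pred, n_classes=3)
overall_accuracy = sum(cm[c][c] for c in range(3)) / len(y_test)
scores = per_class_scores(cm)

print("best k:", best_k)
print("confusion matrix:", cm)
print("overall accuracy:", round(overall_accuracy, 4))
for c, (p, r, f) in scores.items():
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

The hyperparameter is chosen using the training set alone, so the test set remains an unbiased estimate of generalization, which is the purpose of the split described in the abstract. As a consistency check on the reported figures, an overall accuracy of 89.27% on 261 test samples corresponds to 233 correct predictions (233/261 ≈ 0.8927).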

     
