log P (the octanol-water partition coefficient) is one of the most important properties for deciding whether a compound is suitable as a drug. Most of the regression models currently available for in silico log P prediction are trained on experimentally measured log P values from the PHYSPROP database. However, most of the compounds in that database are not very representative of drug-like chemical space, and unfortunately there is a lack of publicly available experimental log P datasets that could be used to train better prediction tools.
This exercise uses the experimental log P data published in the paper "Large, chemically diverse dataset of log P measurements for benchmarking studies" [1].
To date, many of the available tools for log P prediction rely on physical descriptors such as atom-type counts, polar surface area, or topological descriptors. Here, several physical descriptors as well as structural fingerprints will be computed for each molecule, and their performance will be benchmarked with three different regression models (a neural network, a random forest, and a support vector machine).
Import libraries and the utility module
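The import cell itself is not preserved in this copy; the following is a minimal sketch of the imports assumed by the rest of the code. Everything listed is used later in the notebook, and the from utility import FeatureGenerator line assumes the module name of the gist linked below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# RDKit for molecule handling, descriptors, and fingerprints
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# scikit-learn models and metrics used throughout
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# FeatureGenerator (TPATF/TPAPF wrappers around MayaChemTools) from utility.py
from utility import FeatureGenerator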
pca = PCA(n_components=3)
res = pca.fit_transform(X)
# Scatter plot of the first two principal components, colored by the binary labels in Y
col = {0: 'blue', 1: 'red'}
color = [col[int(i)] for i in Y]  # np.int is deprecated; use the built-in int
plt.figure(figsize=(10, 7))
plt.scatter(res[:, 0], res[:, 1], c=color, alpha=0.5)
utility.py is available at: https://gist.github.com/ravila4/a18bb9d0e3f54de3c9fb8f75908f992f
Load the data
data = pd.read_csv("logp_759_data.csv", encoding="latin-1")  # original used encoding='ANSI', which Python does not recognize; adjust to the file's actual encoding
# Filter for validated results
data_logp = data[data.Status == "Validated"]
print("Shape:", data_logp.shape)
data_logp.head()
Convert the SMILES strings to RDKit Mol objects
data_logp['molecules'] = data_logp.SMILES.apply(Chem.MolFromSmiles)
data_logp.head()
Compute descriptors
# Apply the RDKit descriptor functions to the Mol objects created above
molecules = data_logp.molecules

data_logp.loc[:, 'MolLogP'] = molecules.apply(Descriptors.MolLogP)
data_logp.loc[:, 'HeavyAtomCount'] = molecules.apply(Descriptors.HeavyAtomCount)
data_logp.loc[:, 'HAccept'] = molecules.apply(Descriptors.NumHAcceptors)
data_logp.loc[:, 'Heteroatoms'] = molecules.apply(Descriptors.NumHeteroatoms)
data_logp.loc[:, 'HDonor'] = molecules.apply(Descriptors.NumHDonors)
data_logp.loc[:, 'MolWt'] = molecules.apply(Descriptors.MolWt)
data_logp.loc[:, 'RotableBonds'] = molecules.apply(Descriptors.NumRotatableBonds)
data_logp.loc[:, 'RingCount'] = molecules.apply(Descriptors.RingCount)
data_logp.loc[:, 'Ipc'] = molecules.apply(Descriptors.Ipc)
data_logp.loc[:, 'HallKierAlpha'] = molecules.apply(Descriptors.HallKierAlpha)
data_logp.loc[:, 'NumValenceElectrons'] = molecules.apply(Descriptors.NumValenceElectrons)
data_logp.loc[:, 'SaturatedRings'] = molecules.apply(Descriptors.NumSaturatedRings)
data_logp.loc[:, 'AliphaticRings'] = molecules.apply(Descriptors.NumAliphaticRings)
data_logp.loc[:, 'AromaticRings'] = molecules.apply(Descriptors.NumAromaticRings)
As a baseline, evaluate how well RDKit's calculated MolLogP matches the experimental log P values.
r2 = r2_score(data_logp.logPexp, data_logp.MolLogP)
mse = mean_squared_error(data_logp.logPexp, data_logp.MolLogP)
plt.scatter(data_logp.logPexp, data_logp.MolLogP, label="MSE: {:.2f}\nR^2: {:.2f}".format(mse, r2))
plt.legend()
plt.show()
RDKit's log P prediction shows a fairly high mean squared error and a weak coefficient of determination on this dataset. RDKit's MolLogP implementation is based on atomic contributions. Therefore, we will first try training our own simple log P model using the RDKit physical descriptors generated above.
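As a small illustration of that atom-contribution scheme (this snippet is not part of the original workflow, and phenol is used purely as an example molecule), the per-atom Crippen contributions that MolLogP sums can be inspected directly:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, an arbitrary example
# Each tuple holds the (logP, molar refractivity) contribution of one atom;
# summing the logP terms reproduces Descriptors.MolLogP(mol)
contribs = rdMolDescriptors._CalcCrippenContribs(mol)
for atom, (logp_c, mr_c) in zip(mol.GetAtoms(), contribs):
    print(atom.GetSymbol(), round(logp_c, 3))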
Training a simple descriptor model
X = data_logp.iloc[:, 8:]
y = data_logp.logPexp
X.head()
Random forest model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Default models
models = {"rf": RandomForestRegressor(n_estimators=100, random_state=42)}
scores = {}
for m in models:
    models[m].fit(X_train, y_train)
    scores[m + "_train"] = models[m].score(X_train, y_train)
    y_pred = models[m].predict(X_test)
    scores[m + "_test"] = r2_score(y_test, y_pred)
    scores[m + "_mse_test"] = mean_squared_error(y_test, y_pred)
Scores of our model:
scores = pd.Series(scores).T
scores
rf_train       0.940748
rf_test        0.597391
rf_mse_test    0.609630
dtype: float64
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
plt.scatter(y_test, y_pred, label="MSE: {:.2f}\nR^2: {:.2f}".format(mse, r2))
plt.legend()
plt.show()
Using these descriptors with scikit-learn's default random forest already achieves a higher R² and a lower MSE than RDKit's log P prediction. However, this is most likely due to differences between the dataset used here and the training set used to develop RDKit's model. Performance could be improved further by tuning the random forest parameters, and then measured on the PHYSPROP dataset.
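As a rough sketch of what such tuning could look like (this block is an illustration added here, not part of the original notebook; the parameter grid values are arbitrary examples, not tuned recommendations):

from sklearn.model_selection import GridSearchCV

# Example grid over a few random forest hyperparameters
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid,
                      scoring="neg_mean_squared_error",
                      cv=5,
                      n_jobs=-1)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test R^2:", r2_score(y_test, search.best_estimator_.predict(X_test)))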
Computing fingerprints
Having seen the performance of the simple molecular descriptors, we would like to evaluate some of the most popular molecular fingerprints. Among the many available methods, we will test Morgan fingerprints (ECFP4 and ECFP6), RDKit fingerprints, and topological pharmacophore fingerprints (TPATF and TPAPF); scripts for the latter are available from MayaChemTools.
MayaChemTools: http://www.mayachemtools.org
import multiprocessing
from joblib import Parallel, delayed
def applyParallel(df, func):
    n_jobs = multiprocessing.cpu_count()
    groups = np.array_split(df, n_jobs)
    results = Parallel(n_jobs)(delayed(lambda g: g.apply(func))(group) for group in groups)
    return pd.concat(results)
fps = {"ECFP4": molecules.apply(lambda m: AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048)), "ECFP6": molecules.apply(lambda m: AllChem.GetMorganFingerprintAsBitVect(m, radius=3, nBits=2048)), "RDKFP": molecules.apply(lambda m: AllChem.RDKFingerprint(m, fpSize=2048)), "TPATF": applyParallel(data_logp.SMILES, lambda m: FeatureGenerator(m).toTPATF()), "TPAPF": applyParallel(data_logp.SMILES, lambda m: FeatureGenerator(m).toTPAPF())}
Building baseline models with the different fingerprints
y = data_logp.logPexp
# Default models
models = {"rf": RandomForestRegressor(n_estimators=100, random_state=42),
          "nnet": MLPRegressor(random_state=42),
          "svr": SVR(gamma='auto')}
scores = {}
for f in fps:
    scores[f] = {}
    # Convert fps to 2D numpy array
    X = np.array(fps[f].tolist())
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    for m in models:
        models[m].fit(X_train, y_train)
        #scores[f][m + "_r2_train"] = models[m].score(X_train, y_train)
        y_pred = models[m].predict(X_test)
        scores[f][m + "_r2_test"] = r2_score(y_test, y_pred)
        scores[f][m + "_mse_test"] = mean_squared_error(y_test, y_pred)
scores_df = pd.DataFrame(scores).T
scores_df
Overall, the TPATF fingerprint performs best, even outperforming the simple descriptor model. Among the regression methods, the default random forest gives the best results, although this ranking may well change after some optimization of the model hyperparameters.
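As a closing sketch (again an addition for illustration, not from the original notebook), the best-performing combination, a random forest on TPATF fingerprints, could be applied to a new molecule roughly as follows. The aspirin SMILES and the retraining on the full dataset are assumptions of this example, and it presumes that FeatureGenerator(...).toTPATF() returns a plain feature vector, as it is used above.

# Retrain on all TPATF fingerprints, then predict log P for a new SMILES string
X_tpatf = np.array(fps["TPATF"].tolist())
best_rf = RandomForestRegressor(n_estimators=100, random_state=42)
best_rf.fit(X_tpatf, y)

new_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, chosen only as an example
new_fp = np.array(FeatureGenerator(new_smiles).toTPATF()).reshape(1, -1)
print("Predicted log P:", best_rf.predict(new_fp)[0])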
References