Example of core points, border points, and outliers.
p is a core object.
q is density-connected to p through r.
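To make the three point types concrete, here is a minimal sketch that labels points by hand with NumPy. It is not from the original notebook: the helper name `classify_points` and the toy coordinates are made up for illustration, and it assumes the usual convention (also used by scikit-learn) that a point counts as its own neighbor.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier' (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # neighbors[i, j] is True when j lies within eps of i (i itself included)
    neighbors = dists <= eps
    # Core point: at least min_pts points in its eps-neighborhood
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border point: not core, but inside the neighborhood of some core point
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    labels = np.where(is_core, "core", np.where(near_core, "border", "outlier"))
    return labels.tolist()

# Toy data: a tight square, one point hanging off its edge, one far away
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.25, 0.1], [3.0, 3.0]]
kinds = classify_points(points, eps=0.2, min_pts=4)
print(kinds)
```

The four corners of the square each have at least four points within eps, so they are core points. The fifth point has only three but sits next to a core point, which makes it a border point. The last point is an outlier.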
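Advantages | Disadvantages | Applications |
---|---|---|
Robust to outliers | Sensitive to eps and MinPts | Satellite imagery |
Works best at separating high-density clusters from low-density clusters | Not suitable if the dataset is too sparse | Anomaly detection |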
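The sensitivity to eps listed in the table is easy to see on toy data. The sketch below is not part of the original notebook: the dataset (two small grids and one isolated point) and the helper `summarize` are made up for illustration, and only the eps value changes between runs.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight 3x3 grids far apart, plus one isolated point
grid = np.array([[i * 0.1, j * 0.1] for i in range(3) for j in range(3)])
X = np.vstack([grid, grid + 5.0, [[2.5, 10.0]]])

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # label -1 marks noise
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

for eps in (0.05, 0.15, 20.0):
    print(eps, summarize(eps, min_samples=4))
```

With eps too small every point is noise, with eps too large everything merges into a single cluster, and only the middle value recovers the two groups and the lone outlier.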
import pandas as pd
import numpy as np
df = pd.read_csv("/content/players_20.csv")
df = df[['short_name','age', 'height_cm', 'weight_kg', 'overall', 'potential','value_eur', 'wage_eur', 'international_reputation', 'weak_foot','skill_moves', 'release_clause_eur', 'team_jersey_number','contract_valid_until', 'nation_jersey_number', 'pace', 'shooting','passing', 'dribbling', 'defending', 'physic', 'gk_diving','gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed','gk_positioning', 'attacking_crossing','attacking_finishing','attacking_heading_accuracy', 'attacking_short_passing','attacking_volleys', 'skill_dribbling', 'skill_curve','skill_fk_accuracy', 'skill_long_passing','skill_ball_control','movement_acceleration', 'movement_sprint_speed', 'movement_agility','movement_reactions', 'movement_balance', 'power_shot_power','power_jumping', 'power_stamina', 'power_strength', 'power_long_shots','mentality_aggression', 'mentality_interceptions','mentality_positioning', 'mentality_vision', 'mentality_penalties','mentality_composure', 'defending_marking', 'defending_standing_tackle','defending_sliding_tackle', 'goalkeeping_diving','goalkeeping_handling', 'goalkeeping_kicking','goalkeeping_positioning', 'goalkeeping_reflexes']]
df = df[df.overall > 86] # extracting players with overall above 86
df = df.fillna(df.mean(numeric_only=True)) # fill missing values with column means (short_name is text, so skip it)
names = df.short_name.tolist() # saving names for later
df = df.drop(['short_name'], axis = 1) # drop the short_name column
df.head()
from sklearn import preprocessing
x = df.values # numpy array
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)
X_norm = pd.DataFrame(x_scaled)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2) # 2D PCA for the plot
reduced = pd.DataFrame(pca.fit_transform(X_norm))
from sklearn.cluster import DBSCAN
# train the model using DBSCAN
db = DBSCAN(eps=1, min_samples=5)
# the prediction for dbscan clusters
db_clusters = db.fit_predict(reduced)
reduced['cluster'] = db_clusters
reduced['name'] = names
reduced.columns = ['x', 'y', 'cluster', 'name']
reduced.head()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="white")
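# 'height' replaces the 'size' argument, which was renamed in newer seaborn releases
ax = sns.lmplot(x="x", y="y", hue='cluster', data = reduced, legend=False, fit_reg=False, height = 10, scatter_kws={"s": 250})
texts = []
# label every point with the player's name
for x, y, s in zip(reduced.x, reduced.y, reduced.name):
    texts.append(plt.text(x, y, s))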
ax.set(ylim=(-2, 2))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)
plt.show()
DBSCAN with Eps=1 and MinPts=5
from sklearn.neighbors import NearestNeighbors
# calculate the distance from each point to its closest neighbor
nn = NearestNeighbors(n_neighbors = 2)
# fit the nearest neighbor
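nbrs = nn.fit(reduced[['x', 'y']]) # fit on the two PCA coordinates only; 'cluster' and 'name' are not features
# returns two arrays - distance to the closest n_neighbors points and index for each point
distances, indices = nbrs.kneighbors(reduced[['x', 'y']])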
# sort the distance and plot it
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
Finding the optimal ε
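The elbow of the sorted nearest-neighbor distance curve is usually read off the plot by eye. As a rough cross-check, the sketch below picks it programmatically. It is an assumption of mine rather than the article's method: `k_distance_curve` and `elbow_value` are hypothetical helpers, the data is a toy set, and the "farthest from the chord" rule is just one common heuristic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, k])

def elbow_value(curve):
    """Heuristic elbow: the curve value that lies farthest below the chord
    joining the first and last points (both axes rescaled to [0, 1])."""
    curve = np.asarray(curve, dtype=float)
    span = curve[-1] - curve[0]
    if len(curve) < 3 or span == 0:
        return float(curve[-1])  # flat or tiny curve: no elbow to find
    xn = np.linspace(0.0, 1.0, len(curve))
    yn = (curve - curve[0]) / span
    idx = int(np.argmax(xn - yn))
    return float(curve[idx])

# Toy data: two tight 3x3 grids (spacing 0.1) and one isolated point
grid = np.array([[i * 0.1, j * 0.1] for i in range(3) for j in range(3)])
X = np.vstack([grid, grid + 5.0, [[2.5, 10.0]]])

curve = k_distance_curve(X, k=1)
eps_guess = elbow_value(curve)
print(round(eps_guess, 3))
```

Treat the returned value as a starting point for eps and still look at the plot, since real curves are noisier than this toy example.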
from sklearn.cluster import DBSCAN
# train the model using DBSCAN
db = DBSCAN(eps=0.3, min_samples=4)
# prediction for dbscan clusters
db_clusters = db.fit_predict(reduced[['x', 'y']]) # cluster on the PCA coordinates only; 'cluster' and 'name' are not features
reduced['cluster'] = db_clusters
reduced['name'] = names
reduced.columns = ['x', 'y', 'cluster', 'name']
reduced.head()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="white")
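# 'height' replaces the 'size' argument, which was renamed in newer seaborn releases
ax = sns.lmplot(x="x", y="y", hue='cluster', data = reduced, legend=False, fit_reg=False, height = 9, scatter_kws={"s": 250})
texts = []
# label every point with the player's name
for x, y, s in zip(reduced.x, reduced.y, reduced.name):
    texts.append(plt.text(x, y, s))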
ax.set(ylim=(-2, 2))
plt.tick_params(labelsize=10)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)
plt.show()
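DBSCAN plot with Eps=0.3 and MinPts=4
DBSCAN with Eps=0.3 and MinPts=4 does a much better job of grouping the players and detecting outliers!
Thanks for reading, and I hope you found this helpful!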
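To check the grouping without squinting at the plot, the cluster labels can be turned into lists of names. This is a minimal sketch with made-up player names and labels, not output from the notebook. The helper `group_by_cluster` is hypothetical, and it relies only on the scikit-learn convention that DBSCAN labels noise points as -1.

```python
from collections import defaultdict

def group_by_cluster(names, labels):
    """Map each cluster label to the list of names assigned to it."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)

# Hypothetical names and labels, for illustration only
toy_names = ["player_a", "player_b", "player_c", "player_d", "player_e"]
toy_labels = [0, 0, 1, -1, 1]

groups = group_by_cluster(toy_names, toy_labels)
outliers = groups.get(-1, [])  # DBSCAN marks noise points with label -1
print(groups)
print("outliers:", outliers)
```

In the notebook the same call would presumably be `group_by_cluster(reduced['name'], reduced['cluster'])`.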
GitHub Repo: https://github.com/importdata/Clustering-FIFA-20-Players
About the author:
Jaemin Lee is a recent data science graduate specializing in data analytics and data science.