Implementing the K-Means Algorithm in Python

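K-means is the first unsupervised learning algorithm I have studied. It is a very widely used clustering algorithm and fairly simple to implement. I learned it from Andrew Ng's video lectures, so I will just note down the key points here.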

For the training set, I used make_blobs, the generator that ships with sklearn for producing multi-class, single-label data.
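As a quick sanity check on what make_blobs returns, here is a minimal sketch that uses the same centers and spread as the full listing in this post. The shape and label checks are only illustrative additions, not part of the original script.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three blob centers and a small spread, mirroring the values used in this post.
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels = make_blobs(n_samples=200, centers=centers, n_features=2,
                       cluster_std=0.3, random_state=0)

print(X.shape)            # (200, 2): 200 points with 2 features each
print(np.unique(labels))  # one integer label per generating center
```

X holds the coordinates and labels holds the index of the center each point was drawn around. K-means itself never sees labels; they are only used for coloring the plot.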

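The variables to initialize are: m, the number of training examples; Feature, the dimensionality of each example; K, the number of clusters to form; u, a K × Feature array holding the cluster centers; c, which stores each iteration's assignment result (the assigned cluster and the distance to its center); and uDict, a dictionary that stores the points assigned to each cluster.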
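To make the roles of u and c concrete, here is a rough vectorized sketch of a single K-means iteration. The helper name kmeans_step is hypothetical and not part of the listing in this post. Unlike the post's c, it keeps only the cluster index for each point, and leaving an empty cluster's center unchanged is an assumption made for this sketch.

```python
import numpy as np

def kmeans_step(X, u):
    """One K-means iteration: assign each point, then recompute the centers."""
    # dists[i, j] is the Euclidean distance from point i to center j.
    dists = np.linalg.norm(X[:, None, :] - u[None, :, :], axis=2)
    c = dists.argmin(axis=1)  # index of the nearest center for each point
    new_u = u.copy()
    for j in range(u.shape[0]):
        members = X[c == j]
        if len(members) > 0:  # assumption: keep the old center if a cluster is empty
            new_u[j] = members.mean(axis=0)
    return c, new_u

# Tiny hand-made example: two obvious groups of two points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
u = np.array([[0.0, 0.0], [10.0, 0.0]])
c, u = kmeans_step(X, u)
print(c)  # [0 0 1 1]
print(u)  # the centers move to (0, 0.5) and (10, 0.5)
```

Repeating this step until c stops changing is the whole algorithm. The explicit loops in the listing do the same work point by point.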

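Summary: because the dataset is small, the values of the distortion function show that the clustering is essentially settled after about three iterations, so K-means converges quickly on data this small and well separated.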
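For reference, the distortion function mentioned in the summary can be written out as follows. The notation is an assumption on my part: $x^{(i)}$ is the i-th training example, $c^{(i)}$ is the index of the cluster it is assigned to, and $\mu_k$ is the k-th cluster center (u in the code).

```latex
J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right)
  = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}
```

Neither the assignment step nor the center update can increase J, so the printed values never go up from one iteration to the next, and once they stop changing the assignments have stopped changing as well.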

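import numpy as np
import matplotlib.pyplot as plt
# make_blobs is imported from sklearn.datasets directly; the older
# sklearn.datasets.samples_generator path no longer exists in recent releases.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import random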

# Build a single-label dataset
center = [[1, 1], [-1, -1], [1, -1]]
cluster_std = 0.3
X, labels = make_blobs(n_samples=200, centers=center, n_features=2, cluster_std=cluster_std, random_state=0)
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
# Plot each generated blob in its own color
for k, col in zip(unique_labels, colors):
    x_k = X[labels == k]
    plt.plot(x_k[:, 0], x_k[:, 1], 'o', markerfacecolor=col, markeredgecolor="k", markersize=14)

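##########
Feature = 2  # number of features
m = 200  # number of training examples
# Initialize K and the cluster centers u
K = 3
u = np.empty([K, Feature])

# Use K distinct training points as the initial centers
# (sampling indices without replacement avoids picking the same point twice)
for i, idx in enumerate(random.sample(range(m), K)):
    u[i] = X[idx]

# c stores, for each point, the assigned cluster and the distance to its center
c = np.zeros([m, 2])

# Plot the initial cluster centers
t = np.transpose(u)
plt.plot(t[0], t[1], '+', markerfacecolor='g', markeredgecolor="k", markersize=14)

# Dictionary that stores the points assigned to each cluster
uDict = {}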


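# Move the cluster centers
def MoveK(c):
    u = np.empty([K, Feature])
    # Rebuild the cluster -> points mapping from the current assignments
    for i in range(K):
        uDict[i] = []
    for i in range(m):
        for j in range(K):
            if c[i][0] == j:
                uDict[j].append(X[i])

    # Each new center is the mean of the points assigned to it
    for i in range(K):
        total = np.zeros(Feature)
        for j in uDict[i]:
            total = np.add(total, j)
        u[i] = total / len(uDict[i])
    return u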


# Distortion function: mean squared distance from each point to its cluster center
def Distortion(u):
    total = 0
    for i in uDict.keys():
        for j in uDict[i]:
            dis = np.linalg.norm(j - u[i])
            total += dis * dis
    return total / m


# Start iterating
for step in range(5):
    # For each point X[i], find the cluster center u[j] closest to it
    for i in range(m):
        flag = True  # forces the first center to be accepted as the current best
        for j in range(K):
            dis = np.linalg.norm(X[i] - u[j])
            if flag or dis < c[i][1]:
                flag = False
                c[i][0] = j
                c[i][1] = dis
    u = MoveK(c)
    print(Distortion(u))

# Verify the result against sklearn's KMeans
print("my kmeans cluster centers:", u)
# n_init is passed explicitly because its default has changed between scikit-learn releases
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
kmeans_u = kmeans.cluster_centers_
print("sklearn kmeans cluster centers:", kmeans_u)

# Plot the final cluster centers
t = np.transpose(u)
plt.plot(t[0], t[1], '*', markerfacecolor='blue', markeredgecolor="k", markersize=14)
plt.show()

Output:

0.8093064467708514
0.2770795968584342
0.17288024424551154
0.17288024424551154
0.17288024424551154

my kmeans cluster centers: [[ 0.95712283 -1.02057236]
 [ 1.01281413  1.06595402]
 [-1.03507066 -1.03233287]]

sklearn kmeans cluster centers: [[ 0.95712283 -1.02057236]
 [ 1.01281413  1.06595402]
 [-1.03507066 -1.03233287]]
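Cluster indices are arbitrary, so two implementations can find the same centers and still list them in a different row order. The sketch below compares two sets of centers while ignoring order. The helper same_centers and its tol parameter are hypothetical names introduced here, not part of sklearn or of the script in this post.

```python
import numpy as np

def same_centers(a, b, tol=1e-6):
    """Return True if a and b contain the same centers, ignoring row order."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # d[i, j] is the distance between row i of a and row j of b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    # Every row of a must match a different row of b within the tolerance.
    all_distinct = len(set(nearest.tolist())) == len(a)
    return all_distinct and bool((d.min(axis=1) <= tol).all())

# The centers reported in this post, and the same rows in a different order.
mine = np.array([[0.95712283, -1.02057236],
                 [1.01281413, 1.06595402],
                 [-1.03507066, -1.03233287]])
shuffled = mine[[2, 0, 1]]
print(same_centers(mine, shuffled))        # True
print(same_centers(mine, shuffled + 0.1))  # False
```

In the script above, the equivalent check would be same_centers(u, kmeans_u), which is less fragile than comparing the printed arrays by eye.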

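[Figure: scatter plot of the three clusters, with the initial centers marked by + and the final centers marked by stars]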

The algorithm successfully moved the cluster centers from the + markers to the star markers.
