Machine Learning in Action (Part 2)

k-Nearest Neighbors: An Overview

The k-nearest neighbors (kNN) algorithm classifies samples by measuring the distances between their feature values.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: high computational complexity, high space complexity
Works with: numeric and nominal values

Preparing: Importing Data with Python

First, a simple program to see how Python parses and loads the data:

```python
from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
```
```python
group, labels = createDataSet()
group
```
```
array([[1. , 1.1],
       [1. , 1. ],
       [0. , 0. ],
       [0. , 0.1]])
```
```python
labels
```
```
['A', 'A', 'B', 'B']
```

Implementing the kNN Classification Algorithm

For each point in the dataset whose class is unknown, do the following:

  1. Compute the distance between the current point and each point in the dataset with known classes
  2. Sort the distances in ascending order
  3. Take the k points nearest to the current point
  4. Count the frequency of each class among those k points
  5. Return the most frequent of those classes as the predicted class of the current point

Distances are computed with the Euclidean distance formula:

$$d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$
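Before the full implementation, the distance step alone can be sketched with NumPy broadcasting (the classify0 code that follows uses tile() to the same effect):

```python
import numpy as np

# Distance from a query point to every row of the small dataset above;
# broadcasting subtracts inX from each row, so no explicit tile() is needed.
inX = np.array([0.0, 0.0])
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
print(distances)  # the two smallest distances belong to the 'B' points
```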

The code is as follows:

```python
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # replicate inX into dataSetSize rows and subtract the dataset
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, descending, and return the winner
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```

classify0 takes four parameters:

inX: the vector to classify
dataSet: the training set features
labels: the training set labels
k: the number of nearest neighbors to use

To verify the algorithm, classify the point (0, 0); the correct class is B:

```python
classify0([0, 0], group, labels, 3)
```
```
'B'
```
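As an aside, the vote-counting loop in classify0 can be written more compactly with collections.Counter from the standard library. This is a sketch of an equivalent implementation, not the book's code:

```python
from collections import Counter
import numpy as np

def classify0_counter(inX, dataSet, labels, k):
    # Same algorithm as classify0: rank training points by Euclidean
    # distance, then majority-vote among the k nearest labels.
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    kNearest = [labels[i] for i in distances.argsort()[:k]]
    return Counter(kNearest).most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0_counter([0, 0], group, labels, 3))  # 'B'
```

Counter.most_common(1) does the sort-and-take-first step in one call; for ties it keeps first-insertion order, much like the book's sorted() approach.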

Improving Matches on a Dating Site with k-Nearest Neighbors

The dating site divides samples into 3 classes using three features:

Features: frequent flyer miles earned per year; percentage of time spent playing video games; liters of ice cream consumed per week
Classes: largeDoses, smallDoses, didntLike

Preparing the Data: Parsing Data from a Text File

datingTestSet2.txt holds 1000 rows of data, with rows separated by \n and columns by \t. We need to convert this file into a feature matrix and a label vector:

```python
def file2matrix(filename):
    fr = open(filename, 'r')
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    fr.close()
    return returnMat, classLabelVector
```
```python
datingDataMat, datingLabels = file2matrix('./MLiA_SourceCode/machinelearninginaction/Ch02/datingTestSet2.txt')
datingDataMat
```
```
array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
       ...,
       [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
       [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
```
```python
datingLabels[:20]
```
```
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
```

Analyzing the Data: Scatter Plots with matplotlib

Plot the second and third columns of the data:

x axis: percentage of time spent playing video games
y axis: liters of ice cream consumed per week
```python
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
plt.show()
```

(figure: scatter plot of game time vs. ice cream consumption)

The plot above does not reveal much, so mark the samples of each class with a different color:

```python
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()
```

(figure: the same scatter plot, with classes distinguished by color and marker size)

With color and marker size, the rough outlines of the three classes' regions become visible.

Preparing the Data: Normalizing Values

The flyer-miles values are far larger than the other two features, so they would dominate the computed distances. Since we treat the three features as equally important (equal weight), flyer miles should not overwhelm the result.

To handle features with different value ranges, we first normalize them, mapping each value into the range 0 to 1 or -1 to 1. The usual formula is

$$newValue = (oldValue - min) / (max - min)$$

The code:

```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minima
    maxVals = dataSet.max(0)   # column-wise maxima
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
```
```python
normMat, ranges, minVals = autoNorm(datingDataMat)
normMat
```
```
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])
```
```python
ranges
```
```
array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
```
```python
minVals
```
```
array([0.      , 0.      , 0.001156])
```
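Incidentally, the tile() calls in autoNorm are not strictly necessary: NumPy broadcasting performs the same column-wise subtraction and division. A sketch with made-up numbers:

```python
import numpy as np

# Three rows in the same spirit as the dating data (values invented).
data = np.array([[40920.0, 8.3, 0.95],
                 [14488.0, 7.2, 1.67],
                 [26052.0, 1.4, 0.81]])
minVals = data.min(axis=0)                 # column-wise minima
ranges = data.max(axis=0) - minVals        # column-wise (max - min)
norm = (data - minVals) / ranges           # broadcasting, no tile() needed
print(norm.min(axis=0), norm.max(axis=0))  # each column now spans [0, 1]
```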

Testing the Algorithm: A Complete Program to Validate the Classifier

Use 90% of the data as the training set and 10% as the test set, record every misclassification, and divide the total number of errors by the number of test samples to get the error rate.

```python
def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('./MLiA_SourceCode/machinelearninginaction/Ch02/datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        # print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
```
```python
datingClassTest()
```
```
the total error rate is: 0.050000
```

The error rate computed here is 5%, while the book reports 2.4%. Worried that I had made a mistake, I also ran the book's source code and got an error rate of 6.6%.

Using the Algorithm: Building a Complete Usable System

A Handwriting Recognition System

Now for the second example: recognizing handwritten digits. The data has already been converted from images into text format by graphics software. First open a few files to take a look:

```python
def readTrainingDigits(filename):
    fr = open(filename, 'r')
    data = fr.read()
    print(data)
    fr.close()
```
```python
readTrainingDigits('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits/0_0.txt')
readTrainingDigits('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits/1_0.txt')
readTrainingDigits('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits/2_0.txt')
```
00000000000001111000000000000000
00000000000011111110000000000000
00000000001111111111000000000000
00000001111111111111100000000000
00000001111111011111100000000000
00000011111110000011110000000000
00000011111110000000111000000000
00000011111110000000111100000000
00000011111110000000011100000000
00000011111110000000011100000000
00000011111100000000011110000000
00000011111100000000001110000000
00000011111100000000001110000000
00000001111110000000000111000000
00000001111110000000000111000000
00000001111110000000000111000000
00000001111110000000000111000000
00000011111110000000001111000000
00000011110110000000001111000000
00000011110000000000011110000000
00000001111000000000001111000000
00000001111000000000011111000000
00000001111000000000111110000000
00000001111000000001111100000000
00000000111000000111111000000000
00000000111100011111110000000000
00000000111111111111110000000000
00000000011111111111110000000000
00000000011111111111100000000000
00000000001111111110000000000000
00000000000111110000000000000000
00000000000011000000000000000000

00000000000000001111000000000000
00000000000000011111111000000000
00000000000000011111111100000000
00000000000000011111111110000000
00000000000000011111111110000000
00000000000000111111111100000000
00000000000000111111111100000000
00000000000001111111111100000000
00000000000000111111111100000000
00000000000000111111111100000000
00000000000000111111111000000000
00000000000001111111111000000000
00000000000011111111111000000000
00000000000111111111110000000000
00000000001111111111111000000000
00000001111111111111111000000000
00000011111111111111110000000000
00000111111111111111110000000000
00000111111111111111110000000000
00000001111111111111110000000000
00000001111111011111110000000000
00000000111100011111110000000000
00000000000000011111110000000000
00000000000000011111100000000000
00000000000000111111110000000000
00000000000000011111110000000000
00000000000000011111110000000000
00000000000000011111111000000000
00000000000000011111111000000000
00000000000000011111111000000000
00000000000000000111111110000000
00000000000000000111111100000000

00000000001111111000000000000000
00000000011111111100000000000000
00000000011111111110000000000000
00000000011111111111100000000000
00000000111111111111100000000000
00000001111111111111110000000000
00000011111110001111110000000000
00000001111110000111111000000000
00000001111110000111111000000000
00000001111110000111111000000000
00000001111100000111111000000000
00000001111110000011111100000000
00000001111111000011111100000000
00000000111111000011111000000000
00000000111110000111111000000000
00000000001110000011111100000000
00000000000000000011111000000000
00000000000000000111110000000000
00000000000000000111111000000000
00000000000000001111110000000000
00000000000000011111110000000000
00000000000000111111100000000000
00000000000000011111110000000000
00000000000000111111100000000000
00000000000001111111000000000000
00000000000011111110000000000000
00000000001111111111111111111000
00000000011111111111111111111100
00000000111111111111111111111100
00000000011111111111111111111100
00000000001111111111111111111100
00000000000111111111111111110000

Each file is a 32×32 matrix of digits, and the outline of each handwritten digit is clearly visible.

Preparing the Data: Converting Images to Test Vectors

As shown above, each file is a 32×32 matrix of digits. To use the classifier written earlier, each image must be flattened into a 1×1024 vector:

```python
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            # row i, column j maps to flat index 32*i + j
            returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect
```
```python
testVector = img2vector('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/testDigits/0_13.txt')
testVector[0, 0:31]
```
```
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
```
```python
testVector[0, 32:63]
```
```
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
```

Testing the Algorithm: Recognizing Handwritten Digits with k-Nearest Neighbors

```python
import os
from tqdm import trange

def handwritingClassTest():
    hwLabels = []
    trainingFileList = os.listdir('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/trainingDigits/%s' % fileNameStr)
    testFileList = os.listdir('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in trange(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('./MLiA_SourceCode/machinelearninginaction/Ch02/digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        # print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
```
```python
handwritingClassTest()
```
```
100%|██████████| 946/946 [00:30<00:00, 31.30it/s]

the total number of errors is: 10

the total error rate is: 0.010571
```

The error rate is only about 1%.

Summary

The k-nearest neighbors algorithm is the simplest effective algorithm for classifying data. To use it, we need training samples that are close to the real data, and we must keep the entire dataset around: if the training set is large, it consumes a lot of storage. Moreover, because the distance to every sample in the dataset must be computed, classification can be very slow in practice. Another drawback is that kNN reveals nothing about the underlying structure of the data, so we cannot tell what an average or typical sample of each class looks like.
