ex6-SVM

Andrew Ng Machine Learning Exercise ex6 - SVM

Exercise data

Linear SVM

#!/usr/bin/python
# coding=utf-8

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
from sklearn import svm

# Effect of different values of C on a simple two-dimensional dataset
raw_data = loadmat('data/ex6data1.mat')
print(raw_data)

data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']

positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]

fig, ax = plt.subplots(figsize=(12,8))

ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
plt.show()

svc = svm.LinearSVC(C=1, loss='hinge', max_iter=1000)
print(svc)

# First, the result with C=1
svc.fit(data[['X1', 'X2']], data['y'])
score = svc.score(data[['X1', 'X2']], data['y'])
print(score) # 0.9803921568627451

# Now with C=100
svc2 = svm.LinearSVC(C=100, loss='hinge', max_iter=1000)
svc2.fit(data[['X1', 'X2']], data['y'])
score2 = svc2.score(data[['X1', 'X2']], data['y'])
print(score2) # 0.9411764705882353 (the exact value may vary between runs)

data['SVM 1 Confidence'] = svc.decision_function(data[['X1', 'X2']])
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 1 Confidence'], cmap='seismic')
ax.set_title('SVM (C=1) Decision Confidence')
plt.show()

data['SVM 2 Confidence'] = svc2.decision_function(data[['X1', 'X2']])

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 2 Confidence'], cmap='seismic')
ax.set_title('SVM (C=100) Decision Confidence')
plt.show()

[Figure: scatter plot of the ex6data1 training data]

[Figure: SVM (C=1) decision confidence]

[Figure: SVM (C=100) decision confidence]
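
The decision boundary of a fitted LinearSVC can also be drawn directly, since the model exposes its weights and intercept. The following is a minimal sketch, not part of the original exercise, assuming the `data`, `svc` and `svc2` objects from the block above are still in scope:

```python
# Sketch: draw the separating lines w.x + b = 0 of the two fitted LinearSVC models
def plot_linear_boundary(clf, ax, label):
    w = clf.coef_[0]          # [w1, w2]
    b = clf.intercept_[0]
    x1 = np.linspace(data['X1'].min(), data['X1'].max(), 100)
    # Solve w1*x1 + w2*x2 + b = 0 for x2
    x2 = -(w[0] * x1 + b) / w[1]
    ax.plot(x1, x2, label=label)

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
plot_linear_boundary(svc, ax, 'C=1')
plot_linear_boundary(svc2, ax, 'C=100')
ax.legend()
plt.show()
```

With C=100 the line bends toward the single outlier on the left, which is exactly the overfitting behaviour the exercise wants to illustrate.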

Gaussian kernel

# Gaussian (RBF) kernel
def gaussian_kernel(x1, x2, sigma):
    return np.exp(-(np.sum((x1 - x2) ** 2) / (2 * (sigma ** 2))))


x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2

print(gaussian_kernel(x1, x2, sigma))

# 0.32465246735834974
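
This hand-written kernel agrees with scikit-learn's built-in RBF kernel when gamma = 1 / (2 * sigma^2). A quick sanity check, as a sketch (the `rbf_kernel` helper and the reshapes are my own addition, not part of the exercise):

```python
# Check: exp(-gamma * ||x1 - x2||^2) with gamma = 1 / (2 * sigma^2)
from sklearn.metrics.pairwise import rbf_kernel

gamma = 1.0 / (2 * sigma ** 2)
print(rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma))  # ~0.3247
```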

Nonlinear decision boundary

raw_data = loadmat('data/ex6data2.mat')
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']

positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['X1'], positive['X2'], s=30, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=30, marker='o', label='Negative')
ax.legend()
plt.show()

[Figure: scatter plot of the ex6data2 training data]

For this dataset we will build an SVM classifier using the built-in RBF kernel and check its accuracy on the training data. To visualize the decision boundary, this time we shade the points according to the predicted probability that an instance has the negative class label. As the result shows, most of them are classified correctly.

svc = svm.SVC(C=100, gamma=10, probability=True)
print(svc)

svc.fit(data[['X1', 'X2']], data['y'])
print(svc.score(data[['X1', 'X2']], data['y']))
data['Probability'] = svc.predict_proba(data[['X1', 'X2']])[:,0]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=30, c=data['Probability'], cmap='Reds')
plt.show()

[Figure: points shaded by the SVM's predicted negative-class probability]
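
Besides shading the points, the nonlinear boundary itself can be drawn by evaluating the fitted classifier on a dense grid. This is an optional sketch, assuming `svc` and `data` from the block above are still in scope:

```python
# Sketch: visualize the decision boundary by predicting over a grid
x1_grid, x2_grid = np.meshgrid(
    np.linspace(data['X1'].min(), data['X1'].max(), 200),
    np.linspace(data['X2'].min(), data['X2'].max(), 200))
grid = np.c_[x1_grid.ravel(), x2_grid.ravel()]
Z = svc.predict(grid).reshape(x1_grid.shape)

fig, ax = plt.subplots(figsize=(12, 8))
ax.contour(x1_grid, x2_grid, Z, levels=[0.5], colors='k')   # boundary between 0 and 1
ax.scatter(data['X1'], data['X2'], s=30, c=data['y'], cmap='coolwarm')
plt.show()
```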

Searching for the best parameters

# Search for the best parameters on the validation set
raw_data = loadmat('data/ex6data3.mat')
X = raw_data['X']
Xval = raw_data['Xval']
y = raw_data['y'].ravel()
yval = raw_data['yval'].ravel()

C_values = [0.001, 0.003, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]

best_score = 0
best_params = {'C': None, 'gamma': None}

for C in C_values:
    for gamma in gamma_values:
        svc = svm.SVC(C=C, gamma=gamma)
        svc.fit(X, y)
        score = svc.score(Xval, yval)

        if score > best_score:
            best_score = score
            best_params['C'] = C
            best_params['gamma'] = gamma

print(best_params, best_score)

{'C': 0.3, 'gamma': 100} 0.965
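
The manual double loop can also be written with scikit-learn's GridSearchCV. The sketch below is an alternative I added, not part of the original exercise: it uses PredefinedSplit so that the provided validation set (Xval, yval) is scored, rather than cross-validation folds.

```python
# Alternative sketch: GridSearchCV with a predefined train/validation split
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X_all = np.vstack([X, Xval])
y_all = np.concatenate([y, yval])
# -1 marks training rows (never used for scoring), 0 marks validation rows
test_fold = np.concatenate([-np.ones(len(X)), np.zeros(len(Xval))])

search = GridSearchCV(svm.SVC(),
                      param_grid={'C': C_values, 'gamma': gamma_values},
                      cv=PredefinedSplit(test_fold))
search.fit(X_all, y_all)
print(search.best_params_, search.best_score_)
```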

Spam filtering

Now we move on to the second part of the exercise. Here the goal is to use an SVM to build a spam filter. The exercise text includes a task that involves some text preprocessing to get the data into a format the SVM can work with. However, that task is fairly simple (mapping each word to an ID in the dictionary provided with the exercise), and the remaining preprocessing steps (HTML removal, stemming, normalization, etc.) have already been done. Rather than reproducing those preprocessing steps, I will skip straight to the machine-learning task: building a classifier from the preprocessed training set, and evaluating it on a test set in which spam and non-spam emails have been converted into vectors of word occurrences.
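For reference, the skipped mapping step looks roughly like the sketch below: each word of an already-cleaned email is looked up in the exercise vocabulary and turned into a 1899-dimensional binary feature vector. The file name `data/vocab.txt` and the whitespace tokenization are my assumptions, not part of the original code.

```python
# Rough sketch of the skipped preprocessing step (assumed file name and tokenization)
def email_to_feature_vector(email_text, vocab_path='data/vocab.txt'):
    vocab = {}
    with open(vocab_path) as f:
        for line in f:
            idx, word = line.split()
            vocab[word] = int(idx) - 1   # the exercise vocabulary is 1-based
    x = np.zeros(len(vocab))
    for word in email_text.lower().split():
        if word in vocab:
            x[vocab[word]] = 1           # mark the word as present
    return x
```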

# Spam filtering
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

mat_tr = loadmat('data/spamTrain.mat')
X, y = mat_tr.get('X'), mat_tr.get('y').ravel()
print(X.shape, y.shape)  # (4000, 1899) (4000,)

mat_test = loadmat('data/spamTest.mat')
test_X, test_y = mat_test.get('Xtest'), mat_test.get('ytest').ravel()
print(test_X.shape, test_y.shape)  # (1000, 1899) (1000,)

svc = svm.SVC()
svc.fit(X, y)
pred = svc.predict(test_X)
print(metrics.classification_report(test_y, pred))

             precision    recall  f1-score   support

          0       0.94      0.99      0.97       692
          1       0.98      0.87      0.92       308

avg / total       0.95      0.95      0.95      1000

This result was obtained with the SVM's default parameters.

For comparison, logistic regression reaches a precision of about 99%.

# What about logistic regression?
logit = LogisticRegression()
logit.fit(X, y)
pred = logit.predict(test_X)
print(metrics.classification_report(test_y, pred))

             precision    recall  f1-score   support

          0       1.00      0.99      0.99       692
          1       0.97      0.99      0.98       308

avg / total       0.99      0.99      0.99      1000

After tuning its parameters, the SVM can reach the same precision as logistic regression.

svc = svm.SVC(C=100)
svc.fit(X, y)
pred = svc.predict(test_X)
print(metrics.classification_report(test_y, pred))

             precision    recall  f1-score   support

          0       1.00      0.99      0.99       692
          1       0.97      0.99      0.98       308

avg / total       0.99      0.99      0.99      1000