# Data analysis libraries
train_df = pd.read_csv('/train.csv')
train_df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00000 | 1 | 0 | A/5 21171 | 7.25000 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.00000 | 1 | 0 | PC 17599 | 71.28330 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.00000 | 0 | 0 | STON/O2. 3101282 | 7.92500 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00000 | 1 | 0 | 113803 | 53.10000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.00000 | 0 | 0 | 373450 | 8.05000 | NaN | S |
test_df.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
0 | 892 | 3 | Kelly, Mr. James | male | 34.50000 | 0 | 0 | 330911 | 7.82920 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.00000 | 1 | 0 | 363272 | 7.00000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.00000 | 0 | 0 | 240276 | 9.68750 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.00000 | 0 | 0 | 315154 | 8.66250 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.00000 | 1 | 1 | 3101298 | 12.28750 | NaN | S |
train_df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
count | 891.00000 | 891.00000 | 891.00000 | 714.00000 | 891.00000 | 891.00000 | 891.00000 |
mean | 446.00000 | 0.38384 | 2.30864 | 29.69912 | 0.52301 | 0.38159 | 32.20421 |
std | 257.35384 | 0.48659 | 0.83607 | 14.52650 | 1.10274 | 0.80606 | 49.69343 |
min | 1.00000 | 0.00000 | 1.00000 | 0.42000 | 0.00000 | 0.00000 | 0.00000 |
25% | 223.50000 | 0.00000 | 2.00000 | 20.12500 | 0.00000 | 0.00000 | 7.91040 |
50% | 446.00000 | 0.00000 | 3.00000 | 28.00000 | 0.00000 | 0.00000 | 14.45420 |
75% | 668.50000 | 1.00000 | 3.00000 | 38.00000 | 1.00000 | 0.00000 | 31.00000 |
max | 891.00000 | 1.00000 | 3.00000 | 80.00000 | 8.00000 | 6.00000 | 512.32920 |
test_df.describe()
PassengerId | Pclass | Age | SibSp | Parch | Fare | |
count | 418.00000 | 418.00000 | 332.00000 | 418.00000 | 418.00000 | 417.00000 |
mean | 1100.50000 | 2.26555 | 30.27259 | 0.44737 | 0.39234 | 35.62719 |
std | 120.81046 | 0.84184 | 14.18121 | 0.89676 | 0.98143 | 55.90758 |
min | 892.00000 | 1.00000 | 0.17000 | 0.00000 | 0.00000 | 0.00000 |
25% | 996.25000 | 1.00000 | 21.00000 | 0.00000 | 0.00000 | 7.89580 |
50% | 1100.50000 | 3.00000 | 27.00000 | 0.00000 | 0.00000 | 14.45420 |
75% | 1204.75000 | 3.00000 | 39.00000 | 1.00000 | 0.00000 | 31.50000 |
max | 1309.00000 | 3.00000 | 76.00000 | 8.00000 | 9.00000 | 512.32920 |
The training set has twelve columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. Survived, which is absent from the test set, is the value we want to predict.
train_df.info()
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
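Inspection shows that Age, Cabin, and Embarked have missing values in the training set, and that the columns mix numeric and string types. The test set is also missing Age and Cabin values, plus a single Fare value.
Name is unique across all 891 rows.
Sex takes two values; male accounts for about 64.8% (top=male, freq/count = 577/891).
Ticket has many distinct values.
Cabin is often shared: many passengers have the same cabin value.
Embarked takes three values, and most passengers embarked at S.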
train_df.describe(include=["O"])
Name | Sex | Ticket | Cabin | Embarked | |
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Renouf, Mr. Peter Henry | male | 1601 | C23 C25 C27 | S |
freq | 1 | 577 | 7 | 4 | 644 |
Based on our assumptions and decisions, we drop Cabin and Ticket for now.
print("Before", train_df.shape, test_df.shape)
Before (891, 12) (418, 11)
('After', (891, 10), (418, 9), (891, 10), (418, 9))
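The cell above is truncated, so here is a minimal sketch of what dropping the two columns could look like. The tiny frames are stand-ins for the real `train.csv` and `test.csv`, and rebuilding `combine` after the drop is an assumption about how the two frames are processed together.

```python
import pandas as pd

# Stand-in frames; the real notebook loads train.csv / test.csv instead.
train_df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.2833],
})
test_df = pd.DataFrame({
    "PassengerId": [892, 893],
    "Ticket": ["330911", "363272"],
    "Cabin": [None, None],
    "Fare": [7.8292, 7.0],
})

print("Before", train_df.shape, test_df.shape)

# Drop the same columns from both frames so they stay aligned.
train_df = train_df.drop(["Ticket", "Cabin"], axis=1)
test_df = test_df.drop(["Ticket", "Cabin"], axis=1)

# Rebuild the list used to apply later steps to both frames (assumed helper).
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape)
```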
First, note that every Name is unique and that each name contains a title in the middle. We extract that title from the name.
for dataset in combine:
Sex | female | male |
Title | ||
Capt | 0 | 1 |
Col | 0 | 2 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dr | 1 | 6 |
... | ... | ... |
Mr | 0 | 517 |
Mrs | 125 | 0 |
Ms | 1 | 0 |
Rev | 0 | 6 |
Sir | 0 | 1 |
17 rows × 2 columns
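One plausible way to pull out the title and build a Title-by-Sex table like the one above is a regular expression plus `pd.crosstab`. The pattern below is an assumption (letters that follow a space and end with a period), not necessarily the one the original cell used.

```python
import pandas as pd

train_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "male"],
})

# Capture the run of letters that follows a space and ends with a period.
train_df["Title"] = train_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Count how often each title occurs for each sex.
title_by_sex = pd.crosstab(train_df["Title"], train_df["Sex"])
print(title_by_sex)
```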
Replace titles with more common equivalents, and group the uncommon ones as Rare.
for dataset in combine:
Title | Survived | |
0 | Master | 0.57500 |
1 | Miss | 0.70270 |
2 | Mr | 0.15667 |
3 | Mrs | 0.79365 |
4 | Rare | 0.34783 |
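A hedged sketch of the grouping step follows. Which titles count as rare, and which variants are folded into Miss or Mrs, is a modelling choice. The lists below are assumptions for illustration and use only a subset of the titles.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Mr", "Mlle", "Ms", "Mme", "Col", "Master", "Miss", "Dr"],
    "Survived": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Assumed grouping of uncommon titles into a single "Rare" bucket.
rare_titles = ["Capt", "Col", "Countess", "Don", "Dr", "Rev", "Sir"]
df["Title"] = df["Title"].replace(rare_titles, "Rare")

# Assumed normalization of variant spellings to the common form.
df["Title"] = df["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# Mean survival per title, as in the table above.
survival_by_title = df[["Title", "Survived"]].groupby("Title", as_index=False).mean()
print(survival_by_title)
```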
Convert the titles to numbers.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00000 | 1 | 0 | 7.25000 | S | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.00000 | 1 | 0 | 71.28330 | C | 3 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.00000 | 0 | 0 | 7.92500 | S | 2 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00000 | 1 | 0 | 53.10000 | S | 3 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.00000 | 0 | 0 | 8.05000 | S | 1 |
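Applying `title_mapping` could look like the sketch below. Filling unmapped or missing titles with 0 is an assumption. The source does not show how such rows are handled.

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

df = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", None, "Rare"]})

# map() yields NaN for anything not in the dictionary; 0 is an assumed placeholder.
df["Title"] = df["Title"].map(title_mapping).fillna(0).astype(int)
print(df)
```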
Name can now be dropped from both sets, and PassengerId from the training set.
train_df = train_df.drop(["Name", "PassengerId"], axis=1)
((891, 9), (418, 9))
Next, convert the Sex feature to a numeric one, where female = 1 and male = 0.
for dataset in combine:
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 0 | 3 | 0 | 22.00000 | 1 | 0 | 7.25000 | S | 1 |
1 | 1 | 1 | 1 | 38.00000 | 1 | 0 | 71.28330 | C | 3 |
2 | 1 | 3 | 1 | 26.00000 | 0 | 0 | 7.92500 | S | 2 |
3 | 1 | 1 | 1 | 35.00000 | 1 | 0 | 53.10000 | S | 3 |
4 | 0 | 3 | 0 | 35.00000 | 0 | 0 | 8.05000 | S | 1 |
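The encoding described above (female = 1, male = 0) can be sketched with `Series.map`. The loop over `combine` mirrors the truncated cell, and the two small frames are stand-ins.

```python
import pandas as pd

train_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
test_df = pd.DataFrame({"Sex": ["male", "female"]})
combine = [train_df, test_df]

for dataset in combine:
    # Modify each frame in place so train_df and test_df both see the change.
    dataset["Sex"] = dataset["Sex"].map({"female": 1, "male": 0}).astype(int)

print(train_df)
```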
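We start with Age.
We guess the missing values from other correlated features. In this case we note the correlation among Age, Gender, and Pclass, so we fill each missing Age with the median Age of the matching Pclass and Gender combination: the median Age for Pclass = 1 and Gender = 0, for Pclass = 1 and Gender = 1, and so on.
grid = sns.FacetGrid(train_df, row="Pclass", col="Sex", height=2.2, aspect=1.6)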
<seaborn.axisgrid.FacetGrid at 0x7f3b28420a90>
First prepare an empty array, on the assumption that Age is related to the Pclass × Gender combination.
guess_ages = np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
Now we iterate over Sex (0, 1) and Pclass (1, 2, 3) to guess Age for these six combinations.
for dataset in combine:
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 0 | 3 | 0 | 22 | 1 | 0 | 7.25000 | S | 1 |
1 | 1 | 1 | 1 | 38 | 1 | 0 | 71.28330 | C | 3 |
2 | 1 | 3 | 1 | 26 | 0 | 0 | 7.92500 | S | 2 |
3 | 1 | 1 | 1 | 35 | 1 | 0 | 53.10000 | S | 3 |
4 | 0 | 3 | 0 | 35 | 0 | 0 | 8.05000 | S | 1 |
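The loop body is not shown above, so the following is only a sketch of the idea under stated assumptions: compute the median Age of each (Sex, Pclass) cell, store it in `guess_ages`, and write it into the rows of that cell whose Age is missing. Any rounding the original may apply to the guesses is not reproduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1, 0, 1],
    "Pclass": [1, 1, 1, 3, 3, 3, 2, 2],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0, 28.0, 24.0],
})

guess_ages = np.zeros((2, 3))

# Pass 1: median of the known ages for each Sex x Pclass combination.
for sex in range(2):
    for pclass in range(1, 4):
        known = df.loc[(df["Sex"] == sex) & (df["Pclass"] == pclass), "Age"].dropna()
        if not known.empty:
            guess_ages[sex, pclass - 1] = known.median()

# Pass 2: fill the missing ages with the guess for their own combination.
for sex in range(2):
    for pclass in range(1, 4):
        missing = df["Age"].isnull() & (df["Sex"] == sex) & (df["Pclass"] == pclass)
        df.loc[missing, "Age"] = guess_ages[sex, pclass - 1]

df["Age"] = df["Age"].astype(int)
print(df)
```

Computing all guesses before filling keeps the medians based only on observed ages.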
Let us create AgeBand and examine how it relates to survival.
# Split Age into 5 bins
AgeBand | Survived | |
0 | (-0.08, 16.0] | 0.55000 |
1 | (16.0, 32.0] | 0.33737 |
2 | (32.0, 48.0] | 0.41204 |
3 | (48.0, 64.0] | 0.43478 |
4 | (64.0, 80.0] | 0.09091 |
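The equal-width bands in the table are what `pd.cut` produces when asked for five bins. The sketch below uses made-up ages spanning 0 to 80, so the band edges match the table while the survival rates do not.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [0, 10, 22, 38, 35, 54, 70, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Five equal-width bins over the observed range of Age.
df["AgeBand"] = pd.cut(df["Age"], 5)

# Mean survival per band; observed=False keeps every band in the result.
band_survival = df.groupby("AgeBand", observed=False)["Survived"].mean()
print(band_survival)
```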
Use these bands to replace the original Age with five ordinal values.
for dataset in combine:
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | AgeBand | |
0 | 0 | 3 | 0 | 1 | 1 | 0 | 7.25000 | S | 1 | (16.0, 32.0] |
1 | 1 | 1 | 1 | 2 | 1 | 0 | 71.28330 | C | 3 | (32.0, 48.0] |
2 | 1 | 3 | 1 | 1 | 0 | 0 | 7.92500 | S | 2 | (16.0, 32.0] |
3 | 1 | 1 | 1 | 2 | 1 | 0 | 53.10000 | S | 3 | (32.0, 48.0] |
4 | 0 | 3 | 0 | 2 | 0 | 0 | 8.05000 | S | 1 | (32.0, 48.0] |
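One way to turn Age into the ordinal codes 0 to 4 is to reuse the band edges (16, 32, 48, 64) with explicit labels. This uses `pd.cut` with fixed edges rather than whatever the truncated loop does, so treat it as an equivalent sketch, not the original code.

```python
import pandas as pd

df = pd.DataFrame({"Age": [5, 16, 22, 38, 54, 80]})

# Edges taken from the AgeBand table; intervals are closed on the right.
bins = [-1, 16, 32, 48, 64, float("inf")]
df["Age"] = pd.cut(df["Age"], bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(df)
```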
Remove the AgeBand feature.
train_df = train_df.drop(["AgeBand"], axis=1)
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 0 | 3 | 0 | 1 | 1 | 0 | 7.25000 | S | 1 |
1 | 1 | 1 | 1 | 2 | 1 | 0 | 71.28330 | C | 3 |
2 | 1 | 3 | 1 | 1 | 0 | 0 | 7.92500 | S | 2 |
3 | 1 | 1 | 1 | 2 | 1 | 0 | 53.10000 | S | 3 |
4 | 0 | 3 | 0 | 2 | 0 | 0 | 8.05000 | S | 1 |
Embarked takes the values S, Q, and C. Our training set has two missing values, which we fill with the most common port (mode imputation).
freq_port = train_df.Embarked.dropna().mode()[0]  # the most frequent value (the mode)
for dataset in combine:
Embarked | Survived | |
0 | C | 0.55357 |
1 | Q | 0.38961 |
2 | S | 0.33901 |
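A minimal sketch of mode imputation on a stand-in column:

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() returns a Series because ties are possible; take the first entry.
freq_port = df["Embarked"].dropna().mode()[0]

df["Embarked"] = df["Embarked"].fillna(freq_port)
print(freq_port, df["Embarked"].tolist())
```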
Now we can convert Embarked to a numeric feature.
for dataset in combine:
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 0 | 3 | 0 | 1 | 1 | 0 | 7.25000 | 0 | 1 |
1 | 1 | 1 | 1 | 2 | 1 | 0 | 71.28330 | 1 | 3 |
2 | 1 | 3 | 1 | 1 | 0 | 0 | 7.92500 | 0 | 2 |
3 | 1 | 1 | 1 | 2 | 1 | 0 | 53.10000 | 0 | 3 |
4 | 0 | 3 | 0 | 2 | 0 | 0 | 8.05000 | 0 | 1 |
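Judging from the tables before and after this step, the encoding appears to be S = 0, C = 1, Q = 2. The sketch below assumes that mapping.

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# Mapping inferred from the displayed rows; treat it as an assumption.
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
print(df)
```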
Now we can fill the single missing Fare value in the test dataset with the median, using fillna. This takes just one line of code.
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())
PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 892 | 3 | 0 | 2 | 0 | 0 | 7.82920 | 2 | 1 |
1 | 893 | 3 | 1 | 2 | 1 | 0 | 7.00000 | 0 | 3 |
2 | 894 | 2 | 0 | 3 | 0 | 0 | 9.68750 | 2 | 1 |
3 | 895 | 3 | 0 | 1 | 0 | 0 | 8.66250 | 0 | 1 |
4 | 896 | 3 | 1 | 1 | 1 | 1 | 12.28750 | 0 | 3 |
Create a FareBand.
train_df["FareBand"] = pd.qcut(train_df["Fare"], 4)
FareBand | Survived | |
0 | (-0.001, 7.91] | 0.19731 |
1 | (7.91, 14.454] | 0.30357 |
2 | (14.454, 31.0] | 0.45495 |
3 | (31.0, 512.329] | 0.58108 |
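Unlike `pd.cut`, which makes equal-width bins, `pd.qcut` picks edges at quantiles so that each band holds roughly the same number of rows. That suits a skewed feature like Fare. A small illustration on made-up fares:

```python
import pandas as pd

df = pd.DataFrame({"Fare": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})

# Four quantile-based bins (quartiles).
df["FareBand"] = pd.qcut(df["Fare"], 4)

# Rows per band, in band order rather than sorted by count.
counts = df["FareBand"].value_counts(sort=False)
print(counts)
```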
Convert Fare to ordinal values based on FareBand.
for dataset in combine:
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
1 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 1 | 3 |
2 | 1 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 2 |
3 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 0 | 3 |
4 | 0 | 3 | 0 | 2 | 0 | 0 | 1 | 0 | 1 |
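As with Age, the FareBand edges (7.91, 14.454, 31) can be reused to produce the codes 0 to 3. This sketch uses `pd.cut` with fixed edges, which may differ from the truncated loop, and the five fares are the ones shown in the earlier tables.

```python
import pandas as pd

df = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]})

# Edges taken from the FareBand table; intervals are closed on the right.
bins = [-float("inf"), 7.91, 14.454, 31.0, float("inf")]
df["Fare"] = pd.cut(df["Fare"], bins=bins, labels=[0, 1, 2, 3]).astype(int)
print(df)
```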
X_train = train_df.drop("Survived", axis=1)
((891, 8), (891,), (418, 8))
X_train.describe()
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
0 | 3 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
1 | 3 | 1 | 2 | 1 | 0 | 0 | 0 | 3 |
2 | 2 | 0 | 3 | 0 | 0 | 1 | 2 | 1 |
3 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
4 | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
414 | 1 | 1 | 2 | 0 | 0 | 3 | 1 | 5 |
415 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
416 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
417 | 3 | 0 | 1 | 1 | 1 | 2 | 1 | 4 |
418 rows × 8 columns
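Our problem is a classification and regression problem. We want to identify the relationship between the output (Survived or not) and the other variables or features (Gender, Age, Port, ...). This is supervised learning. With these two criteria, supervised learning plus classification and regression, we can narrow the choice of models to a few:
- Logistic Regression
- KNN (k-Nearest Neighbors)
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Stochastic Gradient Descent
- RVM (Relevance Vector Machine)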
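# Logistic Regression
Sex has the largest positive coefficient, which means that as the Sex value increases (male: 0, female: 1), the probability of survival increases the most.
Conversely, as Pclass increases, the probability of survival decreases.
Age is a useful feature to model, as it has the second-largest negative coefficient.
Title has the second-largest positive coefficient.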
coeff_df = pd.DataFrame(train_df.columns.delete(0))
Feature | Correlation | |
1 | Sex | 2.19360 |
7 | Title | 0.49431 |
5 | Fare | 0.31180 |
6 | Embarked | 0.24051 |
4 | Parch | -0.25322 |
3 | SibSp | -0.50649 |
2 | Age | -0.65716 |
0 | Pclass | -0.91037 |
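A table like the one above can be built from the `coef_` attribute of a fitted scikit-learn `LogisticRegression`. The data below is a made-up eight-row sample with two features, so only the signs of the coefficients are meaningful, not their sizes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: females (Sex = 1) and lower Pclass values survive more often.
X = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass": [3, 3, 2, 1, 3, 2, 1, 1],
})
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

logreg = LogisticRegression()
logreg.fit(X, y)

# One coefficient per feature; coef_ has shape (1, n_features) for binary targets.
coeff_df = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": logreg.coef_[0],
}).sort_values(by="Coefficient", ascending=False)
print(coeff_df)
```

A positive coefficient raises the log-odds of survival as the feature grows, and a negative one lowers it.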
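# Support Vector Machines
In pattern recognition, the k-nearest neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors: it is assigned to the most common class among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.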
# KNN
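To make the majority-vote rule concrete, here is a from-scratch sketch in NumPy. The function name `knn_predict` and the toy points are illustrative. In practice a library class such as scikit-learn's `KNeighborsClassifier` would normally be used instead.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))
```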
# Gaussian Naive Bayes
# Perceptron
# Linear SVC
# Stochastic Gradient Descent
# Decision Tree
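The next model, Random Forest, is one of the most popular. It is a classifier made up of many decision trees, and its output class is the mode of the classes output by the individual trees.
# Random Forest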
models = pd.DataFrame({
Model | Score | |
3 | Random Forest | 89.00000 |
8 | Decision Tree | 89.00000 |
1 | KNN | 84.40000 |
0 | Support Vector Machines | 83.73000 |
2 | Logistic Regression | 81.37000 |
7 | Linear SVC | 81.14000 |
6 | Stochastic Gradient Descent | 80.92000 |
5 | Perceptron | 80.58000 |
4 | Naive Bayes | 80.13000 |
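A ranking table of this kind can be assembled by fitting each candidate and collecting its score. The sketch below covers only three of the models on a made-up sample and scores them on the training data itself, which is an assumption about how the figures above were obtained. Training accuracy tends to flatter tree-based models, so a held-out set or cross-validation gives a fairer comparison.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_train = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass": [3, 3, 2, 1, 3, 2, 1, 1],
})
Y_train = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, Y_train)
    # Accuracy on the training data, expressed as a percentage.
    scores[name] = round(model.score(X_train, Y_train) * 100, 2)

models = pd.DataFrame({
    "Model": list(scores.keys()),
    "Score": list(scores.values()),
}).sort_values(by="Score", ascending=False)
print(models)
```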
# Save the results