Kaggle Competition 的练习
1 | # 数据分析库 |
1 | train_df = pd.read_csv('/train.csv') |
((1460, 81), (1459, 80))
1 | train_df.head() |
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | ... | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.00000 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.00000 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | ... | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.00000 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.00000 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.00000 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | ... | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.00000 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.00000 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.00000 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | ... | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.00000 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.00000 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.00000 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | ... | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.00000 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.00000 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.00000 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | ... | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.00000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
1 | # test_df.head() |
1 | train_df.describe() |
Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.00000 | 1460.00000 | 1201.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1452.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1379.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 |
mean | 730.50000 | 56.89726 | 70.04996 | 10516.82808 | 6.09932 | 5.57534 | 1971.26781 | 1984.86575 | 103.68526 | 443.63973 | 46.54932 | 567.24041 | 1057.42945 | 1162.62671 | 346.99247 | 5.84452 | 1515.46370 | 0.42534 | 0.05753 | 1.56507 | 0.38288 | 2.86644 | 1.04658 | 6.51781 | 0.61301 | 1978.50616 | 1.76712 | 472.98014 | 94.24452 | 46.66027 | 21.95411 | 3.40959 | 15.06096 | 2.75890 | 43.48904 | 6.32192 | 2007.81575 | 180921.19589 |
std | 421.61001 | 42.30057 | 24.28475 | 9981.26493 | 1.38300 | 1.11280 | 30.20290 | 20.64541 | 181.06621 | 456.09809 | 161.31927 | 441.86696 | 438.70532 | 386.58774 | 436.52844 | 48.62308 | 525.48038 | 0.51891 | 0.23875 | 0.55092 | 0.50289 | 0.81578 | 0.22034 | 1.62539 | 0.64467 | 24.68972 | 0.74732 | 213.80484 | 125.33879 | 66.25603 | 61.11915 | 29.31733 | 55.75742 | 40.17731 | 496.12302 | 2.70363 | 1.32810 | 79442.50288 |
min | 1.00000 | 20.00000 | 21.00000 | 1300.00000 | 1.00000 | 1.00000 | 1872.00000 | 1950.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 334.00000 | 0.00000 | 0.00000 | 334.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 2.00000 | 0.00000 | 1900.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 2006.00000 | 34900.00000 |
25% | 365.75000 | 20.00000 | 59.00000 | 7553.50000 | 5.00000 | 5.00000 | 1954.00000 | 1967.00000 | 0.00000 | 0.00000 | 0.00000 | 223.00000 | 795.75000 | 882.00000 | 0.00000 | 0.00000 | 1129.50000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 5.00000 | 0.00000 | 1961.00000 | 1.00000 | 334.50000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 5.00000 | 2007.00000 | 129975.00000 |
50% | 730.50000 | 50.00000 | 69.00000 | 9478.50000 | 6.00000 | 5.00000 | 1973.00000 | 1994.00000 | 0.00000 | 383.50000 | 0.00000 | 477.50000 | 991.50000 | 1087.00000 | 0.00000 | 0.00000 | 1464.00000 | 0.00000 | 0.00000 | 2.00000 | 0.00000 | 3.00000 | 1.00000 | 6.00000 | 1.00000 | 1980.00000 | 2.00000 | 480.00000 | 0.00000 | 25.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 6.00000 | 2008.00000 | 163000.00000 |
75% | 1095.25000 | 70.00000 | 80.00000 | 11601.50000 | 7.00000 | 6.00000 | 2000.00000 | 2004.00000 | 166.00000 | 712.25000 | 0.00000 | 808.00000 | 1298.25000 | 1391.25000 | 728.00000 | 0.00000 | 1776.75000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 3.00000 | 1.00000 | 7.00000 | 1.00000 | 2002.00000 | 2.00000 | 576.00000 | 168.00000 | 68.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 8.00000 | 2009.00000 | 214000.00000 |
max | 1460.00000 | 190.00000 | 313.00000 | 215245.00000 | 10.00000 | 9.00000 | 2010.00000 | 2010.00000 | 1600.00000 | 5644.00000 | 1474.00000 | 2336.00000 | 6110.00000 | 4692.00000 | 2065.00000 | 572.00000 | 5642.00000 | 3.00000 | 2.00000 | 3.00000 | 2.00000 | 8.00000 | 3.00000 | 14.00000 | 3.00000 | 2010.00000 | 4.00000 | 1418.00000 | 857.00000 | 547.00000 | 552.00000 | 508.00000 | 480.00000 | 738.00000 | 15500.00000 | 12.00000 | 2010.00000 | 755000.00000 |
1 | # test_df.describe() |
1 | train_df.info() |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
__________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
Id 1459 non-null int64
MSSubClass 1459 non-null int64
MSZoning 1455 non-null object
LotFrontage 1232 non-null float64
LotArea 1459 non-null int64
Street 1459 non-null object
Alley 107 non-null object
LotShape 1459 non-null object
LandContour 1459 non-null object
Utilities 1457 non-null object
LotConfig 1459 non-null object
LandSlope 1459 non-null object
Neighborhood 1459 non-null object
Condition1 1459 non-null object
Condition2 1459 non-null object
BldgType 1459 non-null object
HouseStyle 1459 non-null object
OverallQual 1459 non-null int64
OverallCond 1459 non-null int64
YearBuilt 1459 non-null int64
YearRemodAdd 1459 non-null int64
RoofStyle 1459 non-null object
RoofMatl 1459 non-null object
Exterior1st 1458 non-null object
Exterior2nd 1458 non-null object
MasVnrType 1443 non-null object
MasVnrArea 1444 non-null float64
ExterQual 1459 non-null object
ExterCond 1459 non-null object
Foundation 1459 non-null object
BsmtQual 1415 non-null object
BsmtCond 1414 non-null object
BsmtExposure 1415 non-null object
BsmtFinType1 1417 non-null object
BsmtFinSF1 1458 non-null float64
BsmtFinType2 1417 non-null object
BsmtFinSF2 1458 non-null float64
BsmtUnfSF 1458 non-null float64
TotalBsmtSF 1458 non-null float64
Heating 1459 non-null object
HeatingQC 1459 non-null object
CentralAir 1459 non-null object
Electrical 1459 non-null object
1stFlrSF 1459 non-null int64
2ndFlrSF 1459 non-null int64
LowQualFinSF 1459 non-null int64
GrLivArea 1459 non-null int64
BsmtFullBath 1457 non-null float64
BsmtHalfBath 1457 non-null float64
FullBath 1459 non-null int64
HalfBath 1459 non-null int64
BedroomAbvGr 1459 non-null int64
KitchenAbvGr 1459 non-null int64
KitchenQual 1458 non-null object
TotRmsAbvGrd 1459 non-null int64
Functional 1457 non-null object
Fireplaces 1459 non-null int64
FireplaceQu 729 non-null object
GarageType 1383 non-null object
GarageYrBlt 1381 non-null float64
GarageFinish 1381 non-null object
GarageCars 1458 non-null float64
GarageArea 1458 non-null float64
GarageQual 1381 non-null object
GarageCond 1381 non-null object
PavedDrive 1459 non-null object
WoodDeckSF 1459 non-null int64
OpenPorchSF 1459 non-null int64
EnclosedPorch 1459 non-null int64
3SsnPorch 1459 non-null int64
ScreenPorch 1459 non-null int64
PoolArea 1459 non-null int64
PoolQC 3 non-null object
Fence 290 non-null object
MiscFeature 51 non-null object
MiscVal 1459 non-null int64
MoSold 1459 non-null int64
YrSold 1459 non-null int64
SaleType 1458 non-null object
SaleCondition 1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB
1 | train_df.describe(include="O") |
MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | Heating | HeatingQC | CentralAir | Electrical | KitchenQual | Functional | FireplaceQu | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460 | 1460 | 91 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1452 | 1460 | 1460 | 1460 | 1423 | 1423 | 1422 | 1423 | 1422 | 1460 | 1460 | 1460 | 1459 | 1460 | 1460 | 770 | 1379 | 1379 | 1379 | 1379 | 1460 | 7 | 281 | 54 | 1460 | 1460 |
unique | 5 | 2 | 2 | 4 | 4 | 2 | 5 | 3 | 25 | 9 | 8 | 5 | 8 | 6 | 8 | 15 | 16 | 4 | 4 | 5 | 6 | 4 | 4 | 4 | 6 | 6 | 6 | 5 | 2 | 5 | 4 | 7 | 5 | 6 | 3 | 5 | 5 | 3 | 3 | 4 | 4 | 9 | 6 |
top | RL | Pave | Grvl | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | None | TA | TA | PConc | TA | TA | No | Unf | Unf | GasA | Ex | Y | SBrkr | TA | Typ | Gd | Attchd | Unf | TA | TA | Y | Gd | MnPrv | Shed | WD | Normal |
freq | 1151 | 1454 | 50 | 925 | 1311 | 1459 | 1052 | 1382 | 225 | 1260 | 1445 | 1220 | 726 | 1141 | 1434 | 515 | 504 | 864 | 906 | 1282 | 647 | 649 | 1311 | 953 | 430 | 1256 | 1428 | 741 | 1365 | 1334 | 735 | 1360 | 380 | 870 | 605 | 1311 | 1326 | 1340 | 3 | 157 | 49 | 1267 | 1198 |
1 |
分析概要
Feature | Status | Dispose |
---|---|---|
Alley | 缺失比较多 | 删除 |
PoolQC | 只有七家有游泳池并且和 PoolArea 相关 | 先不填充 删除 |
Fence | 栏杆质量只有20%的有 | 缺失的填充为没有 |
MiscFeature | 其他项目也只有好少的房子有 | 先不填充 删除 |
FireplaceQu | 有一半家没有壁炉 | 填 0 |
Garagetype | 空代表没有 | 填 0 |
Garagefinish | 空代表没有 | 填 0 |
Garagequal | 空代表没有 | 填 0 |
Garagecond | 空代表没有 | 填 0 |
LotFrontage | 和物业相连的街道有1/3缺失 | 没想到太好的填充方法 删除 |
整理 description 文件
数据描述文件记录了所有特征所代表的含义,其中许多特征是字符串,现在我们要整理为个字典,便于我们查询。
1 | description_dict = {} |
1 | for i in description_data: |
{'MSSubClass': ['20', '30', '40', '45', '50', '60', '70', '75', '80', '85', '90', '120', '150', '160', '180', '190'], 'MSZoning': ['A', 'C', 'FV', 'I', 'RH', 'RL', 'RP', 'RM'], 'LotFrontage': [], 'LotArea': [], 'Street': ['Grvl', 'Pave'], 'Alley': ['Grvl', 'Pave', 'NA'], 'LotShape': ['Reg', 'IR1', 'IR2', 'IR3'], 'LandContour': ['Lvl', 'Bnk', 'HLS', 'Low'], 'Utilities': ['AllPub', 'NoSewr', 'NoSeWa', 'ELO'], 'LotConfig': ['Inside', 'Corner', 'CulDSac', 'FR2', 'FR3'], 'LandSlope': ['Gtl', 'Mod', 'Sev'], 'Neighborhood': ['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel', 'Names', 'NoRidge', 'NPkVill', 'NridgHt', 'NWAmes', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker'], 'Condition1': ['Artery', 'Feedr', 'Norm', 'RRNn', 'RRAn', 'PosN', 'PosA', 'RRNe', 'RRAe'], 'Condition2': ['Artery', 'Feedr', 'Norm', 'RRNn', 'RRAn', 'PosN', 'PosA', 'RRNe', 'RRAe'], 'BldgType': ['1Fam', '2FmCon', 'Duplx', 'TwnhsE', 'TwnhsI'], 'HouseStyle': ['1Story'], ' 1.5Fin\tOne and one-half story': [], ' 1.5Unf\tOne and one-half story': ['2Story'], ' 2.5Fin\tTwo and one-half story': [], ' 2.5Unf\tTwo and one-half story': ['SFoyer', 'SLvl'], 'OverallQual': ['10', '9', '8', '7', '6', '5', '4', '3', '2', '1'], 'OverallCond': ['10', '9', '8', '7', '6', '5', '4', '3', '2', '1'], 'YearBuilt': [], 'YearRemodAdd': [], 'RoofStyle': ['Flat', 'Gable', 'Gambrel', 'Hip', 'Mansard', 'Shed'], 'RoofMatl': ['ClyTile', 'CompShg', 'Membran', 'Metal', 'Roll', 'Tar&Grv', 'WdShake', 'WdShngl'], 'Exterior1st': ['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd', 'HdBoard', 'ImStucc', 'MetalSd', 'Other', 'Plywood', 'PreCast', 'Stone', 'Stucco', 'VinylSd', 'Wd', 'WdShing'], 'Exterior2nd': ['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd', 'HdBoard', 'ImStucc', 'MetalSd', 'Other', 'Plywood', 'PreCast', 'Stone', 'Stucco', 'VinylSd', 'Wd', 'WdShing'], 'MasVnrType': ['BrkCmn', 'BrkFace', 'CBlock', 'None', 'Stone'], 'MasVnrArea': [], 'ExterQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po'], 'ExterCond': ['Ex', 'Gd', 'TA', 'Fa', 'Po'], 'Foundation': ['BrkTil', 'CBlock', 'PConc', 'Slab', 'Stone', 'Wood'], 'BsmtQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'], 'BsmtCond': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'], 'BsmtExposure': ['Gd', 'Av', 'Mn', 'No', 'NA'], 'BsmtFinType1': ['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'NA'], 'BsmtFinSF1': [], 'BsmtFinType2': ['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'NA'], 'BsmtFinSF2': [], 'BsmtUnfSF': [], 'TotalBsmtSF': [], 'Heating': ['Floor', 'GasA', 'GasW', 'Grav', 'OthW', 'Wall'], 'HeatingQC': ['Ex', 'Gd', 'TA', 'Fa', 'Po'], 'CentralAir': ['N', 'Y'], 'Electrical': ['SBrkr', 'FuseA', 'FuseF', 'FuseP', 'Mix'], '1stFlrSF': [], '2ndFlrSF': [], 'LowQualFinSF': [], 'GrLivArea': [], 'BsmtFullBath': [], 'BsmtHalfBath': [], 'FullBath': [], 'HalfBath': [], 'Bedroom': [], 'Kitchen': [], 'KitchenQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po'], 'TotRmsAbvGrd': [], 'Functional': ['Typ', 'Min1', 'Min2', 'Mod', 'Maj1', 'Maj2', 'Sev', 'Sal'], 'Fireplaces': [], 'FireplaceQu': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'], 'GarageType': ['2Types', 'Attchd', 'Basment', 'BuiltIn', 'CarPort', 'Detchd', 'NA'], 'GarageYrBlt': [], 'GarageFinish': ['Fin', 'RFn', 'Unf', 'NA'], 'GarageCars': [], 'GarageArea': [], 'GarageQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'], 'GarageCond': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'], 'PavedDrive': ['Y', 'P', 'N'], 'WoodDeckSF': [], 'OpenPorchSF': [], 'EnclosedPorch': [], '3SsnPorch': [], 'ScreenPorch': [], 'PoolArea': [], 'PoolQC': ['Ex', 'Gd', 'TA', 'Fa', 'NA'], 'Fence': ['GdPrv', 'MnPrv', 'GdWo', 'MnWw', 'NA'], 'MiscFeature': ['Elev', 'Gar2', 'Othr', 'Shed', 'TenC', 'NA'], 'MiscVal': [], 'MoSold': [], 'YrSold': [], 'SaleType': ['WD', 'CWD', 'VWD', 'New', 'COD', 'Con', 'ConLw', 'ConLI', 'ConLD', 'Oth'], 'SaleCondition': ['Normal', 'Abnorml', 'AdjLand', 'Alloca', 'Family', 'Partial']}
1 | description_dict['FireplaceQu'] |
['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA']
预处理
首先先删除一些确实较多和不太好填充的feature。
1 | # 删除 |
处理 GarageYrBlt: Year garage was built
车库的年代,没有填充 0,改为 1900 年开始。
1 | def preprocess_garage_year(dataset): |
处理 Electrical
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1 | freq_port = train_df.Electrical.dropna().mode()[0] # 返回出现次数最多的值(众数) |
'SBrkr'
1 | def preprocess_garage_year(dataset): |
处理 MasVnrArea: Masonry veneer area in square feet
砖石饰面面积:砖石饰面面积(平方英尺)
缺失的不是太多(148),mean 103,众数(75%以上)为 0,还没想到太好的填充,先填个0试试吧。
1 | train_df.MasVnrArea.describe() |
count 1452.00000
mean 103.68526
std 181.06621
min 0.00000
25% 0.00000
50% 0.00000
75% 166.00000
max 1600.00000
Name: MasVnrArea, dtype: float64
1 | def preprocess_masvararea(dataset): |
处理 MasVnrType
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
这个值很奇怪,不太明白这是什么,是没有好呢还是 Stone 好呢?
1 | train_df.MasVnrType.dropna().mode()[0] # 返回出现次数最多的值(众数) |
'None'
大多数都没有,那就把缺失值填为没有吧。
1 | def preprocess_masvnrtype(dataset): |
处理其他缺失 feature
需要填充缺失值和重编码。
根据 data description 把 字符串类型的 feature 重编码。
构造 feature 对应的 map
观察发现以下这些缺失我们可以填充,顺便重编码。
1 | missing_value = ['Fence', |
1 | def generate_map(map_list, end_index=1): |
1 | missing_map_dict = {} |
{'BsmtCond': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'NA': 0, 'Po': 1, 'TA': 3},
'BsmtExposure': {'Av': 3, 'Gd': 4, 'Mn': 2, 'NA': 0, 'No': 1},
'BsmtFinType1': {'ALQ': 5,
'BLQ': 4,
'GLQ': 6,
'LwQ': 2,
'NA': 0,
'Rec': 3,
'Unf': 1},
'BsmtFinType2': {'ALQ': 5,
'BLQ': 4,
'GLQ': 6,
'LwQ': 2,
'NA': 0,
'Rec': 3,
'Unf': 1},
'BsmtQual': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'NA': 0, 'Po': 1, 'TA': 3},
'Fence': {'GdPrv': 4, 'GdWo': 2, 'MnPrv': 3, 'MnWw': 1, 'NA': 0},
'FireplaceQu': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'NA': 0, 'Po': 1, 'TA': 3},
'GarageCond': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'NA': 0, 'Po': 1, 'TA': 3},
'GarageFinish': {'Fin': 3, 'NA': 0, 'RFn': 2, 'Unf': 1},
'GarageQual': {'Ex': 5, 'Fa': 2, 'Gd': 4, 'NA': 0, 'Po': 1, 'TA': 3},
'GarageType': {'2Types': 6,
'Attchd': 5,
'Basment': 4,
'BuiltIn': 3,
'CarPort': 2,
'Detchd': 1,
'NA': 0}}
1 | # 预处理 feature 把 str 转换为序列 |
1 | def preprocess_feature(dataset): |
1 | train_df = preprocess_feature(train_df) |
检查训练集缺失值,已经没有了。
1 | train_df[train_df.isnull().values==True] |
Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | Fence | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|
填充测试集
测试集还有许多缺失,先决定用众数填充。
1 | test_df[test_df.isnull().values==True] |
Id | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | Fence | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
95 | 1556 | 50 | RL | 10632 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ClearCr | Norm | Norm | 1Fam | 1.5Fin | 5 | 3 | 1917 | 1950 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.00000 | TA | TA | BrkTil | 4.00000 | 2.00000 | 1.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 689.00000 | 689.00000 | GasA | Gd | N | SBrkr | 725 | 499 | 0 | 1224 | 0.00000 | 0.00000 | 1 | 1 | 3 | 1 | NaN | 6 | Mod | 0 | 0.00000 | 1.00000 | 17.00000 | 1.00000 | 1.00000 | 180.00000 | 2.00000 | 2.00000 | N | 0 | 0 | 248 | 0 | 0 | 0 | 0.00000 | 0 | 1 | 2010 | COD | Normal |
455 | 1916 | 30 | NaN | 21780 | Grvl | Reg | Lvl | NaN | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 1Story | 2 | 4 | 1910 | 1950 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.00000 | Fa | Fa | CBlock | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | GasA | TA | N | FuseA | 810 | 0 | 0 | 810 | 0.00000 | 0.00000 | 1 | 0 | 1 | 1 | TA | 4 | Min1 | 0 | 0.00000 | 1.00000 | 75.00000 | 1.00000 | 1.00000 | 280.00000 | 3.00000 | 3.00000 | N | 119 | 24 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 3 | 2009 | ConLD | Normal |
455 | 1916 | 30 | NaN | 21780 | Grvl | Reg | Lvl | NaN | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 1Story | 2 | 4 | 1910 | 1950 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.00000 | Fa | Fa | CBlock | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | GasA | TA | N | FuseA | 810 | 0 | 0 | 810 | 0.00000 | 0.00000 | 1 | 0 | 1 | 1 | TA | 4 | Min1 | 0 | 0.00000 | 1.00000 | 75.00000 | 1.00000 | 1.00000 | 280.00000 | 3.00000 | 3.00000 | N | 119 | 24 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 3 | 2009 | ConLD | Normal |
485 | 1946 | 20 | RL | 31220 | Pave | IR1 | Bnk | NaN | FR2 | Gtl | Gilbert | Feedr | Norm | 1Fam | 1Story | 6 | 2 | 1952 | 1952 | Hip | CompShg | BrkFace | BrkFace | None | 0.00000 | TA | TA | CBlock | 3.00000 | 3.00000 | 1.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 1632.00000 | 1632.00000 | GasA | TA | Y | FuseA | 1474 | 0 | 0 | 1474 | 0.00000 | 0.00000 | 1 | 0 | 3 | 1 | TA | 7 | Min2 | 2 | 4.00000 | 5.00000 | 52.00000 | 1.00000 | 2.00000 | 495.00000 | 3.00000 | 3.00000 | Y | 0 | 0 | 144 | 0 | 0 | 0 | 0.00000 | 750 | 5 | 2008 | WD | Normal |
660 | 2121 | 20 | RM | 5940 | Pave | IR1 | Lvl | AllPub | FR3 | Gtl | BrkSide | Feedr | Norm | 1Fam | 1Story | 4 | 7 | 1946 | 1950 | Gable | CompShg | MetalSd | CBlock | None | 0.00000 | TA | TA | PConc | 0.00000 | 0.00000 | 0.00000 | 0.00000 | nan | 0.00000 | nan | nan | nan | GasA | TA | Y | FuseA | 896 | 0 | 0 | 896 | nan | nan | 1 | 0 | 2 | 1 | TA | 4 | Typ | 0 | 0.00000 | 1.00000 | 46.00000 | 1.00000 | 1.00000 | 280.00000 | 3.00000 | 3.00000 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 3.00000 | 0 | 4 | 2008 | ConLD | Abnorml |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1013 | 2474 | 50 | RM | 10320 | Pave | Reg | Lvl | AllPub | Corner | Gtl | IDOTRR | Artery | Norm | 1Fam | 1.5Fin | 4 | 1 | 1910 | 1950 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0.00000 | Fa | Fa | CBlock | 3.00000 | 2.00000 | 1.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 771.00000 | 771.00000 | GasA | Fa | Y | SBrkr | 866 | 504 | 114 | 1484 | 0.00000 | 0.00000 | 2 | 0 | 3 | 1 | TA | 6 | NaN | 0 | 0.00000 | 1.00000 | 10.00000 | 1.00000 | 1.00000 | 264.00000 | 3.00000 | 2.00000 | N | 14 | 211 | 0 | 0 | 84 | 0 | 0.00000 | 0 | 9 | 2007 | COD | Abnorml |
1029 | 2490 | 20 | RL | 13770 | Pave | Reg | Lvl | AllPub | Corner | Gtl | Sawyer | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1958 | 1998 | Gable | CompShg | Plywood | Plywood | BrkFace | 340.00000 | TA | TA | CBlock | 3.00000 | 3.00000 | 2.00000 | 3.00000 | 190.00000 | 4.00000 | 873.00000 | 95.00000 | 1158.00000 | GasA | TA | Y | SBrkr | 1176 | 0 | 0 | 1176 | 1.00000 | 0.00000 | 1 | 0 | 3 | 1 | TA | 6 | Typ | 2 | 4.00000 | 5.00000 | 58.00000 | 1.00000 | 1.00000 | 303.00000 | 3.00000 | 3.00000 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 10 | 2007 | NaN | Normal |
1116 | 2577 | 70 | RM | 9060 | Pave | Reg | Lvl | AllPub | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 2Story | 5 | 6 | 1923 | 1999 | Gable | CompShg | Wd Sdng | Plywood | None | 0.00000 | TA | TA | BrkTil | 4.00000 | 3.00000 | 1.00000 | 5.00000 | 548.00000 | 1.00000 | 0.00000 | 311.00000 | 859.00000 | GasA | Ex | Y | SBrkr | 942 | 886 | 0 | 1828 | 0.00000 | 0.00000 | 2 | 0 | 3 | 1 | Gd | 6 | Typ | 0 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | nan | nan | 0.00000 | 0.00000 | Y | 174 | 0 | 212 | 0 | 0 | 0 | 3.00000 | 0 | 3 | 2007 | WD | Alloca |
1116 | 2577 | 70 | RM | 9060 | Pave | Reg | Lvl | AllPub | Inside | Gtl | IDOTRR | Norm | Norm | 1Fam | 2Story | 5 | 6 | 1923 | 1999 | Gable | CompShg | Wd Sdng | Plywood | None | 0.00000 | TA | TA | BrkTil | 4.00000 | 3.00000 | 1.00000 | 5.00000 | 548.00000 | 1.00000 | 0.00000 | 311.00000 | 859.00000 | GasA | Ex | Y | SBrkr | 942 | 886 | 0 | 1828 | 0.00000 | 0.00000 | 2 | 0 | 3 | 1 | Gd | 6 | Typ | 0 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | nan | nan | 0.00000 | 0.00000 | Y | 174 | 0 | 212 | 0 | 0 | 0 | 3.00000 | 0 | 3 | 2007 | WD | Alloca |
1444 | 2905 | 20 | NaN | 31250 | Pave | Reg | Lvl | AllPub | Inside | Gtl | Mitchel | Artery | Norm | 1Fam | 1Story | 1 | 3 | 1951 | 1951 | Gable | CompShg | CBlock | VinylSd | None | 0.00000 | TA | Fa | CBlock | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | GasA | TA | Y | FuseA | 1600 | 0 | 0 | 1600 | 0.00000 | 0.00000 | 1 | 1 | 3 | 1 | TA | 6 | Mod | 0 | 0.00000 | 5.00000 | 51.00000 | 1.00000 | 1.00000 | 270.00000 | 2.00000 | 3.00000 | N | 0 | 0 | 135 | 0 | 0 | 0 | 0.00000 | 0 | 5 | 2006 | WD | Normal |
22 rows × 76 columns
1 | for i in test_df.columns.values.tolist(): |
字符串类型 feature 重编码
处理完缺失值后观察下还有那些feature是字符串形式的。
1 | train_df.describe(include="O") |
MSZoning | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | ExterQual | ExterCond | Foundation | Heating | HeatingQC | CentralAir | Electrical | KitchenQual | Functional | PavedDrive | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 |
unique | 5 | 2 | 4 | 4 | 2 | 5 | 3 | 25 | 9 | 8 | 5 | 8 | 6 | 8 | 15 | 16 | 4 | 4 | 5 | 6 | 6 | 5 | 2 | 5 | 4 | 7 | 3 | 9 | 6 |
top | RL | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | None | TA | TA | PConc | GasA | Ex | Y | SBrkr | TA | Typ | Y | WD | Normal |
freq | 1151 | 1454 | 925 | 1311 | 1459 | 1052 | 1382 | 225 | 1260 | 1445 | 1220 | 726 | 1141 | 1434 | 515 | 504 | 872 | 906 | 1282 | 647 | 1428 | 741 | 1365 | 1335 | 735 | 1360 | 1340 | 1267 | 1198 |
目前还有以下的 feature 需要编码
[‘MSZoning’, ‘Street’, ‘LotShape’, ‘LandContour’, ‘Utilities’, ‘LotConfig’, ‘LandSlope’, ‘Neighborhood’, ‘Condition1’, ‘Condition2’, ‘BldgType’, ‘HouseStyle’, ‘RoofStyle’, ‘RoofMatl’, ‘Exterior1st’, ‘Exterior2nd’, ‘MasVnrType’, ‘ExterQual’, ‘ExterCond’, ‘Foundation’, ‘Heating’, ‘HeatingQC’, ‘CentralAir’, ‘Electrical’, ‘KitchenQual’, ‘Functional’, ‘PavedDrive’, ‘SaleType’, ‘SaleCondition’]
1 | feature = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition'] |
{'RL', 'FV', 'C (all)', 'RM', 'RH'}
{'Grvl', 'Pave'}
{'IR2', 'Reg', 'IR3', 'IR1'}
{'Lvl', 'Low', 'HLS', 'Bnk'}
{'AllPub', 'NoSeWa'}
{'CulDSac', 'FR2', 'Corner', 'FR3', 'Inside'}
{'Gtl', 'Mod', 'Sev'}
{'BrDale', 'NoRidge', 'Blmngtn', 'Sawyer', 'NPkVill', 'NridgHt', 'SawyerW', 'Mitchel', 'OldTown', 'NWAmes', 'NAmes', 'Somerst', 'Veenker', 'SWISU', 'CollgCr', 'BrkSide', 'ClearCr', 'IDOTRR', 'Crawfor', 'StoneBr', 'Timber', 'Gilbert', 'Blueste', 'MeadowV', 'Edwards'}
{'Artery', 'PosN', 'PosA', 'Norm', 'RRNn', 'RRAe', 'RRNe', 'Feedr', 'RRAn'}
{'Artery', 'PosN', 'PosA', 'Norm', 'RRNn', 'RRAe', 'Feedr', 'RRAn'}
{'2fmCon', 'Duplex', 'Twnhs', 'TwnhsE', '1Fam'}
{'SFoyer', '1.5Fin', '2Story', '1.5Unf', '2.5Fin', '2.5Unf', '1Story', 'SLvl'}
{'Mansard', 'Shed', 'Gable', 'Flat', 'Gambrel', 'Hip'}
{'Roll', 'Metal', 'ClyTile', 'WdShngl', 'CompShg', 'Tar&Grv', 'Membran', 'WdShake'}
{'BrkFace', 'Plywood', 'MetalSd', 'Stucco', 'WdShing', 'CBlock', 'AsphShn', 'Stone', 'ImStucc', 'CemntBd', 'Wd Sdng', 'VinylSd', 'BrkComm', 'AsbShng', 'HdBoard'}
{'BrkFace', 'Plywood', 'Wd Shng', 'MetalSd', 'CmentBd', 'Stucco', 'Other', 'CBlock', 'AsphShn', 'ImStucc', 'Stone', 'Wd Sdng', 'VinylSd', 'AsbShng', 'Brk Cmn', 'HdBoard'}
{'BrkFace', 'None', 'BrkCmn', 'Stone'}
{'Ex', 'Gd', 'Fa', 'TA'}
{'Fa', 'Gd', 'Po', 'TA', 'Ex'}
{'BrkTil', 'PConc', 'CBlock', 'Stone', 'Slab', 'Wood'}
{'GasA', 'GasW', 'Wall', 'Floor', 'Grav', 'OthW'}
{'Fa', 'Gd', 'Po', 'TA', 'Ex'}
{'Y', 'N'}
{'SBrkr', 'Mix', 'FuseA', 'FuseP', 'FuseF'}
{'Ex', 'Gd', 'Fa', 'TA'}
{'Mod', 'Min1', 'Maj1', 'Min2', 'Maj2', 'Typ', 'Sev'}
{'Y', 'N', 'P'}
{'Con', 'ConLD', 'ConLw', 'CWD', 'WD', 'New', 'ConLI', 'Oth', 'COD'}
{'Abnorml', 'Alloca', 'Normal', 'AdjLand', 'Partial', 'Family'}
观察到这些 feature 即有优劣等级划分的,也有没有等级的,既然如此,把有等级区分的编码为序列,没有等级的使用独热码。
有等级的:
MSZoning
ExterQual
ExterCond
HeatingQC
KitchenQual
其余的使用onehot编码。
1 | order_feature_set ={'MSZoning', |
Order 编码
1 | def preprocess_order_feature(dataset): |
1 | train_df = preprocess_order_feature(train_df) |
1 | train_df.describe() |
Id | MSSubClass | MSZoning | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | HeatingQC | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | Fence | MiscVal | MoSold | YrSold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 |
mean | 730.50000 | 56.89726 | 2.82534 | 10516.82808 | 6.09932 | 5.57534 | 1971.26781 | 1984.86575 | 103.11712 | 3.39589 | 3.08356 | 3.48904 | 2.93493 | 1.63014 | 3.54589 | 443.63973 | 1.24726 | 46.54932 | 567.24041 | 1057.42945 | 4.14521 | 1162.62671 | 346.99247 | 5.84452 | 1515.46370 | 0.42534 | 0.05753 | 1.56507 | 0.38288 | 2.86644 | 1.04658 | 3.51164 | 6.51781 | 0.61301 | 1.82534 | 3.51438 | 74.15068 | 1.71575 | 1.76712 | 472.98014 | 2.81027 | 2.80890 | 94.24452 | 46.66027 | 21.95411 | 3.40959 | 15.06096 | 2.75890 | 0.56575 | 43.48904 | 6.32192 | 2007.81575 | 180921.19589 |
std | 421.61001 | 42.30057 | 1.02017 | 9981.26493 | 1.38300 | 1.11280 | 30.20290 | 20.64541 | 180.73137 | 0.57428 | 0.35105 | 0.87648 | 0.55216 | 1.06739 | 2.10778 | 456.09809 | 0.89233 | 161.31927 | 441.86696 | 438.70532 | 0.95950 | 386.58774 | 436.52844 | 48.62308 | 525.48038 | 0.51891 | 0.23875 | 0.55092 | 0.50289 | 0.81578 | 0.22034 | 0.66376 | 1.62539 | 0.64467 | 1.81088 | 1.93321 | 29.98205 | 0.89283 | 0.74732 | 213.80484 | 0.72290 | 0.71969 | 125.33879 | 66.25603 | 61.11915 | 29.31733 | 55.75742 | 40.17731 | 1.20448 | 496.12302 | 2.70363 | 1.32810 | 79442.50288 |
min | 1.00000 | 20.00000 | 0.00000 | 1300.00000 | 1.00000 | 1.00000 | 1872.00000 | 1950.00000 | 0.00000 | 2.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 334.00000 | 0.00000 | 0.00000 | 334.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 2.00000 | 2.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 2006.00000 | 34900.00000 |
25% | 365.75000 | 20.00000 | 3.00000 | 7553.50000 | 5.00000 | 5.00000 | 1954.00000 | 1967.00000 | 0.00000 | 3.00000 | 3.00000 | 3.00000 | 3.00000 | 1.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 223.00000 | 795.75000 | 3.00000 | 882.00000 | 0.00000 | 0.00000 | 1129.50000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 3.00000 | 5.00000 | 0.00000 | 0.00000 | 1.00000 | 58.00000 | 1.00000 | 1.00000 | 334.50000 | 3.00000 | 3.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 5.00000 | 2007.00000 | 129975.00000 |
50% | 730.50000 | 50.00000 | 3.00000 | 9478.50000 | 6.00000 | 5.00000 | 1973.00000 | 1994.00000 | 0.00000 | 3.00000 | 3.00000 | 4.00000 | 3.00000 | 1.00000 | 4.00000 | 383.50000 | 1.00000 | 0.00000 | 477.50000 | 991.50000 | 5.00000 | 1087.00000 | 0.00000 | 0.00000 | 1464.00000 | 0.00000 | 0.00000 | 2.00000 | 0.00000 | 3.00000 | 1.00000 | 3.00000 | 6.00000 | 1.00000 | 2.00000 | 5.00000 | 77.00000 | 2.00000 | 2.00000 | 480.00000 | 3.00000 | 3.00000 | 0.00000 | 25.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 6.00000 | 2008.00000 | 163000.00000 |
75% | 1095.25000 | 70.00000 | 3.00000 | 11601.50000 | 7.00000 | 6.00000 | 2000.00000 | 2004.00000 | 164.25000 | 4.00000 | 3.00000 | 4.00000 | 3.00000 | 2.00000 | 6.00000 | 712.25000 | 1.00000 | 0.00000 | 808.00000 | 1298.25000 | 5.00000 | 1391.25000 | 728.00000 | 0.00000 | 1776.75000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 3.00000 | 1.00000 | 4.00000 | 7.00000 | 1.00000 | 4.00000 | 5.00000 | 101.00000 | 2.00000 | 2.00000 | 576.00000 | 3.00000 | 3.00000 | 168.00000 | 68.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 8.00000 | 2009.00000 | 214000.00000 |
max | 1460.00000 | 190.00000 | 6.00000 | 215245.00000 | 10.00000 | 9.00000 | 2010.00000 | 2010.00000 | 1600.00000 | 5.00000 | 5.00000 | 5.00000 | 4.00000 | 4.00000 | 6.00000 | 5644.00000 | 6.00000 | 1474.00000 | 2336.00000 | 6110.00000 | 5.00000 | 4692.00000 | 2065.00000 | 572.00000 | 5642.00000 | 3.00000 | 2.00000 | 3.00000 | 2.00000 | 8.00000 | 3.00000 | 5.00000 | 14.00000 | 3.00000 | 5.00000 | 6.00000 | 110.00000 | 3.00000 | 4.00000 | 1418.00000 | 5.00000 | 5.00000 | 857.00000 | 547.00000 | 552.00000 | 508.00000 | 480.00000 | 738.00000 | 4.00000 | 15500.00000 | 12.00000 | 2010.00000 | 755000.00000 |
One hot 编码
1 | train_df.shape, test_df.shape |
((1460, 77), (1459, 76))
1 | def preprocess_onehot_feature(dataset): |
1 | train_onehot_df = preprocess_onehot_feature(train_df) |
1 | print(train_onehot_df.shape, test_onehot_df.shape) |
(1460, 168) (1459, 153)
{'Condition2_RRAe',
'Condition2_RRAn',
'Condition2_RRNn',
'Electrical_Mix',
'Exterior1st_ImStucc',
'Exterior1st_Stone',
'Exterior2nd_Other',
'Heating_Floor',
'Heating_OthW',
'HouseStyle_2.5Fin',
'RoofMatl_ClyTile',
'RoofMatl_Membran',
'RoofMatl_Metal',
'RoofMatl_Roll',
'Utilities_NoSeWa'}
又遇到个坑,测试集比训练集的情况少。
1 | # 生成一个缺失的 dagafarm |
1 | print(train_onehot_df.shape, test_onehot_df.shape) |
(1460, 168) (1459, 168)
把 新 one hot 编码的 feature 添加到 原集合中,并删除原来的 feature
1 | # 把one hot的feature添加到训练集和测试集 |
1 | train_df.describe() |
Id | MSSubClass | MSZoning | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | HeatingQC | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | ... | Exterior1st_BrkFace | Exterior1st_CBlock | Exterior1st_CemntBd | Exterior1st_HdBoard | Exterior1st_ImStucc | Exterior1st_MetalSd | Exterior1st_Plywood | Exterior1st_Stone | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing | RoofMatl_ClyTile | RoofMatl_CompShg | RoofMatl_Membran | RoofMatl_Metal | RoofMatl_Roll | RoofMatl_Tar&Grv | RoofMatl_WdShake | RoofMatl_WdShngl | Electrical_FuseA | Electrical_FuseF | Electrical_FuseP | Electrical_Mix | Electrical_SBrkr | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | Condition1_Artery | Condition1_Feedr | Condition1_Norm | Condition1_PosA | Condition1_PosN | Condition1_RRAe | Condition1_RRAn | Condition1_RRNe | Condition1_RRNn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | ... | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 | 1460.00000 |
mean | 730.50000 | 56.89726 | 2.82534 | 10516.82808 | 6.09932 | 5.57534 | 1971.26781 | 1984.86575 | 103.11712 | 3.39589 | 3.08356 | 3.48904 | 2.93493 | 1.63014 | 3.54589 | 443.63973 | 1.24726 | 46.54932 | 567.24041 | 1057.42945 | 4.14521 | 1162.62671 | 346.99247 | 5.84452 | 1515.46370 | 0.42534 | 0.05753 | 1.56507 | 0.38288 | 2.86644 | 1.04658 | 3.51164 | 6.51781 | 0.61301 | 1.82534 | 3.51438 | 74.15068 | 1.71575 | 1.76712 | 472.98014 | ... | 0.03425 | 0.00068 | 0.04178 | 0.15205 | 0.00068 | 0.15068 | 0.07397 | 0.00137 | 0.01712 | 0.35274 | 0.14110 | 0.01781 | 0.00068 | 0.98219 | 0.00068 | 0.00068 | 0.00068 | 0.00753 | 0.00342 | 0.00411 | 0.06438 | 0.01849 | 0.00205 | 0.00068 | 0.91438 | 0.06918 | 0.00274 | 0.00822 | 0.01370 | 0.82055 | 0.08562 | 0.03288 | 0.05548 | 0.86301 | 0.00548 | 0.01301 | 0.00753 | 0.01781 | 0.00137 | 0.00342 |
std | 421.61001 | 42.30057 | 1.02017 | 9981.26493 | 1.38300 | 1.11280 | 30.20290 | 20.64541 | 180.73137 | 0.57428 | 0.35105 | 0.87648 | 0.55216 | 1.06739 | 2.10778 | 456.09809 | 0.89233 | 161.31927 | 441.86696 | 438.70532 | 0.95950 | 386.58774 | 436.52844 | 48.62308 | 525.48038 | 0.51891 | 0.23875 | 0.55092 | 0.50289 | 0.81578 | 0.22034 | 0.66376 | 1.62539 | 0.64467 | 1.81088 | 1.93321 | 29.98205 | 0.89283 | 0.74732 | 213.80484 | ... | 0.18192 | 0.02617 | 0.20016 | 0.35920 | 0.02617 | 0.35786 | 0.26182 | 0.03700 | 0.12978 | 0.47799 | 0.34824 | 0.13230 | 0.02617 | 0.13230 | 0.02617 | 0.02617 | 0.02617 | 0.08650 | 0.05844 | 0.06400 | 0.24552 | 0.13477 | 0.04530 | 0.02617 | 0.27989 | 0.25384 | 0.05229 | 0.09032 | 0.11628 | 0.38386 | 0.27989 | 0.17837 | 0.22899 | 0.34395 | 0.07385 | 0.11337 | 0.08650 | 0.13230 | 0.03700 | 0.05844 |
min | 1.00000 | 20.00000 | 0.00000 | 1300.00000 | 1.00000 | 1.00000 | 1872.00000 | 1950.00000 | 0.00000 | 2.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 334.00000 | 0.00000 | 0.00000 | 334.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 2.00000 | 2.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
25% | 365.75000 | 20.00000 | 3.00000 | 7553.50000 | 5.00000 | 5.00000 | 1954.00000 | 1967.00000 | 0.00000 | 3.00000 | 3.00000 | 3.00000 | 3.00000 | 1.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 223.00000 | 795.75000 | 3.00000 | 882.00000 | 0.00000 | 0.00000 | 1129.50000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 3.00000 | 5.00000 | 0.00000 | 0.00000 | 1.00000 | 58.00000 | 1.00000 | 1.00000 | 334.50000 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
50% | 730.50000 | 50.00000 | 3.00000 | 9478.50000 | 6.00000 | 5.00000 | 1973.00000 | 1994.00000 | 0.00000 | 3.00000 | 3.00000 | 4.00000 | 3.00000 | 1.00000 | 4.00000 | 383.50000 | 1.00000 | 0.00000 | 477.50000 | 991.50000 | 5.00000 | 1087.00000 | 0.00000 | 0.00000 | 1464.00000 | 0.00000 | 0.00000 | 2.00000 | 0.00000 | 3.00000 | 1.00000 | 3.00000 | 6.00000 | 1.00000 | 2.00000 | 5.00000 | 77.00000 | 2.00000 | 2.00000 | 480.00000 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
75% | 1095.25000 | 70.00000 | 3.00000 | 11601.50000 | 7.00000 | 6.00000 | 2000.00000 | 2004.00000 | 164.25000 | 4.00000 | 3.00000 | 4.00000 | 3.00000 | 2.00000 | 6.00000 | 712.25000 | 1.00000 | 0.00000 | 808.00000 | 1298.25000 | 5.00000 | 1391.25000 | 728.00000 | 0.00000 | 1776.75000 | 1.00000 | 0.00000 | 2.00000 | 1.00000 | 3.00000 | 1.00000 | 4.00000 | 7.00000 | 1.00000 | 4.00000 | 5.00000 | 101.00000 | 2.00000 | 2.00000 | 576.00000 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
max | 1460.00000 | 190.00000 | 6.00000 | 215245.00000 | 10.00000 | 9.00000 | 2010.00000 | 2010.00000 | 1600.00000 | 5.00000 | 5.00000 | 5.00000 | 4.00000 | 4.00000 | 6.00000 | 5644.00000 | 6.00000 | 1474.00000 | 2336.00000 | 6110.00000 | 5.00000 | 4692.00000 | 2065.00000 | 572.00000 | 5642.00000 | 3.00000 | 2.00000 | 3.00000 | 2.00000 | 8.00000 | 3.00000 | 5.00000 | 14.00000 | 3.00000 | 5.00000 | 6.00000 | 110.00000 | 3.00000 | 4.00000 | 1418.00000 | ... | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
8 rows × 221 columns
1 | train_df.info() |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 221 entries, Id to Condition1_RRNn
dtypes: float64(182), int64(39)
memory usage: 2.5 MB
终于处理完了,已经没有 object 类型的数据了。接下来就能建模了。
模型预测
1 | train_df.sample(frac=1) |
((1460, 219), (1460,), (1459, 219))
先用逻辑回归看看
1 | X_train |
Id | MSSubClass | MSZoning | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | HeatingQC | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | ... | Exterior1st_BrkFace | Exterior1st_CBlock | Exterior1st_CemntBd | Exterior1st_HdBoard | Exterior1st_ImStucc | Exterior1st_MetalSd | Exterior1st_Plywood | Exterior1st_Stone | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing | RoofMatl_ClyTile | RoofMatl_CompShg | RoofMatl_Membran | RoofMatl_Metal | RoofMatl_Roll | RoofMatl_Tar&Grv | RoofMatl_WdShake | RoofMatl_WdShngl | Electrical_FuseA | Electrical_FuseF | Electrical_FuseP | Electrical_Mix | Electrical_SBrkr | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | Condition1_Artery | Condition1_Feedr | Condition1_Norm | Condition1_PosA | Condition1_PosN | Condition1_RRAe | Condition1_RRAn | Condition1_RRNe | Condition1_RRNn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | 3.00000 | 8450 | 7 | 5 | 2003 | 2003 | 196.00000 | 4 | 3 | 4.00000 | 3.00000 | 1.00000 | 6.00000 | 706 | 1.00000 | 0 | 150 | 856 | 5 | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | 4 | 8 | 0 | 0.00000 | 5.00000 | 103.00000 | 2.00000 | 2 | 548 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
1 | 2 | 20 | 3.00000 | 9600 | 6 | 8 | 1976 | 1976 | 0.00000 | 3 | 3 | 4.00000 | 3.00000 | 4.00000 | 5.00000 | 978 | 1.00000 | 0 | 284 | 1262 | 5 | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | 3 | 6 | 1 | 3.00000 | 5.00000 | 76.00000 | 2.00000 | 2 | 460 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
2 | 3 | 60 | 3.00000 | 11250 | 7 | 5 | 2001 | 2002 | 162.00000 | 4 | 3 | 4.00000 | 3.00000 | 2.00000 | 6.00000 | 486 | 1.00000 | 0 | 434 | 920 | 5 | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | 4 | 6 | 1 | 3.00000 | 5.00000 | 101.00000 | 2.00000 | 2 | 608 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
3 | 4 | 70 | 3.00000 | 9550 | 7 | 5 | 1915 | 1970 | 0.00000 | 3 | 3 | 3.00000 | 4.00000 | 1.00000 | 5.00000 | 216 | 1.00000 | 0 | 540 | 756 | 4 | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | 4 | 7 | 1 | 4.00000 | 1.00000 | 98.00000 | 1.00000 | 3 | 642 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
4 | 5 | 60 | 3.00000 | 14260 | 8 | 5 | 2000 | 2000 | 350.00000 | 4 | 3 | 4.00000 | 3.00000 | 3.00000 | 6.00000 | 655 | 1.00000 | 0 | 490 | 1145 | 5 | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | 4 | 9 | 1 | 3.00000 | 5.00000 | 100.00000 | 2.00000 | 3 | 836 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1455 | 1456 | 60 | 3.00000 | 7917 | 6 | 5 | 1999 | 2000 | 0.00000 | 3 | 3 | 4.00000 | 3.00000 | 1.00000 | 1.00000 | 0 | 1.00000 | 0 | 953 | 953 | 5 | 953 | 694 | 0 | 1647 | 0 | 0 | 2 | 1 | 3 | 1 | 3 | 7 | 1 | 3.00000 | 5.00000 | 99.00000 | 2.00000 | 2 | 460 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
1456 | 1457 | 20 | 3.00000 | 13175 | 6 | 6 | 1978 | 1988 | 119.00000 | 3 | 3 | 4.00000 | 3.00000 | 1.00000 | 5.00000 | 790 | 3.00000 | 163 | 589 | 1542 | 3 | 2073 | 0 | 0 | 2073 | 1 | 0 | 2 | 0 | 3 | 1 | 3 | 7 | 2 | 3.00000 | 5.00000 | 78.00000 | 1.00000 | 2 | 500 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
1457 | 1458 | 70 | 3.00000 | 9042 | 7 | 9 | 1941 | 2006 | 0.00000 | 5 | 4 | 3.00000 | 4.00000 | 1.00000 | 6.00000 | 275 | 1.00000 | 0 | 877 | 1152 | 5 | 1188 | 1152 | 0 | 2340 | 0 | 0 | 2 | 0 | 4 | 1 | 4 | 9 | 2 | 4.00000 | 5.00000 | 41.00000 | 2.00000 | 1 | 252 | ... | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
1458 | 1459 | 20 | 3.00000 | 9717 | 5 | 6 | 1950 | 1996 | 0.00000 | 3 | 3 | 3.00000 | 3.00000 | 2.00000 | 6.00000 | 49 | 3.00000 | 1029 | 0 | 1078 | 4 | 1078 | 0 | 0 | 1078 | 1 | 0 | 1 | 0 | 2 | 1 | 4 | 5 | 0 | 0.00000 | 5.00000 | 50.00000 | 1.00000 | 1 | 240 | ... | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
1459 | 1460 | 20 | 3.00000 | 9937 | 5 | 6 | 1965 | 1965 | 0.00000 | 4 | 3 | 3.00000 | 3.00000 | 1.00000 | 4.00000 | 830 | 2.00000 | 290 | 136 | 1256 | 4 | 1256 | 0 | 0 | 1256 | 1 | 0 | 1 | 1 | 3 | 1 | 3 | 6 | 0 | 0.00000 | 5.00000 | 65.00000 | 3.00000 | 1 | 276 | ... | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
1460 rows × 220 columns
1 | # 逻辑回归 |
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
"this warning.", FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
90.55
1 | # Random Forest |
100.0
1 |
|
1 | plt.figure(figsize=(30, 5)) |
1 | plt.figure(figsize=(30, 5)) |
2231563230.032489
1 | np.array([range(len(pred))]*10).T |
array([[ 0, 0, 0, ..., 0, 0, 0],
[ 1, 1, 1, ..., 1, 1, 1],
[ 2, 2, 2, ..., 2, 2, 2],
...,
[457, 457, 457, ..., 457, 457, 457],
[458, 458, 458, ..., 458, 458, 458],
[459, 459, 459, ..., 459, 459, 459]])
1 | # Support Vector Machines |
/usr/local/lib/python3.6/dist-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
"avoid this warning.", FutureWarning)
99.52
1 | # 保存结果 |