Now let's combine the previous three steps, plus the fourth, into one complete program.
```python
# Imports
import pandas as pd
import numpy as np

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submit = pd.read_csv('sample_submit.csv')

# Initial settings
beta = [1, 1]
alpha = 0.2
tol_L = 0.1

# Normalize x
max_x = max(train['id'])
x = train['id'] / max_x
y = train['questions']

# Compute the gradient of the squared loss
def compute_grad(beta, x, y):
    grad = [0, 0]
    grad[0] = 2. * np.mean(beta[0] + beta[1] * x - y)
    grad[1] = 2. * np.mean(x * (beta[0] + beta[1] * x - y))
    return np.array(grad)

# Take one gradient step
def update_beta(beta, alpha, grad):
    new_beta = np.array(beta) - alpha * grad
    return new_beta

# Compute the RMSE
def rmse(beta, x, y):
    squared_err = (beta[0] + beta[1] * x - y) ** 2
    res = np.sqrt(np.mean(squared_err))
    return res

# First update, done outside the loop
grad = compute_grad(beta, x, y)
loss = rmse(beta, x, y)
beta = update_beta(beta, alpha, grad)
loss_new = rmse(beta, x, y)

# Iterate until the change in RMSE falls below tol_L
i = 1
while np.abs(loss_new - loss) > tol_L:
    beta = update_beta(beta, alpha, grad)
    grad = compute_grad(beta, x, y)
    loss = loss_new
    loss_new = rmse(beta, x, y)
    i += 1
    print('Round %s Diff RMSE %s' % (i, abs(loss_new - loss)))
print('Coef: %s \nIntercept %s' % (beta[1], beta[0]))
```
Round 2 Diff RMSE 984.983509929
Round 3 Diff RMSE 22.6533222671
Round 4 Diff RMSE 21.2748710284
Round 5 Diff RMSE 20.415520988
...
Round 115 Diff RMSE 0.11257335093
Round 116 Diff RMSE 0.106753598452
Round 117 Diff RMSE 0.101233641076
Round 118 Diff RMSE 0.0959981429022
Coef: 4796.26618876
Intercept 1015.70899949
After 118 iterations, the convergence condition is met.
Because we normalized `x`, the `Coef` above is actually the true coefficient multiplied by `max_x`.
We can rescale it to recover the final regression coefficient.
```python
print('Our Coef: %s \nOur Intercept %s'%(beta[1] / max_x, beta[0]))
```
Our Coef: 2.12883541445
Our Intercept 1015.70899949
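This rescaling trick can be checked end to end. The sketch below (synthetic data with a known slope and intercept; all names are made up for illustration) runs the same normalized-x gradient descent and confirms that dividing the fitted slope by `max_x` recovers the original-scale coefficient.

```python
import numpy as np

# Synthetic data with a known ground truth: y = 5 + 2 * x + noise
rng = np.random.default_rng(0)
true_coef, true_intercept = 2.0, 5.0
x_raw = np.arange(1, 201, dtype=float)          # plays the role of 'id'
y = true_intercept + true_coef * x_raw + rng.normal(0, 0.5, x_raw.size)

# Normalize x exactly as in the main program
max_x = x_raw.max()
x = x_raw / max_x

# Plain gradient descent on the normalized data (fixed iteration budget
# instead of a tolerance check, to keep the sketch short)
beta = np.array([1.0, 1.0])
alpha = 0.2
for _ in range(20000):
    err = beta[0] + beta[1] * x - y
    grad = 2.0 * np.array([err.mean(), (x * err).mean()])
    beta = beta - alpha * grad

print(beta[1] / max_x)   # close to true_coef = 2.0
print(beta[0])           # close to true_intercept = 5.0
```

The fitted slope on normalized data is roughly `true_coef * max_x`; dividing by `max_x` undoes the scaling, while the intercept needs no correction.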
And the training RMSE:
```python
res = rmse(beta, x, y)
print('Our RMSE: %s'%res)
```
Our RMSE: 533.598313974
We can use the standard module `sklearn.linear_model.LinearRegression` to check the coefficients we obtained via gradient descent.
```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(train[['id']], train[['questions']])
print('Sklearn Coef: %s' % lr.coef_[0][0])
print('Sklearn Intercept: %s' % lr.intercept_[0])
```
Sklearn Coef: 2.19487084445
Sklearn Intercept: 936.051219649
```python
res = rmse([936.051219649, 2.19487084], train['id'], y)
print('Sklearn RMSE: %s'%res)
```
Sklearn RMSE: 531.841307949
Both our coefficients and our RMSE are fairly close to sklearn's results; the small remaining gap comes from stopping gradient descent at a finite tolerance.
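Another way to verify the implementation, independent of sklearn, is against the closed-form least-squares solution. A minimal sketch (synthetic data, illustrative names) fits the same model with `np.linalg.lstsq` and checks that gradient descent lands on the same minimizer:

```python
import numpy as np

# Synthetic data: y = 3 + 4 * x + noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
y = 3.0 + 4.0 * x + rng.normal(0, 0.1, x.size)

# Closed-form fit: minimize ||A @ beta - y||^2 with A = [1, x]
A = np.column_stack([np.ones_like(x), x])
beta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gradient descent with the same update rule as in the text
beta = np.array([1.0, 1.0])
alpha = 0.2
for _ in range(5000):
    err = beta[0] + beta[1] * x - y
    beta = beta - alpha * 2.0 * np.array([err.mean(), (x * err).mean()])

print(np.allclose(beta, beta_ls, atol=1e-3))    # the two fits agree
```

Both procedures minimize the same squared loss, so with enough iterations gradient descent converges to the `lstsq` solution up to numerical tolerance.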