## 3. Mini-Batch Stochastic Gradient Descent

As noted at the end of the previous section, mini-batch stochastic gradient descent is a compromise between speed and stability.

Let us first recall how full-batch gradient descent computes the gradient at each iteration:

$$\nabla L = \left(\frac{\partial L}{\partial \beta\_0}, \frac{\partial L}{\partial \beta\_1}\right)=\left(\frac{2}{N}\sum\_{j=1}^N(\beta\_0+\beta\_1x\_j-\hat y\_j), \frac{2}{N}\sum\_{j=1}^Nx\_j(\beta\_0+\beta\_1x\_j-\hat y\_j)\right)$$

and how stochastic gradient descent (SGD) computes it:

$$\nabla L = \left(\frac{\partial L}{\partial \beta\_0}, \frac{\partial L}{\partial \beta\_1}\right)=\left(2(\beta\_0+\beta\_1x\_r-\hat y\_r), 2x\_r(\beta\_0+\beta\_1x\_r-\hat y\_r)\right),$$

where $(x\_r, \hat y\_r)$ is a randomly chosen sample.

The key idea of mini-batch stochastic gradient descent is that instead of using a single random sample, we use $b$ distinct samples chosen at random. The gradient is computed as

$$\nabla L = \left(\frac{\partial L}{\partial \beta\_0}, \frac{\partial L}{\partial \beta\_1}\right)=\left(\frac{2}{b}\sum\_{r=1}^b(\beta\_0+\beta\_1x\_{j\_r}-\hat y\_{j\_r}), \frac{2}{b}\sum\_{r=1}^bx\_{j\_r}(\beta\_0+\beta\_1x\_{j\_r}-\hat y\_{j\_r})\right),$$

where $j\_1,\dots,j\_b$ are the indices of the randomly chosen samples.

We can see that when $b=1$, mini-batch gradient descent is equivalent to SGD, and when $b=N$ it is equivalent to full-batch gradient descent. The performance of mini-batch gradient descent therefore also depends on the choice of $b$; this value is called the batch size. Readers interested in how to choose $b$ can join **[the discussion here](http://sofasofa.io/forum_main_post.php?postid=1000667)**.

Next, let us write the function `compute_grad_batch(beta, batch_size, x, y)` that computes the gradient for mini-batch gradient descent.

```python
def compute_grad_batch(beta, batch_size, x, y):
    grad = [0, 0]
    # draw batch_size sample indices at random, without replacement
    r = np.random.choice(range(len(x)), batch_size, replace=False)
    grad[0] = 2. * np.mean(beta[0] + beta[1] * x[r] - y[r])
    grad[1] = 2. * np.mean(x[r] * (beta[0] + beta[1] * x[r] - y[r]))
    return np.array(grad)
```

Once `batch_size` is set, we can run mini-batch stochastic gradient descent.

```python
# Import modules
import pandas as pd
import numpy as np

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submit = pd.read_csv('sample_submit.csv')

# Initial settings
beta = [1, 1]
alpha = 0.2
tol_L = 0.1
batch_size = 16

# Normalize x
max_x = max(train['id'])
x = train['id'] / max_x
y = train['questions']

# Function that computes the mini-batch stochastic gradient
def compute_grad_batch(beta, batch_size, x, y):
    grad = [0, 0]
    r = np.random.choice(range(len(x)), batch_size, replace=False)
    grad[0] = 2. * np.mean(beta[0] + beta[1] * x[r] - y[r])
    grad[1] = 2. * np.mean(x[r] * (beta[0] + beta[1] * x[r] - y[r]))
    return np.array(grad)

# Function that updates beta
def update_beta(beta, alpha, grad):
    new_beta = np.array(beta) - alpha * grad
    return new_beta

# Function that computes the RMSE
def rmse(beta, x, y):
    squared_err = (beta[0] + beta[1] * x - y) ** 2
    res = np.sqrt(np.mean(squared_err))
    return res

# Perform the first update
np.random.seed(10)
grad = compute_grad_batch(beta, batch_size, x, y)
loss = rmse(beta, x, y)
beta = update_beta(beta, alpha, grad)
loss_new = rmse(beta, x, y)

# Start iterating; because individual mini-batch updates are noisy,
# the RMSE is re-evaluated and compared only every 100 rounds
i = 1
while np.abs(loss_new - loss) > tol_L:
    beta = update_beta(beta, alpha, grad)
    grad = compute_grad_batch(beta, batch_size, x, y)
    if i % 100 == 0:
        loss = loss_new
        loss_new = rmse(beta, x, y)
        print('Round %s Diff RMSE %s'%(i, abs(loss_new - loss)))
    i += 1
print('Coef: %s \nIntercept %s'%(beta[1], beta[0]))
```
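Before looking at the output, a quick aside on the claim above that mini-batch gradient descent reduces to full-batch gradient descent when $b=N$: the sketch below checks this numerically on a tiny made-up dataset. It reuses `compute_grad_batch` from the script above; the toy data and the helper `compute_grad_full` are assumptions introduced only for this check.

```python
import numpy as np

# Hypothetical toy data, used only to check the function
x_toy = np.array([0.0, 0.5, 1.0, 1.5])
y_toy = np.array([1.0, 2.0, 2.9, 4.1])
beta_toy = [1, 1]

def compute_grad_full(beta, x, y):
    # Full-batch gradient, written directly from the first formula above
    grad = [0, 0]
    grad[0] = 2. * np.mean(beta[0] + beta[1] * x - y)
    grad[1] = 2. * np.mean(x * (beta[0] + beta[1] * x - y))
    return np.array(grad)

# With batch_size equal to the sample size, the mini-batch gradient
# should agree with the full-batch gradient (up to floating-point rounding)
g_batch = compute_grad_batch(beta_toy, len(x_toy), x_toy, y_toy)
g_full = compute_grad_full(beta_toy, x_toy, y_toy)
print(g_batch, g_full, np.allclose(g_batch, g_full))
```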
Running the full script prints the following (the middle rounds are omitted):

Round 100 Diff RMSE 1441.03092609
Round 200 Diff RMSE 6.73060008207
Round 300 Diff RMSE 0.919979446475
Round 400 Diff RMSE 19.5313659659
Round 500 Diff RMSE 18.9636102461
Round 600 Diff RMSE 0.933603570602
...
Round 3900 Diff RMSE 1.72613234384
Round 4000 Diff RMSE 12.2329159056
Round 4100 Diff RMSE 0.34479013181
Round 4200 Diff RMSE 0.037618992278
Coef: 4960.29891935
Intercept 923.011781435

After 4200 iterations (in terms of samples processed, equivalent to 16 * 4200 / 2253 ≈ 29.83 iterations of full-batch gradient descent), the coefficients of the model obtained by mini-batch gradient descent are as follows. Since x was normalized by `max_x`, we divide `beta[1]` by `max_x` to recover the coefficient on the original scale.

```python
print('Our Coef: %s \nOur Intercept %s'%(beta[1] / max_x, beta[0]))
```

Our Coef: 2.20164177512
Our Intercept 923.011781435

The training RMSE is

```python
res = rmse(beta, x, y)
print('Our RMSE: %s'%res)
```

Our RMSE: 531.887039197

This value is very close to the Sklearn RMSE of 531.841307949.
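For reference, the Sklearn RMSE quoted above presumably comes from an ordinary least-squares fit on the same data. A minimal sketch of how such a baseline might be computed with `sklearn.linear_model.LinearRegression`, assuming the same `train.csv` as above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the raw (un-normalized) feature
train = pd.read_csv('train.csv')
X = train[['id']]
y = train['questions']

model = LinearRegression()
model.fit(X, y)

# RMSE of the sklearn fit, for comparison with the mini-batch result above
pred = model.predict(X)
rmse_sklearn = np.sqrt(np.mean((pred - y) ** 2))
print('Sklearn Coef: %s' % model.coef_[0])
print('Sklearn Intercept: %s' % model.intercept_)
print('Sklearn RMSE: %s' % rmse_sklearn)
```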