Predicting Question and Answer Counts on a Q&A Website

Benchmark: Predicting Question and Answer Counts on a Q&A Website

Simple Linear Regression Model (Python)

Average MAPE of this model's predictions: 0.09190
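
For readers who want to score their own predictions, here is a minimal sketch of the metric. It assumes that "average MAPE" means the plain mean of the MAPE on questions and the MAPE on answers; the text does not spell this out, so treat that averaging rule, the helper names mape and average_mape, and the toy numbers as illustrative assumptions.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Each term is |true - predicted| / |true|; true values must be non-zero.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_mape(q_true, q_pred, a_true, a_pred):
    """ASSUMPTION: the reported score is the mean of the two per-target MAPEs."""
    return (mape(q_true, q_pred) + mape(a_true, a_pred)) / 2


# Toy numbers, made up purely for illustration.
score = average_mape([100, 200], [110, 190], [50, 80], [45, 80])
print(round(score, 4))  # 0.0625
```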

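This model uses only id as the independent variable of a simple (one-variable) linear regression.
The regression coefficients can be obtained with any off-the-shelf function, or by writing stochastic gradient descent by hand.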
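
The hand-written route mentioned above could look roughly like the sketch below. It is not the benchmark's own code: the function names, the learning rate, the number of epochs, and the synthetic data are all assumptions chosen for illustration. The inputs are standardised before the updates so that a single small learning rate is stable even when id runs into the thousands, and the closed-form least-squares fit is included as a sanity check.

```python
import random
import statistics


def fit_line_sgd(x, y, lr=0.01, epochs=200, seed=0):
    """Fit y ~ slope * x + intercept by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    # Standardise both variables so one learning rate suits both coefficients.
    x_mean, x_std = statistics.fmean(x), statistics.pstdev(x)
    y_mean, y_std = statistics.fmean(y), statistics.pstdev(y)
    xs = [(v - x_mean) / x_std for v in x]
    ys = [(v - y_mean) / y_std for v in y]

    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)              # visit the samples in a fresh random order
        for i in order:
            err = w * xs[i] + b - ys[i]
            w -= lr * err * xs[i]       # gradient of 0.5 * err**2 with respect to w
            b -= lr * err               # gradient of 0.5 * err**2 with respect to b

    # Map the coefficients back to the original scale.
    slope = w * y_std / x_std
    intercept = y_mean + b * y_std - slope * x_mean
    return slope, intercept


def fit_line_closed_form(x, y):
    """Ordinary least squares for one feature, used here as a reference."""
    x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - x_mean) * (c - y_mean) for a, c in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic stand-in for the id and questions columns (not the real data).
noise = random.Random(1)
ids = list(range(1, 201))
questions = [50 + 3 * i + noise.gauss(0, 5) for i in ids]
print(fit_line_sgd(ids, questions))
print(fit_line_closed_form(ids, questions))
```

On the real data, x would be the id column and y either questions or answers; the two printed coefficient pairs should agree closely if the learning rate and epoch count are adequate.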




Weekday Interaction Regression Model (Python)

Average MAPE of this model's predictions: 0.04531

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Pull out the targets: questions and answers
q_train = train.pop('questions')
a_train = train.pop('answers')

# Convert date to datetime, extract the day of week, then one-hot encode it
train['date'] = pd.to_datetime(train['date'])
train['dayofweek'] = train['date'].dt.dayofweek
train = pd.get_dummies(train, columns=['dayofweek'])
test['date'] = pd.to_datetime(test['date'])
test['dayofweek'] = test['date'].dt.dayofweek
test = pd.get_dummies(test, columns=['dayofweek'])

# Add the id x weekday interaction terms, 7 in total
for i in range(7):
    train['id_dayofweek_%s'%i] = train['id'] * train['dayofweek_%s'%i]
    test['id_dayofweek_%s'%i] = test['id'] * test['dayofweek_%s'%i]

# Drop the date column
train.drop('date', axis=1, inplace=True)
test.drop('date', axis=1, inplace=True)

# Fit multivariate linear regression models and predict

# Predict questions
reg = LinearRegression()
reg.fit(train, q_train)
q_pred = reg.predict(test)

# Predict answers
reg = LinearRegression()
reg.fit(train, a_train)
a_pred = reg.predict(test)

# Write the predictions to my_LR_prediction.csv
submit['questions'] = q_pred
submit['answers'] = a_pred
submit.to_csv('my_LR_prediction.csv', index=False)
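
One way to read the model above: with an intercept, id, seven weekday dummies, and seven id x weekday terms, the regression is free to give every weekday its own intercept and its own slope. The sketch below checks this on synthetic data by comparing the joint least-squares fit with seven separate straight-line fits. Everything in it (the data, the variable names, the use of np.linalg.lstsq in place of LinearRegression) is an assumption made for illustration, not part of the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
ids = np.arange(1, 141)        # hypothetical day index
dow = (ids - 1) % 7            # hypothetical weekday code, 0..6
# Synthetic target: each weekday has its own level and its own growth rate.
y = (100 + 10 * dow) + (2.0 + 0.1 * dow) * ids + rng.normal(0, 3, ids.size)

dummies = (dow[:, None] == np.arange(7)).astype(float)   # one-hot weekday
cross = dummies * ids[:, None]                           # id x weekday terms
X = np.column_stack([np.ones(ids.size), ids, dummies, cross])

# The design matrix is rank deficient (the dummies sum to the intercept column,
# the cross terms sum to id); lstsq returns the minimum-norm solution, and the
# fitted values are unaffected by that redundancy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_joint = X @ coef

# Reference: fit one straight line per weekday.
pred_separate = np.empty_like(y)
for d in range(7):
    mask = dow == d
    slope, intercept = np.polyfit(ids[mask], y[mask], 1)
    pred_separate[mask] = slope * ids[mask] + intercept

max_gap = np.max(np.abs(pred_joint - pred_separate))
print(max_gap)   # expected to be at floating-point noise level
```

The two sets of fitted values coincide because both models span the same column space, which is why the interaction model can capture a weekday effect that changes as id grows.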



Linear Regression + k-Nearest Neighbors Hybrid Model (Python)

Average MAPE of this model's predictions: 0.03170

# -*- coding: utf-8 -*-

import numpy as np  # needed for np.sqrt below
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Build a nonlinear feature
cols_lr = ['id', 'sqrt_id']
train['sqrt_id'] = np.sqrt(train['id'])
test['sqrt_id'] = np.sqrt(test['id'])

# Build the day-of-week, month, and year features
train['date'] = pd.to_datetime(train['date'])
train['d_w'] = train['date'].dt.dayofweek
train['d_m'] = train['date'].dt.month
train['d_y'] = train['date'].dt.year
test['date'] = pd.to_datetime(test['date'])
test['d_w'] = test['date'].dt.dayofweek
test['d_m'] = test['date'].dt.month
test['d_y'] = test['date'].dt.year
cols_knn = ['d_w', 'd_m', 'd_y']

# Fit a linear model on the features ['id', 'sqrt_id'] to predict questions
reg = LinearRegression()
reg.fit(train[cols_lr], train['questions'])
q_fit = reg.predict(train[cols_lr])
q_pred = reg.predict(test[cols_lr])

# Fit a linear model on the features ['id', 'sqrt_id'] to predict answers
reg = LinearRegression()
reg.fit(train[cols_lr], train['answers'])
a_fit = reg.predict(train[cols_lr])
a_pred = reg.predict(test[cols_lr])

# Compute the training residuals for questions and answers
q_diff = train['questions'] - q_fit
a_diff = train['answers'] - a_fit

# Use the training residuals as the new targets and fit kNN models on cols_knn
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor()
reg.fit(train[cols_knn], q_diff)
q_pred_knn = reg.predict(test[cols_knn])
reg = KNeighborsRegressor()
reg.fit(train[cols_knn], a_diff)
a_pred_knn = reg.predict(test[cols_knn])

# Final prediction = linear trend + kNN residual correction;
# write the result to my_Lr_Knn_prediction.csv
submit['questions'] = q_pred + q_pred_knn
submit['answers'] = a_pred + a_pred_knn
submit.to_csv('my_Lr_Knn_prediction.csv', index=False)