# Import the recommendation module from Spark MLlib
import pyspark.mllib.recommendation as rd
Building an RDD of Rating records
# ALS.train expects an RDD of Rating records, so wrap each parsed row with rd.Rating
ratings = rawRatings.map(lambda line: rd.Rating(int(line[0]), int(line[1]), float(line[2])))
ratings.first()
Rating(user=196, product=242, rating=3.0)
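To make the wrapping step concrete without a Spark session, here is a minimal pure-Python sketch. It assumes the raw MovieLens line is tab-separated as user, movie, rating, timestamp; the `parse_rating` helper and the namedtuple stand-in for `rd.Rating` are illustrative only and are not part of the original notebook.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (same three field names)
Rating = namedtuple("Rating", ["user", "product", "rating"])

def parse_rating(line):
    """Split one tab-separated line and keep user ID, movie ID and rating."""
    fields = line.split("\t")
    return Rating(int(fields[0]), int(fields[1]), float(fields[2]))

# Assumed raw format: user, movie, rating, timestamp
sample = "196\t242\t3\t881250949"
first = parse_rating(sample)
print(first)
```

The timestamp field is simply dropped, since the model only needs the user, the product and the rating.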
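Training the ALS model
rank: the number of latent factors in the ALS model, i.e. the inner dimension $k$ of the factorization $A \approx UV^T$, where $A$ is $m \times n$, $U$ is $m \times k$, $V$ is $n \times k$, and $k \ll m, n$
iterations: the number of ALS iterations to run
lambda: the regularization parameter; larger values regularize more strongly and so limit over-fitting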
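The role of the three parameters can be illustrated with a small NumPy sketch of alternating least squares. This is a simplified illustration under stated assumptions, not MLlib's implementation: the `als` function, the tiny rating matrix and the plain (unweighted) regularization term are all made up for the example.

```python
import numpy as np

def als(R, observed, rank, iterations, lambda_, seed=0):
    """Approximate R (m x n) by U @ V.T using only the observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, rank))  # user factors, m x rank
    V = rng.normal(scale=0.1, size=(n, rank))  # item factors, n x rank
    reg = lambda_ * np.eye(rank)
    for _ in range(iterations):
        # Hold V fixed and solve a ridge regression for every user.
        for i in range(m):
            Vi = V[observed[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg, Vi.T @ R[i, observed[i]])
        # Hold U fixed and solve a ridge regression for every item.
        for j in range(n):
            Uj = U[observed[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg, Uj.T @ R[observed[:, j], j])
    return U, V

# Made-up 5 x 4 rating matrix; 0 marks a missing rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
observed = R > 0

U, V = als(R, observed, rank=2, iterations=10, lambda_=0.01)
errors = (U @ V.T)[observed] - R[observed]
train_rmse = float(np.sqrt(np.mean(errors ** 2)))
print(U.shape, V.shape, train_rmse)
```

Here `rank` fixes the number of columns of both factor matrices, `iterations` is the number of alternating passes, and `lambda_` is added to the diagonal of each least-squares system.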
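# Train the ALS model with 50 factors, 10 iterations and a regularization parameter of 0.01
model = rd.ALS.train(ratings, rank=50, iterations=10, lambda_=0.01)
model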
<pyspark.mllib.recommendation.MatrixFactorizationModel at 0x7f53cc29c710>
# Predict the rating user 789 would give movie 123
predictedRating = model.predict(789, 123)
predictedRating
3.1740832151065774
# Get the top 10 recommendations for user 789
topKRecs = model.recommendProducts(789, 10)
topKRecs
[Rating(user=789, product=429, rating=6.302989890089658), Rating(user=789, product=496, rating=5.782039583864358), Rating(user=789, product=651, rating=5.665266358968961), Rating(user=789, product=250, rating=5.551256887914674), Rating(user=789, product=64, rating=5.5336980239740186), Rating(user=789, product=603, rating=5.468600343790217), Rating(user=789, product=317, rating=5.442052952711695), Rating(user=789, product=480, rating=5.414042111530209), Rating(user=789, product=180, rating=5.413309515550101), Rating(user=789, product=443, rating=5.400024900653429)]
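Conceptually, a matrix factorization model scores a (user, item) pair with the dot product of the two factor vectors, and a top-K recommendation keeps the K highest-scoring items. The sketch below shows that idea in pure Python; the factor values and the `predict` and `recommend` helpers are hypothetical and do not reproduce MLlib's internals. Because a dot product is unbounded, scores are not confined to the 1 to 5 rating scale, which is consistent with the values above 5 in the output above.

```python
import heapq

def predict(user_vec, item_vec):
    """Score one (user, item) pair as the dot product of their factors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

def recommend(user_vec, item_factors, k):
    """Return the k (item_id, score) pairs with the highest scores."""
    scored = ((item_id, predict(user_vec, vec)) for item_id, vec in item_factors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical rank-2 factors for one user and three items
user_vec = [1.0, 2.0]
item_factors = {10: [1.0, 0.0], 20: [0.0, 1.0], 30: [2.0, 2.0]}

top2 = recommend(user_vec, item_factors, 2)
print(top2)
```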
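Inspecting the recommendations
First, load the movie data and build a mapping from movie ID to title.
Then compare the 10 movies this user rated highest with the top 10 movies recommended to the same user.
# Inspect the recommendations
movies = sc.textFile('/home/null/hadoop/data/ml-100k/u.item')
movies.first()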
'1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0'
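# Keep the first two fields (movie ID, title) of each line and build an ID-to-title dict; the IDs stay strings
titles_data = movies.map(lambda line: line.split("|")[:2]).collect()
titles = dict(titles_data)
titles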
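The same parsing can be checked on the sample line shown above without Spark. In this small sketch the `parse_title` helper is a name introduced here for illustration only.

```python
def parse_title(line):
    """Return (movie_id, title) from one pipe-delimited u.item line."""
    movie_id, title = line.split("|")[:2]
    return movie_id, title

sample = ("1|Toy Story (1995)|01-Jan-1995||"
          "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"
          "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0")

titles = dict([parse_title(sample)])
print(titles["1"])
```

Because the keys are strings, an integer product ID has to be converted with `str()` before it is used for a lookup.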
# Key the ratings by user ID and fetch every rating given by user 789
moviesForUser = ratings.keyBy(lambda rating: rating.user).lookup(789)
type(moviesForUser)
list
# Keep the 10 movies this user rated highest
moviesForUser = sorted(moviesForUser, key=lambda r: r.rating, reverse=True)[0:10]
moviesForUser
[Rating(user=789, product=127, rating=5.0), Rating(user=789, product=475, rating=5.0), Rating(user=789, product=9, rating=5.0), Rating(user=789, product=50, rating=5.0), Rating(user=789, product=150, rating=5.0), Rating(user=789, product=276, rating=5.0), Rating(user=789, product=129, rating=5.0), Rating(user=789, product=100, rating=5.0), Rating(user=789, product=741, rating=5.0), Rating(user=789, product=1012, rating=4.0)]
[(titles[str(r.product)], r.rating) for r in moviesForUser]
[('Godfather, The (1972)', 5.0), ('Trainspotting (1996)', 5.0), ('Dead Man Walking (1995)', 5.0), ('Star Wars (1977)', 5.0), ('Swingers (1996)', 5.0), ('Leaving Las Vegas (1995)', 5.0), ('Bound (1996)', 5.0), ('Fargo (1996)', 5.0), ('Last Supper, The (1995)', 5.0), ('Private Parts (1997)', 4.0)]
[(titles[str(r.product)], r.rating) for r in topKRecs]
[('Day the Earth Stood Still, The (1951)', 6.302989890089658), ("It's a Wonderful Life (1946)", 5.782039583864358), ('Glory (1989)', 5.665266358968961), ('Fifth Element, The (1997)', 5.551256887914674), ('Shawshank Redemption, The (1994)', 5.5336980239740186), ('Rear Window (1954)', 5.468600343790217), ('In the Name of the Father (1993)', 5.442052952711695), ('North by Northwest (1959)', 5.414042111530209), ('Apocalypse Now (1979)', 5.413309515550101), ('Birds, The (1963)', 5.400024900653429)]
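Evaluating the recommendation model
Mean Squared Error (MSE)
Defined as the sum of the squared errors divided by the number of ratings, where a squared error is the square of the difference between the predicted rating and the actual rating.
It directly measures how well the model predicts the target variable.
Root Mean Squared Error (RMSE)
The square root of the MSE, which puts the error back on the rating scale; it equals the standard deviation of the prediction errors when their mean is zero.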
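Both definitions are easy to verify by hand. The following pure-Python sketch computes them for two made-up (actual, predicted) pairs; the helper names are illustrative and are not part of MLlib.

```python
import math

def mean_squared_error(pairs):
    """Average of (actual - predicted)^2 over all (actual, predicted) pairs."""
    return sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs)

def root_mean_squared_error(pairs):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mean_squared_error(pairs))

pairs = [(3.0, 2.5), (4.0, 5.0)]  # made-up (actual, predicted) ratings
mse = mean_squared_error(pairs)   # (0.25 + 1.0) / 2
rmse = root_mean_squared_error(pairs)
print(mse, rmse)
```

Swapping the two values in each pair leaves both metrics unchanged, because the difference is squared.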
# Evaluation metrics: predict a rating for every (user, product) pair in the data
usersProducts = ratings.map(lambda r: (r.user, r.product))
predictions = model.predictAll(usersProducts).map(lambda r: ((r.user, r.product), r.rating))
predictions.first()
((316, 1084), 4.006135662882842)
# Join actual and predicted ratings on the (user, product) key
ratingsAndPredictions = ratings.map(lambda r: ((r.user, r.product), r.rating)).join(predictions)
ratingsAndPredictions.first()
((186, 302), (3.0, 2.7544572973050236))
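The shape of the joined records can be mimicked with plain dictionaries. This is only a sketch of the inner-join semantics with mostly made-up values, not Spark's implementation. Each joined value is a 2-tuple of (actual, predicted), so only indices 0 and 1 exist on it.

```python
# Made-up ratings keyed by (user, product)
actual = {(186, 302): 3.0, (196, 242): 3.0}
predicted = {(186, 302): 2.75, (316, 1084): 4.0}

# Inner join: keep only the keys present on both sides
joined = {key: (actual[key], predicted[key]) for key in actual.keys() & predicted.keys()}

for key, (actual_rating, predicted_rating) in joined.items():
    print(key, actual_rating, predicted_rating)
```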
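# Compute MSE and RMSE with MLlib's built-in evaluator
from pyspark.mllib.evaluation import RegressionMetrics
# Each joined value is (actual, predicted); MSE and RMSE are symmetric, so the pair order does not change them
predictionsAndTrue = ratingsAndPredictions.map(lambda line: (line[1][0], line[1][1]))
predictionsAndTrue.first()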
(3.0, 2.7544572973050236)
# MSE
regressionMetrics = RegressionMetrics(predictionsAndTrue)
regressionMetrics.meanSquaredError
0.08509832708963357
# RMSE
regressionMetrics.rootMeanSquaredError
0.2917161755707653
Reference:
Original article: https://segmentfault.com/a/1190000012494851