Spark's built-in logistic regression (LR) implementation (Python)

1. Spark LR

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
# assumes an existing SparkSession named `spark` (for example, the pyspark shell)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

import sklearn.datasets as datasets
import numpy as np
import time
from sklearn.linear_model import LogisticRegression as LR

def normalize(x):
    return (x - np.min(x))/(np.max(x) - np.min(x))

# input dataset: 1,000,000 samples, 10 cluster centers, 10 features
X, y = datasets.make_blobs(n_samples=1000000, centers=10,
                           n_features=10, random_state=0)

# min-max normalization (one global min and max over the whole array)
X_norm = normalize(X)
# 80/20 train/test split
split = int(len(X_norm) * 0.8)
X_train, X_test = X_norm[:split], X_norm[split:]
y_train, y_test = y[:split], y[split:]

y_train = y_train.reshape(-1,1)
# build the Spark DataFrames (label column plus a dense feature vector per row)
df = np.concatenate([y_train,X_train], axis=1)
train_df = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)
spark_train = spark.createDataFrame(train_df,schema=["label", "features"])

test_df = map(lambda x: (Vectors.dense(x),), X_test)
spark_test = spark.createDataFrame(test_df,schema=["features"])
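
The normalize helper above rescales with a single minimum and maximum taken over the whole array, so every feature shares one scale. The following is a minimal sketch on a toy array that contrasts this with scaling each column separately. The name normalize_per_feature and the toy values are assumptions made for illustration and are not part of the original experiment.

```python
import numpy as np

def normalize(x):
    # Same formula as in the listing above: one global min and max.
    return (x - np.min(x)) / (np.max(x) - np.min(x))

def normalize_per_feature(x):
    # Hypothetical variant: scale every column (feature) independently.
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return (x - col_min) / (col_max - col_min)

# Toy data: two features with different ranges.
toy = np.array([[0.0, 10.0],
                [5.0, 20.0],
                [10.0, 30.0]])

global_scaled = normalize(toy)
per_feature_scaled = normalize_per_feature(toy)

print(global_scaled)
print(per_feature_scaled)
```

With global scaling the first column never reaches 1, because the overall maximum comes from the second column. With per-feature scaling each column spans the full [0, 1] range. Which one suits a given dataset is a judgment call; the original experiment uses the global form.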

Running the model

# train model
st = time.time()

lr = LogisticRegression()  # default settings; optionally e.g. maxIter=10, regParam=0.001
pipeline = Pipeline(stages=[lr])
model = pipeline.fit(spark_train)

prediction = model.transform(spark_test)

# compute accuracy
selected = prediction.select("prediction")
count = 0
for i, row in enumerate(selected.collect()):
    pred = row["prediction"]  # take the scalar prediction out of the Row
    if pred == y_test[i]:
        count += 1
print("acc:{:.4f}".format(count / len(y_test)))

et = time.time()
print("used:{:.4f}".format(et-st))

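Result:
The run took about 30 seconds. Since the computation only ran within the Spark framework, it was presumably in local standalone mode; a distributed run might be faster.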

acc:1.0000
used:30.2634
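
The 30 s figure covers everything between st and et: fitting, prediction, collecting the rows to the driver, and the Python accuracy loop. To see which stage dominates, each stage could be timed on its own. The following is a minimal sketch using only the standard library. The timed helper and the stage names are hypothetical, and the two computations inside the with blocks are placeholders standing in for the real Spark calls.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds

@contextmanager
def timed(stage):
    # Hypothetical helper: record the wall-clock time of the enclosed block.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("fit"):
    # placeholder for: model = pipeline.fit(spark_train)
    total = sum(i * i for i in range(100000))

with timed("predict_and_collect"):
    # placeholder for:
    # rows = model.transform(spark_test).select("prediction").collect()
    squares = [i * i for i in range(10000)]

for stage, seconds in timings.items():
    print("{}: {:.4f}s".format(stage, seconds))
```

Spark DataFrame transformations are lazy, so transform on its own returns almost immediately and the prediction work only happens at collect. That is why the sketch groups the two into one stage.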

2. sklearn's LR

st = time.time()

def accuracy(pred, true):
    count = 0
    for i in range(len(pred)):
        if(pred[i] == true[i]):
            count += 1
    return count/len(pred)

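# model 2: scikit-learn logistic regression
clf_lr = LR()
# y_train was reshaped to a column above; sklearn expects a 1-D label array
clf_lr.fit(X_train, y_train.ravel())
y_pred2 = clf_lr.predict(X_test)
print("acc2", accuracy(y_pred2, y_test))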

et = time.time()
print("used:{:.4f}".format(et-st))

Result:

acc2 1.0
used:58.3716
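
The element-by-element accuracy function above can also be written as a single vectorized NumPy comparison, which tends to be faster on a test set of this size. The following is a minimal sketch on toy arrays. The name accuracy_vectorized and the toy labels are assumptions for illustration only.

```python
import numpy as np

def accuracy(pred, true):
    # Loop version, as in the listing above.
    count = 0
    for i in range(len(pred)):
        if pred[i] == true[i]:
            count += 1
    return count / len(pred)

def accuracy_vectorized(pred, true):
    # Hypothetical equivalent: fraction of positions where the labels agree.
    pred = np.asarray(pred).ravel()
    true = np.asarray(true).ravel()
    return float(np.mean(pred == true))

# Toy labels: four of the five predictions are correct.
toy_pred = np.array([0, 1, 2, 2, 1])
toy_true = np.array([0, 1, 2, 1, 1])

print(accuracy(toy_pred, toy_true))
print(accuracy_vectorized(toy_pred, toy_true))
```

Flattening both arrays with ravel also avoids an accidental shape mismatch when one side is a column of shape (n, 1) and the other is flat.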

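With a small dataset Spark takes longer, because Spark's run time includes communication overhead. Only after the sample count was raised to 1,000,000 did Spark's advantage barely begin to show.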

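Conclusion: once the dataset is large enough, a model run under the Spark framework is faster than an ordinary implementation (in this test, about 30 s versus 58 s).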

3. How parallel LR computation works

The weight matrix W is split into several smaller matrices, which are then computed on at the same time.
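
The splitting idea can be illustrated with plain NumPy. In the sketch below, W is assumed to hold one row of weights per class and is cut into row blocks; each block produces its own slice of the class scores, and the slices are joined afterwards. The shapes, the block count of 3 and the random values are assumptions chosen for illustration. This shows only the splitting idea and is not a description of how Spark implements LR internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 6, 10, 10

X = rng.normal(size=(n_samples, n_features))   # toy inputs
W = rng.normal(size=(n_classes, n_features))   # one weight row per class

# Reference result: class scores computed with the full matrix.
full_scores = X @ W.T

# Split W by rows into 3 smaller matrices (sizes 4, 3 and 3).
W_blocks = np.array_split(W, 3, axis=0)

# Each block is independent of the others, so in a parallel setting
# every product below could run on a different worker.
block_scores = [X @ W_block.T for W_block in W_blocks]

# Joining the partial results reproduces the full computation.
combined_scores = np.concatenate(block_scores, axis=1)

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(combined_scores)
print(np.allclose(combined_scores, full_scores))
```

The list comprehension here runs the blocks one after another; it only demonstrates that the blocks do not depend on each other, which is the property a parallel implementation would rely on.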

