Question
PROBNORM: explanation
The PROBNORM function in SAS returns the probability that an observation from the standard normal distribution is less than or equal to x.
Is there an equivalent function in PySpark?
Recommended answer
I'm afraid there is no such method implemented in PySpark.
However, you can use a Pandas UDF to define your own custom function with standard Python packages. Here we use the scipy.stats.norm module to get cumulative probabilities from the standard normal distribution.
Versions I'm using:
- Spark 3.1.1
- pandas 1.1.5
- scipy 1.5.2
Example code
```python
import pandas as pd
from scipy.stats import norm
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf

# create sample data
df = spark.createDataFrame([
    (1, 0.00),
    (2, -1.23),
    (3, 4.56),
], ['id', 'value'])

# define your custom Pandas UDF
@pandas_udf('double')
def probnorm(s: pd.Series) -> pd.Series:
    return pd.Series(norm.cdf(s))

# create a new column using the Pandas UDF
df = df.withColumn('pnorm', probnorm(F.col('value')))
df.show()
```

```
+---+-----+-------------------+
| id|value|              pnorm|
+---+-----+-------------------+
|  1|  0.0|                0.5|
|  2|-1.23|0.10934855242569191|
|  3| 4.56| 0.9999974423189606|
+---+-----+-------------------+
```
Edit
If scipy is not properly installed on your workers as well, you can use Python's built-in math module and a little statistics knowledge.
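For reference, the statistics involved is the standard identity linking the normal CDF to the error function, which Python exposes as math.erf. The symbols \Phi, \mu and \sigma are notation introduced here for illustration:

```latex
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\,dt
        = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right],
\qquad
P(X \le x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]
\quad \text{for } X \sim \mathcal{N}(\mu, \sigma^{2}).
```

With \mu = 0 and \sigma = 1 this is the probability that PROBNORM returns.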
```python
import math
import pyspark.sql.functions as F
from pyspark.sql.functions import udf

def normal_cdf(x, mu=0, sigma=1):
    """
    Cumulative distribution function for the normal distribution
    with mean `mu` and standard deviation `sigma`
    """
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

# declare the return type explicitly; udf() returns strings by default
my_udf = udf(normal_cdf, 'double')

df = df.withColumn('pnorm', my_udf(F.col('value')))
df.show()
```

```
+---+-----+-------------------+
| id|value|              pnorm|
+---+-----+-------------------+
|  1|  0.0|                0.5|
|  2|-1.23|0.10934855242569197|
|  3| 4.56| 0.9999974423189606|
+---+-----+-------------------+
```
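The results are effectively the same; they differ only in the last digit of the second row, which is floating-point rounding.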
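One caveat with the math-based function: a plain Python UDF receives SQL NULL values as None, and math.erf(None) raises a TypeError. Below is a minimal sketch of a null-safe variant, checked without Spark against the standard library's statistics.NormalDist (Python 3.8+). The name normal_cdf_nullsafe is hypothetical and is not part of the original answer.

```python
import math
from statistics import NormalDist


def normal_cdf_nullsafe(x, mu=0.0, sigma=1.0):
    """Normal CDF via math.erf; returns None when the input is missing.

    Hypothetical helper for illustration, not part of the original answer.
    """
    if x is None:
        return None
    return (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0)))) / 2.0


# Cross-check against the standard library's normal distribution.
reference = NormalDist()  # mean 0, standard deviation 1
for value in (0.0, -1.23, 4.56):
    assert math.isclose(
        normal_cdf_nullsafe(value), reference.cdf(value), rel_tol=1e-12
    )

# A missing input stays missing instead of raising a TypeError.
assert normal_cdf_nullsafe(None) is None
```

Assuming the same setup as above, this could be registered with udf(normal_cdf_nullsafe, 'double'), so that null inputs produce null outputs rather than failing the job.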