lassoBasic

Syntax

lassoBasic(Y, X, [mode=0], [alpha=1.0], [intercept=true], [normalize=false], [maxIter=1000], [tolerance=0.0001], [positive=false], [swColName], [checkInput=true])

Details

Perform lasso regression.

Minimize the following objective function:

\[\frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1\]
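
To make the objective concrete, the sketch below evaluates it for a fixed coefficient vector. It is written in plain Python, not DolphinDB script. The function name `lasso_objective` and the toy data are illustrative assumptions; the intercept and sample weights are left out so that the code matches the formula term by term.

```python
def lasso_objective(y, X, w, alpha=1.0):
    """Evaluate 1/(2n) * ||y - Xw||_2^2 + alpha * ||w||_1 for plain Python lists."""
    n = len(y)
    # Squared L2 norm of the residual vector y - Xw.
    sse = 0.0
    for row, target in zip(X, y):
        prediction = sum(x_ij * w_j for x_ij, w_j in zip(row, w))
        sse += (target - prediction) ** 2
    # L1 norm of the coefficients, scaled by alpha.
    penalty = alpha * sum(abs(w_j) for w_j in w)
    return sse / (2 * n) + penalty

# Toy data (hypothetical): two observations, two regressors.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
w = [0.5, 1.0]
objective = lasso_objective(y, X, w, alpha=0.1)
print(objective)
```

For this toy input the residual sum of squares is 1.25, so the first term is 1.25/4 = 0.3125, the penalty is 0.1 × 1.5 = 0.15, and the objective is about 0.4625.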

Arguments

Y is a numeric vector indicating the dependent variable.

X is a numeric vector/tuple/matrix/table, indicating the independent variables.

  • When X is a vector/tuple, its length must be equal to the length of Y.

  • When X is a matrix/table, its number of rows must be equal to the length of Y.

intercept is a Boolean variable indicating whether the regression includes the intercept. If it is true, the system automatically adds a column of 1’s to X to generate the intercept. The default value is true.

mode is an integer indicating the contents of the output. It can take one of the following three values:

  • 0 (default): a vector of the coefficient estimates.

  • 1: a table with coefficient estimates, standard errors, t-statistics, and p-values.

  • 2: a dictionary with the following keys: ANOVA, RegressionStat, Coefficient, and Residual.

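ANOVA (one-way analysis of variance), where n is the number of observations and p is the number of independent variables:

Source of Variance  DF (degrees of freedom)  SS (sum of squares)             MS (mean square)                      F (F-score)  Significance
------------------  -----------------------  ------------------------------  ------------------------------------  -----------  ------------
Regression          p                        sum of squares regression, SSR  regression mean square, MSR = SSR/p   MSR/MSE      p-value
Residual            n-p-1                    sum of squares error, SSE       mean square error, MSE = SSE/(n-p-1)
Total               n-1                      sum of squares total, SST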
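
The relationships in the ANOVA table can be checked against the numbers printed for mode = 2 in the Examples section. The following sketch is plain Python, not DolphinDB script, and copies those values by hand. The variable names are illustrative, and the closed-form p-value used here holds only when the regression has exactly 2 degrees of freedom, as in that example.

```python
# Values copied from the ANOVA output in the Examples section (n = 7, p = 2).
n, p = 7, 2
ssr = 4165.242566095043912   # SS, Regression row
sse = 268.685678884843582    # SS, Residual row

df_regression = p            # DF, Regression row
df_residual = n - p - 1      # DF, Residual row

msr = ssr / df_regression    # MS, Regression row
mse = sse / df_residual      # MS, Residual row
f_score = msr / mse          # F

# Upper-tail probability of an F(2, d2) distribution: (1 + 2F/d2)^(-d2/2).
# This closed form is specific to 2 numerator degrees of freedom.
significance = (1 + 2 * f_score / df_residual) ** (-df_residual / 2)

print(msr, mse, f_score, significance)
```

Each mean square is the sum of squares divided by its degrees of freedom, and the results agree with the MS, F, and Significance columns printed in the example.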

RegressionStat (regression statistics)

Item          Description
------------  -----------
R2            R-squared.
AdjustedR2    R-squared adjusted for the degrees of freedom, based on the sample size relative to the number of terms in the regression model.
StdError      The residual standard error (deviation), corrected for the degrees of freedom.
Observations  The sample size.
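
As a worked check, the regression statistics printed for mode = 2 in the Examples section can be reproduced from the ANOVA values in the same output. The sketch below is plain Python, not DolphinDB script, with all inputs copied by hand. The formulas are the standard ones and are shown only because they agree with the printed numbers; they are not taken from the function's implementation.

```python
import math

# Values copied from the mode = 2 output in the Examples section.
n, p = 7, 2                      # Observations, number of independent variables
r2 = 0.931480447323074           # R2
ssr = 4165.242566095043912       # SS, Regression row
sst = 4471.637142857141952       # SS, Total row
mse = 67.171419721210895         # MS, Residual row

# In this output, R2 equals the regression sum of squares over the total.
r2_from_anova = ssr / sst

# Adjusted R-squared: scale (1 - R2) by total over residual degrees of freedom.
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residual standard error: square root of the residual mean square.
std_error = math.sqrt(mse)

print(r2_from_anova, adjusted_r2, std_error)
```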

Coefficient

Item      Description
--------  -----------
factor    Independent variables
beta      Estimated regression coefficients
stdError  Standard error of the regression coefficients
tstat     t statistic, indicating the significance of the regression coefficients
pvalue    p-value of the t statistic
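
The link between the columns of the Coefficient table can be seen in the example output: each t statistic is the estimate divided by its standard error. A minimal plain-Python sketch (not DolphinDB script), with the rows copied by hand from the Examples section:

```python
# Rows copied from the Coefficient table in the Examples section:
# (factor, beta, stdError, tstat as printed)
coefficients = [
    ("intercept", -9.133706333069543, 5.247492365971091, -1.740584968222107),
    ("x1", 2.535935196073186, 1.835793667840723, 1.38138356205138),
    ("x2", 0.189298948643987, 0.410201227095842, 0.461478260277749),
]

# The t statistic is the estimate divided by its standard error.
tstats = {factor: beta / std_error for factor, beta, std_error, _ in coefficients}

for factor, _, _, printed in coefficients:
    print(factor, tstats[factor], printed)
```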

Residual: the difference between each actual value and the corresponding predicted value (actual minus predicted).
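
The sign convention can be confirmed from the Examples section: subtracting the fitted values from y reproduces the Residual vector printed there. The sketch below is plain Python, not DolphinDB script, with the data and coefficient estimates copied by hand.

```python
# Data and coefficient estimates copied from the Examples section.
x1 = [1, 3, 5, 7, 11, 16, 23]
x2 = [2, 8, 11, 34, 56, 54, 100]
y = [0.1, 4.2, 5.6, 8.8, 22.1, 35.6, 77.2]
intercept, beta1, beta2 = -9.133706333069543, 2.535935196073186, 0.189298948643987

# Predicted value for each observation, then residual = actual - predicted.
predicted = [intercept + beta1 * a + beta2 * b for a, b in zip(x1, x2)]
residuals = [actual - fitted for actual, fitted in zip(y, predicted)]

print(residuals)
```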

alpha is a floating-point number representing the constant that multiplies the L1-norm in the objective function. The default value is 1.0.

normalize is a Boolean value. If true, the regressors are normalized before regression by subtracting the mean and dividing by the L2-norm. If intercept = false, this parameter is ignored. The default value is false.
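
The normalization step described above can be sketched for a single regressor column as follows. This is plain Python, not DolphinDB script. The helper name `normalize_column` is hypothetical, and taking the L2-norm over the centered values (rather than the raw values) is an assumption, since the text does not say which is used.

```python
import math

def normalize_column(column):
    """Center a column on its mean, then scale it to unit L2 norm."""
    mean = sum(column) / len(column)
    centered = [value - mean for value in column]
    # Assumption: the L2 norm is taken over the centered values.
    norm = math.sqrt(sum(value ** 2 for value in centered))
    if norm == 0.0:
        # A constant column has nothing to scale; leave it centered at zero.
        return centered
    return [value / norm for value in centered]

normalized = normalize_column([1.0, 2.0, 3.0])
print(normalized)
```

For the column 1, 2, 3 the mean is 2 and the centered values are -1, 0, 1, whose L2-norm is the square root of 2, so the first and last entries become about -0.7071 and 0.7071.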

maxIter is a positive integer indicating the maximum number of iterations. The default value is 1000.

tolerance is a floating-point number. Iteration stops when the improvement in the objective function value is smaller than tolerance. The default value is 0.0001.

positive is a Boolean value indicating whether to force the coefficient estimates to be positive. The default value is false.
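
To show how alpha, maxIter, tolerance, and positive interact in an iterative solver, here is a minimal plain-Python sketch (not DolphinDB script). This page does not say which algorithm lassoBasic uses; cyclic coordinate descent is swapped in purely as an illustration. The function names are hypothetical, and the intercept, normalization, and sample weights are omitted.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha=1.0, max_iter=1000, tolerance=0.0001, positive=False):
    """Cyclic coordinate descent for the lasso objective, without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p

    def objective(coefs):
        sse = sum((y[i] - sum(X[i][j] * coefs[j] for j in range(p))) ** 2 for i in range(n))
        return sse / (2 * n) + alpha * sum(abs(c) for c in coefs)

    previous = objective(w)
    for _ in range(max_iter):                      # max_iter caps the number of sweeps
        for j in range(p):
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            if scale == 0.0:
                continue                           # all-zero column: coefficient stays 0
            # Correlation of column j with the residual that ignores column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            w[j] = soft_threshold(rho, alpha) / scale
            if positive:
                w[j] = max(w[j], 0.0)              # clip negative estimates to 0
        current = objective(w)
        if previous - current < tolerance:         # improvement below tolerance: stop
            break
        previous = current
    return w

# Toy data (hypothetical): y = 2 * x exactly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
w_unpenalized = lasso_coordinate_descent(X, y, alpha=0.0)
w_penalized = lasso_coordinate_descent(X, y, alpha=1.0)
w_clipped = lasso_coordinate_descent(X, [-2.0, -4.0, -6.0], alpha=1.0, positive=True)
print(w_unpenalized, w_penalized, w_clipped)
```

With alpha = 0 the sketch recovers the least-squares slope of 2. With alpha = 1 the slope shrinks to 25/14, and with positive=True the negative slope of the flipped data is clipped to 0.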

swColName is a STRING indicating a column name of ds. The specified column is used as the sample weight. If it is not specified, the sample weight is treated as 1.

checkInput is a BOOLEAN value that determines whether to validate the input parameters Y, X, and swColName.

  • If checkInput = true (default), the inputs are checked for invalid values, and an error is thrown if a NULL value exists.

  • If checkInput = false, invalid values are not checked.

It is recommended to specify checkInput = true. If it is false, you must ensure that the input parameters contain no invalid values and that no invalid values are generated during intermediate calculations; otherwise, the returned model may be inaccurate.
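
If checkInput is turned off, a check of this kind has to be done by the caller before the data is passed in. The following plain-Python sketch (not DolphinDB script) shows the general idea. The helper `check_input` is hypothetical, and Python's None and NaN stand in for DolphinDB NULL values as an assumption.

```python
import math

def check_input(y, X):
    """Raise ValueError if y or any row of X contains a missing (None) or NaN value."""
    def is_invalid(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    if any(is_invalid(value) for value in y):
        raise ValueError("Y contains an invalid value")
    if any(is_invalid(value) for row in X for value in row):
        raise ValueError("X contains an invalid value")

check_input([0.1, 4.2], [[1, 2], [3, 8]])          # valid input: passes silently
try:
    check_input([0.1, None], [[1, 2], [3, 8]])     # missing value in Y
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```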

Examples

$ x1=1 3 5 7 11 16 23
$ x2=2 8 11 34 56 54 100
$ y=0.1 4.2 5.6 8.8 22.1 35.6 77.2;

$ print(lassoBasic(y, (x1,x2), mode = 0));
[-9.133706333069543,2.535935196073186,0.189298948643987]


$ print(lassoBasic(y, (x1,x2), mode = 1));
factor    beta               stdError          tstat              pvalue
--------- ------------------ ----------------- ------------------ -----------------
intercept -9.133706333069543 5.247492365971091 -1.740584968222107 0.156730846105191
x1        2.535935196073186  1.835793667840723 1.38138356205138   0.239309472176311
x2        0.189298948643987  0.410201227095842 0.461478260277749  0.66843504931137


$ print(lassoBasic(y, (x1,x2), mode = 2));
Coefficient->
factor    beta               stdError          tstat              pvalue
--------- ------------------ ----------------- ------------------ -----------------
intercept -9.133706333069543 5.247492365971091 -1.740584968222107 0.156730846105191
x1        2.535935196073186  1.835793667840723 1.38138356205138   0.239309472176311
x2        0.189298948643987  0.410201227095842 0.461478260277749  0.66843504931137

RegressionStat->
item         statistics
------------ -----------------
R2           0.931480447323074
AdjustedR2   0.897220670984611
StdError     8.195817208870076
Observations 7

ANOVA->
Breakdown  DF SS                   MS                   F                  Significance
---------- -- -------------------- -------------------- ------------------ -----------------
Regression 2  4165.242566095043912 2082.621283047521956 31.004574440904473 0.003672076469395
Residual   4  268.685678884843582  67.171419721210895
Total      6  4471.637142857141952

Residual->
[6.319173239708383,4.21150915569809,-0.028258082380245,-6.254004293338318,-7.262321947798779,-6.063400030876729,9.077301958987561]