kmeans(X, k, [maxIter=300], [randomSeed], [init='random'])


X is a table. Each row is an observation and each column is a feature.

k is a positive integer indicating the number of clusters to form.

maxIter is a positive integer indicating the maximum number of iterations of the k-means algorithm for a single run. The default value is 300.

randomSeed is an integer that seeds the random number generator used for centroid initialization. The default value is NULL.

init is a STRING scalar or a matrix specifying how the centroids are initialized. The default value is “random”.

  • If init is a STRING scalar, it can be “random” or “k-means++”: “random” chooses observations at random from X as the initial centroids; “k-means++” generates the initial centroids using the k-means++ algorithm.

  • If init is a matrix, it specifies the starting locations of the centroids. It must have k rows and the same number of columns as X.
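The three forms of init can be illustrated with a short sketch. It is written in Python using only the standard library, not in DolphinDB script, and it is not DolphinDB's implementation: the names init_centroids and sq_dist are hypothetical, and the k-means++ branch follows the textbook description (each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(X, k, init="random", seed=None):
    """Return k starting centroids for the rows of X (illustrative only)."""
    rng = random.Random(seed)  # seed plays the role of randomSeed
    if not isinstance(init, str):
        # Explicit starting locations: k rows, same number of columns as X.
        if len(init) != k or any(len(row) != len(X[0]) for row in init):
            raise ValueError("init must have k rows and as many columns as X")
        return [list(row) for row in init]
    if init == "random":
        # Choose k distinct observations uniformly at random.
        return [list(row) for row in rng.sample(X, k)]
    if init == "k-means++":
        # First centroid: uniform choice. Later centroids: weighted by the
        # squared distance to the nearest centroid chosen so far.
        centers = [list(rng.choice(X))]
        while len(centers) < k:
            weights = [min(sq_dist(row, c) for c in centers) for row in X]
            if sum(weights) == 0:
                # All points coincide with a chosen centroid; fall back to uniform.
                centers.append(list(rng.choice(X)))
            else:
                centers.append(list(rng.choices(X, weights=weights, k=1)[0]))
        return centers
    raise ValueError("unknown init: " + init)
```

In this sketch, passing the same seed twice yields the same starting centroids, which is the reproducibility that a fixed randomSeed is meant to provide.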


Perform k-means clustering on X. Return a dictionary with the following keys:

  • centers: a k-by-m (m is the number of columns of X) matrix. Each row contains the coordinates of a cluster center.

  • predict: a function of FUNCTIONDEF type that assigns observations to clusters.

  • modelName: string “KMeans”.

  • model: a variable of RESOURCE type. It is an internal binary resource used by the predict function.

  • labels: a vector indicating to which cluster each row of X belongs.
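How centers and labels relate to each other, and what maxIter bounds, can be seen in a minimal sketch of the standard k-means (Lloyd) iteration. This is Python using only the standard library and is not DolphinDB's implementation: the function names are hypothetical, labels are 0-based indices into centers (the indexing DolphinDB uses is not stated here), and stopping early once the labels no longer change is an assumption of the sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_labels(X, centers):
    """For every row of X, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(row, centers[j]))
            for row in X]

def lloyd(X, centers, max_iter=300):
    """Alternate centroid updates and reassignment for at most max_iter rounds."""
    centers = [list(c) for c in centers]
    labels = assign_labels(X, centers)
    for _ in range(max_iter):
        for j in range(len(centers)):
            members = [row for row, lab in zip(X, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign_labels(X, centers)
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
    return {"centers": centers, "labels": labels}

# Two well-separated pairs of points, started from one point of each pair.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
result = lloyd(X, [[0.0, 0.0], [10.0, 10.0]])
print(result["centers"])  # [[0.0, 0.5], [10.0, 10.5]]
print(result["labels"])   # [0, 0, 1, 1]
```

Each row of centers is the mean of the observations whose label points at that row, which is why centers has k rows and as many columns as X, and labels has one entry per row of X.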


Run k-means clustering on simulated data made up of three groups of 50 normally distributed points each:

$ t = table(100:0, `x0`x1, [DOUBLE, DOUBLE])
$ x0 = norm(1.0, 1.0, 50)
$ x1 = norm(1.0, 1.5, 50)
$ insert into t values (x0, x1)
$ x0 = norm(2.0, 1.0, 50)
$ x1 = norm(-1.0, 1.5, 50)
$ insert into t values (x0, x1)
$ x0 = norm(-1.0, 1.0, 50)
$ x1 = norm(-3.0, 1.5, 50)
$ insert into t values (x0, x1);

$ model = kmeans(t, 3);
$ model;


#0        #1
--------- ---------
-1.048027 -3.809539
1.110899  1.24216
1.677974  -1.19158