regroup

Syntax

regroup(X, label, func, [byRow=true])

Arguments

X is a matrix.

label is a vector indicating the column/row labels based on which the matrix is grouped and aggregated. When byRow = true, the length of label must match the number of rows of X. Otherwise, it must match the number of columns of X.

func is a unary aggregate function called on each group of the matrix. It can be built-in or user-defined.

byRow is a Boolean. The default value is true, indicating that the matrix will be grouped and aggregated by rows. False means to group and aggregate matrix by columns.

Details

Group the data of a matrix based on user-specified column/row labels and apply aggregation on each group.

regroup is similar to the SQL keyword “group by”, except that “group by” is applied only on tables whereas this function is applied on matrices.

Note: It is recommended that the func parameter be specified as a built-in aggregate function as built-in functions are optimized internally for optimal performance. (see Example 2)

Examples

Example 1. Perform grouped aggregation on a matrix by row/column labels.

$ m = rand(20, 4:5)
$ m;

col1	col2	col3	col4	col5
11	6	6	10	4
6	7	5	2	16
2	16	14	19	9
17	6	13	10	2

// by column labels
$ label = `A`A`B`A`B
$ regroup(X=m, label=label, func=firstNot, byRow=false)

A	B
11	6
6	5
2	14
17	13

// by row labels
 $ label = 1 2 1 2
 $ regroup(X=m, label=label, func=firstNot, byRow=true)

label	col1	col2	col3	col4	col5
1	11	6	6	10	4
2	6	7	5	2	16

Example 2. Compare the performance between applying a built-in function and a user-defined function.

$ m = rand(1000.0, 10000)$100:100
$ defg my_avg(v):avg(v)

$ timer(1000) regroup(m, take(1 2 3 4 5, 100), avg)
Time elapsed: 176.175 ms

$ timer(1000) regroup(m, take(1 2 3 4 5, 100), my_avg)
Time elapsed: 1062.553 ms

Example 3. Aggregate the panel data by minute

$ n=1000
$ timestamp = 09:00:00 + rand(10000, n).sort!()
$ id = take(`st1`st2`st3, n)
$ vol = 100 + rand(10.0, n)
$ vt = table(timestamp, id, vol)
$ m = exec vol from vt pivot by timestamp, id
$ regroup(m, minute(m.rowNames()), avg)