group by

We use keyword group by followed by column name(s) to group the records and to apply functions on each group of records. The order of columns in the table generated by a SQL statement is: The grouping columns not specified by the select clause are displayed first, then follow the columns specified by both the group by clause and the select clause. Therefore unlike in standard SQL, we don’t need to include group by columns explicitly in select statements.

Examples

$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
$ t1 = table(timestamp, sym, qty, price);

$ t1;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.2

09:34:16

C

1300

50.76

09:34:26

C

2500

50.32

09:38:12

C

8800

51.29

select count(sym) as counts from t1 group by sym;

sym

counts

C

4

MS

3

IBM

2

select top 2:3 count(sym) as counts from t1 group by sym;

sym

counts

IBM

2

select max(price), sym from t1 group by sym, minute(timestamp)

minute_timestamp

max_price

sym

09:34m

50.76

C

09:36m

30.02

MS

09:32m

174.97

IBM

09:35m

175.23

IBM

09:38m

51.29

C

select avg(qty) from t1 group by sym;

sym

avg_qty

C

3700

MS

2400

IBM

6100

select wavg(price, qty) as vwap, sum(qty) from t1 group by sym;

sym

vwap

sum_qty

C

50.828378

14800

IBM

175.085082

12200

MS

29.726389

7200

select wsum(price, qty) as dollarVolume, sum(qty) from t1 group by minute(timestamp) as ts;

ts

dollarVolume

sum_qty

09:32m

1.189796e+006

6800

09:34m

300908

6000

09:35m

946242

5400

09:36m

214030

7200

09:38m

451352

8800

select sum(qty) from t1 group by sym, timestamp.minute() as minute;

sym

minute

sum_qty

C

09:34m

6000

C

09:38m

8800

IBM

09:32m

6800

IBM

09:35m

5400

MS

09:36m

7200

When group by is used with order by, the order by column(s) can be one or more of group by columns, or one or more of the new columns generated by group by calculation.

select sum(qty) from t1 group by sym, timestamp.minute() as minute order by minute;

sym

minute

sum_qty

C

09:00m

1100

IBM

09:32m

6800

C

09:34m

6000

IBM

09:35m

5400

MS

09:36m

7200

C

09:38m

8800

In the examples above, the result of the function for each group is a scalar. In other cases, a function may output a vector or a dictionary rather than a scalar. For examples, the function stat outputs a dictionary, and the function ols outputs a coefficient vector or a dictionary that includes parameter estimates, t-stat, R square, etc. To output the results in multiple columns, we need to convert the vector/dictionary output to multiple scalars. In other words, we need to convert a composite column into multiple columns. This can be accomplished with “as” keyword and a constant string vector for column names.

$ y=1..15
$ factor1=3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
$ factor2=1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
$ id=take(1 2 3, 15).sort();
$ t=table(id, y, factor1, factor2);

$ t;

id

y

factor1

factor2

1

1

3.2

1.7

1

2

1.2

1.3

1

3

5.9

4.2

1

4

6.9

6.8

1

5

11.1

9.2

2

6

9.6

1.3

2

7

1.4

1.4

2

8

7.3

7.8

2

9

2

7.9

2

10

0.1

9.9

3

11

6.1

9.3

3

12

2.9

4.6

3

13

6.3

7.8

3

14

8.4

2.4

3

15

5.6

8.7

select ols(y,(factor1,factor2),true,0) as `int`factor1`factor2 from t group by id;

id

int

factor1

factor2

1

1.063991

-0.258685

0.732795

2

6.886877

-0.148325

0.303584

3

11.833867

0.272352

-0.065526

select ols(y,(factor1,factor2),true,2).Coefficient.tstat[1:] as `t1`t2 from t group by id;

id

t1

t2

1

-0.891868

2.253451

2

-5.73315

11.433117

3

0.510866

-0.183903

To skip certain elements from the output of the function, we can leave the names of the corresponding elements empty.

select ols(y,(factor1,factor2),true,2).Coefficient.beta as ``factor1`factor2 from t group by id;

id

factor1

factor2

1

-0.258685

0.732795

2

-0.148325

0.303584

3

0.272352

-0.065526

To customize the output, we can write a simple wrapper. In the following example, the output includes coefficient estimates and R square:

$ def myols(y,x) {
$   r=ols(y,x,true,2)
$   return r.Coefficient.beta join r.RegressionStat.statistics[0]
$ }

$ select myols(y,(factor1,factor2)) as `int`factor1`factor2`R2 from t group by id;

id

int

factor1

factor2

R2

1

1.063991

-0.258685

0.732795

0.946056

2

6.886877

-0.148325

0.303584

0.992413

3

11.833867

0.272352

-0.065526

0.144837