group by
We use keyword group by followed by column name(s) to group the records and to apply functions on each group of records. The order of columns in the table generated by a SQL statement is: The grouping columns not specified by the select clause are displayed first, then follow the columns specified by both the group by clause and the select clause. Therefore unlike in standard SQL, we don’t need to include group by columns explicitly in select statements.
Examples
$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
$ t1 = table(timestamp, sym, qty, price);
$ t1;
timestamp |
sym |
qty |
price |
---|---|---|---|
09:34:07 |
C |
2200 |
49.6 |
09:36:42 |
MS |
1900 |
29.46 |
09:36:51 |
MS |
2100 |
29.52 |
09:36:59 |
MS |
3200 |
30.02 |
09:32:47 |
IBM |
6800 |
174.97 |
09:35:26 |
IBM |
5400 |
175.2 |
09:34:16 |
C |
1300 |
50.76 |
09:34:26 |
C |
2500 |
50.32 |
09:38:12 |
C |
8800 |
51.29 |
select count(sym) as counts from t1 group by sym;
sym |
counts |
---|---|
C |
4 |
MS |
3 |
IBM |
2 |
select top 2:3 count(sym) as counts from t1 group by sym;
sym |
counts |
---|---|
IBM |
2 |
select max(price), sym from t1 group by sym, minute(timestamp)
minute_timestamp |
max_price |
sym |
---|---|---|
09:34m |
50.76 |
C |
09:36m |
30.02 |
MS |
09:32m |
174.97 |
IBM |
09:35m |
175.23 |
IBM |
09:38m |
51.29 |
C |
select avg(qty) from t1 group by sym;
sym |
avg_qty |
---|---|
C |
3700 |
MS |
2400 |
IBM |
6100 |
select wavg(price, qty) as vwap, sum(qty) from t1 group by sym;
sym |
vwap |
sum_qty |
---|---|---|
C |
50.828378 |
14800 |
IBM |
175.085082 |
12200 |
MS |
29.726389 |
7200 |
select wsum(price, qty) as dollarVolume, sum(qty) from t1 group by minute(timestamp) as ts;
ts |
dollarVolume |
sum_qty |
---|---|---|
09:32m |
1.189796e+006 |
6800 |
09:34m |
300908 |
6000 |
09:35m |
946242 |
5400 |
09:36m |
214030 |
7200 |
09:38m |
451352 |
8800 |
select sum(qty) from t1 group by sym, timestamp.minute() as minute;
sym |
minute |
sum_qty |
---|---|---|
C |
09:34m |
6000 |
C |
09:38m |
8800 |
IBM |
09:32m |
6800 |
IBM |
09:35m |
5400 |
MS |
09:36m |
7200 |
When group by is used with order by, the order by column(s) can be one or more of group by columns, or one or more of the new columns generated by group by calculation.
select sum(qty) from t1 group by sym, timestamp.minute() as minute order by minute;
sym |
minute |
sum_qty |
---|---|---|
C |
09:00m |
1100 |
IBM |
09:32m |
6800 |
C |
09:34m |
6000 |
IBM |
09:35m |
5400 |
MS |
09:36m |
7200 |
C |
09:38m |
8800 |
In the examples above, the result of the function for each group is a scalar. In other cases, a function may output a vector or a dictionary rather than a scalar. For examples, the function stat outputs a dictionary, and the function ols outputs a coefficient vector or a dictionary that includes parameter estimates, t-stat, R square, etc. To output the results in multiple columns, we need to convert the vector/dictionary output to multiple scalars. In other words, we need to convert a composite column into multiple columns. This can be accomplished with “as” keyword and a constant string vector for column names.
$ y=1..15
$ factor1=3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
$ factor2=1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
$ id=take(1 2 3, 15).sort();
$ t=table(id, y, factor1, factor2);
$ t;
id |
y |
factor1 |
factor2 |
---|---|---|---|
1 |
1 |
3.2 |
1.7 |
1 |
2 |
1.2 |
1.3 |
1 |
3 |
5.9 |
4.2 |
1 |
4 |
6.9 |
6.8 |
1 |
5 |
11.1 |
9.2 |
2 |
6 |
9.6 |
1.3 |
2 |
7 |
1.4 |
1.4 |
2 |
8 |
7.3 |
7.8 |
2 |
9 |
2 |
7.9 |
2 |
10 |
0.1 |
9.9 |
3 |
11 |
6.1 |
9.3 |
3 |
12 |
2.9 |
4.6 |
3 |
13 |
6.3 |
7.8 |
3 |
14 |
8.4 |
2.4 |
3 |
15 |
5.6 |
8.7 |
select ols(y,(factor1,factor2),true,0) as `int`factor1`factor2 from t group by id;
id |
int |
factor1 |
factor2 |
---|---|---|---|
1 |
1.063991 |
-0.258685 |
0.732795 |
2 |
6.886877 |
-0.148325 |
0.303584 |
3 |
11.833867 |
0.272352 |
-0.065526 |
select ols(y,(factor1,factor2),true,2).Coefficient.tstat[1:] as `t1`t2 from t group by id;
id |
t1 |
t2 |
---|---|---|
1 |
-0.891868 |
2.253451 |
2 |
-5.73315 |
11.433117 |
3 |
0.510866 |
-0.183903 |
To skip certain elements from the output of the function, we can leave the names of the corresponding elements empty.
select ols(y,(factor1,factor2),true,2).Coefficient.beta as ``factor1`factor2 from t group by id;
id |
factor1 |
factor2 |
---|---|---|
1 |
-0.258685 |
0.732795 |
2 |
-0.148325 |
0.303584 |
3 |
0.272352 |
-0.065526 |
To customize the output, we can write a simple wrapper. In the following example, the output includes coefficient estimates and R square:
$ def myols(y,x) {
$ r=ols(y,x,true,2)
$ return r.Coefficient.beta join r.RegressionStat.statistics[0]
$ }
$ select myols(y,(factor1,factor2)) as `int`factor1`factor2`R2 from t group by id;
id |
int |
factor1 |
factor2 |
R2 |
---|---|---|---|---|
1 |
1.063991 |
-0.258685 |
0.732795 |
0.946056 |
2 |
6.886877 |
-0.148325 |
0.303584 |
0.992413 |
3 |
11.833867 |
0.272352 |
-0.065526 |
0.144837 |