group by

We use keyword group by followed by column name(s) to group the records and to apply functions on each group of records. The order of columns in the table generated by a SQL statement is: The grouping columns not specified by the select clause are displayed first, then follow the columns specified by both the group by clause and the select clause. Therefore unlike in standard SQL, we don’t need to include group by columns explicitly in select statements.

Examples

$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
$ t1 = table(timestamp, sym, qty, price);

$ t1;

timestamp	sym	qty	price
09:34:07	C	2200	49.6
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.2
09:34:16	C	1300	50.76
09:34:26	C	2500	50.32
09:38:12	C	8800	51.29

select count(sym) as counts from t1 group by sym;

sym	counts
C	4
MS	3
IBM	2

select top 2:3 count(sym) as counts from t1 group by sym;

sym	counts
IBM	2

select max(price), sym from t1 group by sym, minute(timestamp)

minute_timestamp	max_price	sym
09:34m	50.76	C
09:36m	30.02	MS
09:32m	174.97	IBM
09:35m	175.23	IBM
09:38m	51.29	C

select avg(qty) from t1 group by sym;

sym	avg_qty
C	3700
MS	2400
IBM	6100

select wavg(price, qty) as vwap, sum(qty) from t1 group by sym;

sym	vwap	sum_qty
C	50.828378	14800
IBM	175.085082	12200
MS	29.726389	7200

select wsum(price, qty) as dollarVolume, sum(qty) from t1 group by minute(timestamp) as ts;

ts	dollarVolume	sum_qty
09:32m	1.189796e+006	6800
09:34m	300908	6000
09:35m	946242	5400
09:36m	214030	7200
09:38m	451352	8800

select sum(qty) from t1 group by sym, timestamp.minute() as minute;

sym	minute	sum_qty
C	09:34m	6000
C	09:38m	8800
IBM	09:32m	6800
IBM	09:35m	5400
MS	09:36m	7200

When group by is used with order by, the order by column(s) can be one or more of group by columns, or one or more of the new columns generated by group by calculation.

select sum(qty) from t1 group by sym, timestamp.minute() as minute order by minute;

sym	minute	sum_qty
C	09:00m	1100
IBM	09:32m	6800
C	09:34m	6000
IBM	09:35m	5400
MS	09:36m	7200
C	09:38m	8800

In the examples above, the result of the function for each group is a scalar. In other cases, a function may output a vector or a dictionary rather than a scalar. For examples, the function stat outputs a dictionary, and the function ols outputs a coefficient vector or a dictionary that includes parameter estimates, t-stat, R square, etc. To output the results in multiple columns, we need to convert the vector/dictionary output to multiple scalars. In other words, we need to convert a composite column into multiple columns. This can be accomplished with “as” keyword and a constant string vector for column names.

$ y=1..15
$ factor1=3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
$ factor2=1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
$ id=take(1 2 3, 15).sort();
$ t=table(id, y, factor1, factor2);

$ t;

id	y	factor1	factor2
1	1	3.2	1.7
1	2	1.2	1.3
1	3	5.9	4.2
1	4	6.9	6.8
1	5	11.1	9.2
2	6	9.6	1.3
2	7	1.4	1.4
2	8	7.3	7.8
2	9	2	7.9
2	10	0.1	9.9
3	11	6.1	9.3
3	12	2.9	4.6
3	13	6.3	7.8
3	14	8.4	2.4
3	15	5.6	8.7

select ols(y,(factor1,factor2),true,0) as `int`factor1`factor2 from t group by id;

id	int	factor1	factor2
1	1.063991	-0.258685	0.732795
2	6.886877	-0.148325	0.303584
3	11.833867	0.272352	-0.065526

select ols(y,(factor1,factor2),true,2).Coefficient.tstat[1:] as `t1`t2 from t group by id;

id	t1	t2
1	-0.891868	2.253451
2	-5.73315	11.433117
3	0.510866	-0.183903

To skip certain elements from the output of the function, we can leave the names of the corresponding elements empty.

select ols(y,(factor1,factor2),true,2).Coefficient.beta as ``factor1`factor2 from t group by id;

id	factor1	factor2
1	-0.258685	0.732795
2	-0.148325	0.303584
3	0.272352	-0.065526

To customize the output, we can write a simple wrapper. In the following example, the output includes coefficient estimates and R square:

$ def myols(y,x) {
$   r=ols(y,x,true,2)
$   return r.Coefficient.beta join r.RegressionStat.statistics[0]
$ }

$ select myols(y,(factor1,factor2)) as `int`factor1`factor2`R2 from t group by id;

id	int	factor1	factor2	R2
1	1.063991	-0.258685	0.732795	0.946056
2	6.886877	-0.148325	0.303584	0.992413
3	11.833867	0.272352	-0.065526	0.144837