context by

context by is a unique feature in DolphinDB. It is an extension to standard SQL for convenient time-series data manipulation.

The traditional relational database doesn’t support time series data processing. In RDBMS, a table is a set of rows and the order of rows is not modeled. We can apply aggregate functions such as min, max, avg, stdev to a group of rows, but we can’t use order sensitive aggregate functions such as first, last, etc, or order sensitive vector functions such as cumsum, cummax, ratios, deltas on groups of rows. DolphinDB supports time series data processing. The context by clause make it very convenient to perform time series data processing within each group.

Both context by and group by conduct grouping. However, with group by, each group returns a scalar value; with context by, each group returns a vector of the same size as the group’s records. The group by clause can only be used with aggregate functions, whereas the context by clause can be used with aggregate functions, moving window functions, cumulative functions, etc. The context by clause is often used with update statement. The context by clause can also be used together with the having clause. Please refer to the section about having.

context by is often used together with time-series functions such as cumsum and mavg etc. The results of these functions are affected by the order of rows within each context by group. We can use keyword csort after context by. csort sorts the rows within each context by group before conducting calculations in select statements. csort can use multiple columns including calculated columns and can sort in both ascending order (asc, the default sorting order) and descending order (desc). csort can be used with top clause to get the most recent observations within each group.

The example below illustrates the difference between group by and context by.

sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);

t1;

timestamp	sym	qty	price
09:34:07	C	2200	49.6
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:34:16	C	1300	50.76
09:34:26	C	2500	50.32
09:38:12	C	8800	51.29

select wavg(price,qty) as wvap, sum(qty) as totalqty from t1 group by sym;

sym	wvap	totalqty
C	50.828378	14800
IBM	175.085082	12200
MS	29.726389	7200

select sym, price, qty, wavg(price,qty) as wvap, sum(qty) as totalqty from t1 context by sym;

sym	price	qty	wvap	totalqty
C	49.6	2200	50.828378	14800
C	50.76	1300	50.828378	14800
C	50.32	2500	50.828378	14800
C	51.29	8800	50.828378	14800
IBM	174.97	6800	175.085082	12200
IBM	175.23	5400	175.085082	12200
MS	29.46	1900	29.726389	7200
MS	29.52	2100	29.726389	7200
MS	30.02	3200	29.726389	7200

To calculate stock returns for each firm, we cannot use group by. Instead we can use context by. We need to make sure the records are sorted appropriately within each group before we use context by.

$ select sym, timestamp, price, eachPre(\,price)-1.0 as ret from t1 context by sym;

sym	timestamp	price	ret
C	09:34:07	49.6
C	09:34:16	50.76	0.023387
C	09:34:26	50.32	-0.008668
C	09:38:12	51.29	0.019277
IBM	09:32:47	174.97
IBM	09:35:26	175.23	0.001486
MS	09:36:42	29.46
MS	09:36:51	29.52	0.002037
MS	09:36:59	30.02	0.016938

We can use template function contextby (:X) for the same calculation, but the result is a vector instead of a table.

$ contextby(eachPre{ratio}, t1.price, t1.sym);
[,,1.002037,1.016938,,1.001486,1.023387,0.991332,1.019277]

Here we use a partial application eachPre{ratio}. Please refer to Partial Application for details.

Calculate cumulative sum of trading volume for each stock in every minute:

select *, cumsum(qty) from t1 context by sym, timestamp.minute();

timestamp	sym	qty	price	cumsum_qty
09:34:07	C	2200	49.6	2200
09:34:16	C	1300	50.76	3500
09:34:26	C	2500	50.32	6000
09:38:12	C	8800	51.29	8800
09:32:47	IBM	6800	174.97	6800
09:35:26	IBM	5400	175.23	5400
09:36:42	MS	1900	29.46	1900
09:36:51	MS	2100	29.52	4000
09:36:59	MS	3200	30.02	7200

Use top clause with context by clause:

select top 2 * from t1 context by sym;

timestamp	sym	qty	price
09:34:07	C	2200	49.6
09:34:16	C	1300	50.76
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52

Please note that we cannot specify a range for the top clause when it is used with the context by clause:

$ select top 2:3 * from t1 context by sym;
Syntax Error: [line #2] When top clause uses together with context clause in SQL query, can't specify a range in top clause

Use top clause and csort clause together with context by clause to get the most recent 2 records for each stock:

$ select top 2 * from t1 context by sym csort timestamp desc;

timestamp	sym	qty	price
09:38:12	C	8800	51.29
09:34:26	C	2500	50.32
09:35:26	IBM	5400	175.23
09:32:47	IBM	6800	174.97
09:36:59	MS	3200	30.02
09:36:51	MS	2100	29.52

Use context by clause together with limit clause to get the first or the last few records for each stock:

$ select * from t1 context by sym limit 2;

timestamp	sym	qty	price
09:34:07	C	2200	49.6
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:34:16	C	1300	50.76

Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:

$ select * from t1 context by sym limit -2;

timestamp	sym	qty	price
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:34:26	C	2500	50.32
09:38:12	C	8800	51.29

Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:

$ select * from t1 context by sym csort qty limit -2;

timestamp	sym	qty	price
09:34:26	C	2500	50.32
09:38:12	C	8800	51.29
09:35:26	IBM	5400	175.23
09:32:47	IBM	6800	174.97
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02

Calculate fitted values of price from the regression of price on qty for each stock:

$ select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym;

timestamp	sym	qty	price	fittedPrice
09:34:07	C	2200	49.6	50.282221
09:34:16	C	1300	50.76	50.156053
09:34:26	C	2500	50.32	50.324277
09:38:12	C	8800	51.29	51.207449
09:32:47	IBM	6800	174.97	174.97
09:35:26	IBM	5400	175.23	175.23
09:36:42	MS	1900	29.46	29.447279
09:36:51	MS	2100	29.52	29.535034
09:36:59	MS	3200	30.02	30.017687

The context by clause can be used with the order by clause. The order by columns must be among the output columns.

$ select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym order by timestamp;

timestamp	sym	qty	price	fittedPrice
09:32:47	IBM	6800	174.97	174.97
09:34:07	C	2200	49.6	50.075318
09:34:16	C	1300	50.76	49.911222
09:34:26	C	2500	50.32	50.130017
09:35:26	IBM	5400	175.23	175.23
09:36:42	MS	1900	29.46	29.447279
09:36:51	MS	2100	29.52	29.535034
09:36:59	MS	3200	30.02	30.017687
09:38:12	C	8800	51.29	51.278686

context by is different from the contextby function in 3 aspects:

(1) contextby generates a vector while context by is used in a select clause to produce a table.

(2) contextby is limited to one grouping column whereas a context by clause can be used on multiple columns.

(3) contextby calculates one item for every call, whereas a context by clause can calculate multiple items.

Performance Tip

Before we use context by, we should sort the database by the same variable or variables in the context by clause. This could greatly improve the speed of context by clause.

$ n=1000000
$ ID=rand(100, n)
$ x=rand(10.0, n)
$ ta=table(ID, x)
$ tb=select * from ta order by ID;

$ timer select (NULL \:P x)-1 as ret from ta context by ID;
Time elapsed: 4.018 ms

$ timer select (NULL \:P x)-1 as ret from tb context by ID;
Time elapsed: 2.991 ms

DolphinDB optimizes the performance of context by clause under certain conditions.

To query the latest records of given groups in a partitioned table, use context by clause with keywords csort and limit. The performance of “context by + csort + limit” statement is optimized if the following conditions are satisfied:

1. The context by column is filtered by the where clause.

2. The csort column is a partitioning column, and the partition type is VALUE or RANGE.

3. csort and context by can only specify one column.

4. The context by column is specified in the select clause.

You can obtain the execution plan by adding keyword [HINT_EXPLAIN] to check whether the statement is optimized.