context by

context by is a unique feature in DolphinDB. It is an extension to standard SQL for convenient time-series data manipulation.

The traditional relational database doesn’t support time series data processing. In RDBMS, a table is a set of rows and the order of rows is not modeled. We can apply aggregate functions such as min, max, avg, stdev to a group of rows, but we can’t use order sensitive aggregate functions such as first, last, etc, or order sensitive vector functions such as cumsum, cummax, ratios, deltas on groups of rows. DolphinDB supports time series data processing. The context by clause make it very convenient to perform time series data processing within each group.

Both context by and group by conduct grouping. However, with group by, each group returns a scalar value; with context by, each group returns a vector of the same size as the group’s records. The group by clause can only be used with aggregate functions, whereas the context by clause can be used with aggregate functions, moving window functions, cumulative functions, etc. The context by clause is often used with update statement. The context by clause can also be used together with the having clause. Please refer to the section about having.

context by is often used together with time-series functions such as cumsum and mavg etc. The results of these functions are affected by the order of rows within each context by group. We can use keyword csort after context by. csort sorts the rows within each context by group before conducting calculations in select statements. csort can use multiple columns including calculated columns and can sort in both ascending order (asc, the default sorting order) and descending order (desc). csort can be used with top clause to get the most recent observations within each group.

The example below illustrates the difference between group by and context by.

sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
t1 = table(timestamp, sym, qty, price);

t1;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

09:34:16

C

1300

50.76

09:34:26

C

2500

50.32

09:38:12

C

8800

51.29

select wavg(price,qty) as wvap, sum(qty) as totalqty from t1 group by sym;

sym

wvap

totalqty

C

50.828378

14800

IBM

175.085082

12200

MS

29.726389

7200

select sym, price, qty, wavg(price,qty) as wvap, sum(qty) as totalqty from t1 context by sym;

sym

price

qty

wvap

totalqty

C

49.6

2200

50.828378

14800

C

50.76

1300

50.828378

14800

C

50.32

2500

50.828378

14800

C

51.29

8800

50.828378

14800

IBM

174.97

6800

175.085082

12200

IBM

175.23

5400

175.085082

12200

MS

29.46

1900

29.726389

7200

MS

29.52

2100

29.726389

7200

MS

30.02

3200

29.726389

7200

To calculate stock returns for each firm, we cannot use group by. Instead we can use context by. We need to make sure the records are sorted appropriately within each group before we use context by.

$ select sym, timestamp, price, eachPre(\,price)-1.0 as ret from t1 context by sym;

sym

timestamp

price

ret

C

09:34:07

49.6

C

09:34:16

50.76

0.023387

C

09:34:26

50.32

-0.008668

C

09:38:12

51.29

0.019277

IBM

09:32:47

174.97

IBM

09:35:26

175.23

0.001486

MS

09:36:42

29.46

MS

09:36:51

29.52

0.002037

MS

09:36:59

30.02

0.016938

We can use template function contextby (:X) for the same calculation, but the result is a vector instead of a table.

$ contextby(eachPre{ratio}, t1.price, t1.sym);
[,,1.002037,1.016938,,1.001486,1.023387,0.991332,1.019277]

Here we use a partial application eachPre{ratio}. Please refer to Partial Application for details.

Calculate cumulative sum of trading volume for each stock in every minute:

select *, cumsum(qty) from t1 context by sym, timestamp.minute();

timestamp

sym

qty

price

cumsum_qty

09:34:07

C

2200

49.6

2200

09:34:16

C

1300

50.76

3500

09:34:26

C

2500

50.32

6000

09:38:12

C

8800

51.29

8800

09:32:47

IBM

6800

174.97

6800

09:35:26

IBM

5400

175.23

5400

09:36:42

MS

1900

29.46

1900

09:36:51

MS

2100

29.52

4000

09:36:59

MS

3200

30.02

7200

Use top clause with context by clause:

select top 2 * from t1 context by sym;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:34:16

C

1300

50.76

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

Please note that we cannot specify a range for the top clause when it is used with the context by clause:

$ select top 2:3 * from t1 context by sym;
Syntax Error: [line #2] When top clause uses together with context clause in SQL query, can't specify a range in top clause

Use top clause and csort clause together with context by clause to get the most recent 2 records for each stock:

$ select top 2 * from t1 context by sym csort timestamp desc;

timestamp

sym

qty

price

09:38:12

C

8800

51.29

09:34:26

C

2500

50.32

09:35:26

IBM

5400

175.23

09:32:47

IBM

6800

174.97

09:36:59

MS

3200

30.02

09:36:51

MS

2100

29.52

Use context by clause together with limit clause to get the first or the last few records for each stock:

$ select * from t1 context by sym limit 2;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

09:34:16

C

1300

50.76

Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:

$ select * from t1 context by sym limit -2;

timestamp

sym

qty

price

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

09:34:26

C

2500

50.32

09:38:12

C

8800

51.29

Use context by clause together with csort and limit clause to get the last 2 records for each stock after sorting by qty:

$ select * from t1 context by sym csort qty limit -2;

timestamp

sym

qty

price

09:34:26

C

2500

50.32

09:38:12

C

8800

51.29

09:35:26

IBM

5400

175.23

09:32:47

IBM

6800

174.97

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

Calculate fitted values of price from the regression of price on qty for each stock:

$ select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym;

timestamp

sym

qty

price

fittedPrice

09:34:07

C

2200

49.6

50.282221

09:34:16

C

1300

50.76

50.156053

09:34:26

C

2500

50.32

50.324277

09:38:12

C

8800

51.29

51.207449

09:32:47

IBM

6800

174.97

174.97

09:35:26

IBM

5400

175.23

175.23

09:36:42

MS

1900

29.46

29.447279

09:36:51

MS

2100

29.52

29.535034

09:36:59

MS

3200

30.02

30.017687

The context by clause can be used with the order by clause. The order by columns must be among the output columns.

$ select *, ols(price, qty)[0]+ols(price, qty)[1]*qty as fittedPrice from t1 context by sym order by timestamp;

timestamp

sym

qty

price

fittedPrice

09:32:47

IBM

6800

174.97

174.97

09:34:07

C

2200

49.6

50.075318

09:34:16

C

1300

50.76

49.911222

09:34:26

C

2500

50.32

50.130017

09:35:26

IBM

5400

175.23

175.23

09:36:42

MS

1900

29.46

29.447279

09:36:51

MS

2100

29.52

29.535034

09:36:59

MS

3200

30.02

30.017687

09:38:12

C

8800

51.29

51.278686

context by is different from the contextby function in 3 aspects:

(1) contextby generates a vector while context by is used in a select clause to produce a table.

(2) contextby is limited to one grouping column whereas a context by clause can be used on multiple columns.

(3) contextby calculates one item for every call, whereas a context by clause can calculate multiple items.

Performance Tip

Before we use context by, we should sort the database by the same variable or variables in the context by clause. This could greatly improve the speed of context by clause.

$ n=1000000
$ ID=rand(100, n)
$ x=rand(10.0, n)
$ ta=table(ID, x)
$ tb=select * from ta order by ID;
$ timer select (NULL \:P x)-1 as ret from ta context by ID;
Time elapsed: 4.018 ms

$ timer select (NULL \:P x)-1 as ret from tb context by ID;
Time elapsed: 2.991 ms

DolphinDB optimizes the performance of context by clause under certain conditions.

To query the latest records of given groups in a partitioned table, use context by clause with keywords csort and limit. The performance of “context by + csort + limit” statement is optimized if the following conditions are satisfied:

1. The context by column is filtered by the where clause.

2. The csort column is a partitioning column, and the partition type is VALUE or RANGE.

3. csort and context by can only specify one column.

4. The context by column is specified in the select clause.

You can obtain the execution plan by adding keyword [HINT_EXPLAIN] to check whether the statement is optimized.