where

The where clause is used to extract only the records that satisfy the specified condition(s).

A SQL query with a where clause first loads and filters the columns specified in the where conditions, then retrieves data from other columns corresponding to the filtered results. When the columns specified in the where conditions account for a large proportion of the queried data, you can specify keyword [HINT_PRELOAD] to preload all data into memory before filtering to improve the query performance.

Examples

$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
$ t1 = table(timestamp, sym, qty, price);
$ t1;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

09:34:16

C

1300

50.76

09:34:26

C

2500

50.32

09:38:12

C

8800

51.29

Use where clause with one condition:

select * from t1 where sym=`IBM;

timestamp

sym

qty

price

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

select * from t1 where timestamp.minute()>=09:36m;

timestamp

sym

qty

price

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:38:12

C

8800

51.29

In where clause, “==” is equivalent to “=”.

select * from t1 where sym==`IBM;

timestamp

sym

qty

price

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

Use where clause with multiple conditions:

select * from t1 where sym=`IBM and qty>=2000 or timestamp>09:37:00;

timestamp

sym

qty

price

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

09:38:12

C

8800

51.29

For multiple conditions in a where clause, symbols “and”, “&&” and “,” can be used to perform logical “AND” operation, and symbols “or” and “||” can be used to perform logical “OR” operation.

select * from t1 where qty>=2000, timestamp.minute()>=09:36m;

timestamp

sym

qty

price

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:38:12

C

8800

51.29

select * from t1 where qty>=2000 and timestamp.minute()>=09:36m;

timestamp

sym

qty

price

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:38:12

C

8800

51.29

select * from t1 where qty>=2000 && timestamp.minute()>=09:36m;

timestamp

sym

qty

price

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:38:12

C

8800

51.29

Differences between comma “,” and “and/&&”:

“and” is equivalent to “&&”. Both test each condition on the entire table. In contrast, when we use “,” in the conditions, data is filtered by the first condition, then the next condition is applied to the filtered dataset, and so on. This reduces the search footprint. Therefore, comma “,” is preferred to “and/&&” in a where statement when some of the conditions are very specific. We should use the most specific conditions first when using commas in the conditions.

In the following example, with a very restrictive condition “id=1” as only about 0.1% of the records satisfy this condition, using “,” in the where clause takes only about 50% of the time as using “and/&&”.

Use where clause with a function:

To select records whose prices are higher than the average price, function avg is applied to all records in the table.

select * from t1 where price>avg(price);

timestamp

sym

qty

price

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.23

To calculate the average price for each stock, we can use the template contextby. The example below selects the records with price greater than the average price for each stock.

select * from t1 where price>contextby(avg, price, sym) order by sym, price;

timestamp

sym

qty

price

09:34:16

C

1300

50.76

09:38:12

C

8800

51.29

09:35:26

IBM

5400

175.23

09:36:59

MS

3200

30.02

We can take a random sample of a number of partitions in a partitioned table. For details please check function sample .

$ n=1000000
$ ID=rand(50, n)
$ x=rand(1.0, n)
$ t=table(ID, x)
$ db=database("dfs://rangedb1", RANGE,  0 10 20 30 40 50)
$ pt = db.createPartitionedTable(t, `pt, `ID)
$ pt.append!(t)
$ pt=loadTable(db,`pt);

Table pt has 5 partitions. To take a random sample of 2 partitions, we can use either of the following queries:

x = select * from pt where sample(ID, 0.4);

x = select * from pt where sample(ID, 2);

Use where clause with a nested SQL statement:

$ t1 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 2022.01.01 + 1..6 as date);
$ t2 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.8 2.3 3.7 3.1 4.2 2.8 as price)
$ select count(*) from t1 where sym in select sym from t2 where price > 3
3

$ t3 = table(`APPL`AMZN`IBM`IBM`AAPL`AMZN as sym, 1.9 2.1 2.5 2.6 2.7 3.2 as price)
$ select * from t2 where price > select avg(price) from t3

sym

price

IBM

3.7

IBM

3.1

AAPL

4.2

AMZN

2.8