summary

Syntax

summary(X,[interpolation],[characteristic],[percentile],[precision],[partitionSampling])

Details

summary generates summary statistics for the input data. It returns an in-memory table containing the minimum, maximum, count, mean, standard deviation, and specified percentiles in ascending order.

Notes:

  • If X is a table, summary only computes statistics for the numeric columns.

  • If X is a data source, it can only contain numeric columns, otherwise an error will occur during computation.

Arguments

X can be an in-memory table, DFS table or data source generated from sqlDS. Note that data sources with SQL metacode containing table joins are currently not supported.

interpolation is a string indicating the interpolation method for percentiles. It can be “linear” (default), “nearest”, “lower”, “higher” and “midpoint”.

characteristic is a string scalar or vector indicating the characteristics to compute. It can be “avg” and/or “std”. Default is both characteristics.

percentile is a DOUBLE vector of percentiles to compute. Each vector element falls between 0 ang 100. The default is [25,50,75], which returns the 25th, 50th, and 75th percentiles.

precision is a DOUBLE scalar greater than 0. The default value is 1e-3, which means the iteration for computing statistics will stop when the difference between the current and previous result is less than or equal to 1e-3. It is recommended to set precision between [1e-3, 1e-9] - small enough for adequate precision but not too small to impact performance through excessive iterations.

partitionSampling can be a positive integer specifying the number of partitions to sample, or a float between (0, 1] specifying the sampling ratio. If not specified, statistics are computed on all partitions. When specifying partitionSampling, note:

  • For a partitioned table:

    • at least one partition will always be sampled. If the sampling ratio * total partitions < 1, one partition is sampled.

    • The sampling ratio is rounded down if the sampling ratio * total partitions is not an integer. E.g. if ratio=0.26 and total partitions is 10, 2 partitions are sampled.

    • If partitionSampling (integer) > total partitions, all partitions are used.

  • partitionSampling has no effect for non-partitioned tables.

Examples

$ n=2022
$ data=1..n
$ value=take(1..3,n)
$ name=take(`APPLE`IBM`INTEL,n)
$ t=table(data,value,name);
$ summary(t, precision=0.001);
// name is not a numeric column and therefore will not be output
name min max nonNullCount count avg std percentile
data 1 2,022 2,022 2,022 1,011.5 583.8454 [506.24,1011.50,1516.75]
value 1 3 2,022 2,022 2 0.8167 [1.00,1.99,2.99]
$ n = 5000
$ data1 = take(1..5000000, n)
$ data2 = rand(10000000, n)
$ data3 = take("A" + string(0..10), n)

$ t = table(data1, data2, data3)
$ dbname = "dfs://summary"
$ if(existsDatabase(dbname)) {
$     dropDatabase(dbname)
$ }
$ db = database(dbname, HASH, [INT, 10])
$ pt = createPartitionedTable(db, t, `pt, `data1)
$ pt.append!(t)

$ ds = sqlDS(<select data1,data2 from loadTable(db, `pt)>)
$ query_percentile = [25,50,75,90]

$ ds_re1 = summary(ds);
//returns the 25th, 50th, 75th and 90th percentiles
$ ds_re2 = summary(ds, percentile=query_percentile, precision=0.0001);
// the partition sampling ratio is 0.6. As there are 10 partitions in total,  6 partitions will be sampled for statistics computation
$ ds_re3 = summary(loadTable(db, `pt), percentile=query_percentile, precision=0.0001, partitionSampling=0.6);