loadText

Syntax

loadText(filename, [delimiter], [schema], [skipRows=0], [arrayDelimiter], [containHeader])

Arguments

filename is the input text file name with its absolute path.

delimiter is a STRING scalar indicating the table column separator. It can consist of one or more characters; the default is a comma (',').

schema is a table. It can have the following columns, of which the "name" and "type" columns are required.

column name   data type              meaning
-----------   --------------------   -------------------------------
name          STRING scalar          column name
type          STRING scalar          data type (BLOB not supported)
format        STRING scalar          the format of temporal columns
col           INT scalar or vector   the columns to be loaded

Note: If "type" specifies a temporal data type, the format of the source data must be compatible with a DolphinDB temporal data type. If it is not, specify the column type as STRING when loading the data and convert it to a DolphinDB temporal data type with the temporalParse function afterwards.
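A minimal sketch of this workaround (the file path "D:/trades.csv", the column name "tradeDate", and the "dd-MMM-yyyy" source format are hypothetical):

$ schema=extractTextSchema("D:/trades.csv")
$ // load the temporal column as STRING first
$ update schema set type="STRING" where name="tradeDate"
$ t=loadText("D:/trades.csv",,schema)
$ // then convert it to a DolphinDB temporal type with temporalParse
$ replaceColumn!(t, `tradeDate, temporalParse(t.tradeDate, "dd-MMM-yyyy"))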

skipRows is an integer between 0 and 1024 indicating the number of rows at the beginning of the text file to be ignored. The default value is 0.

arrayDelimiter is a single character indicating the delimiter within columns that hold array vectors in the file. Before import, you must use the schema parameter to set the type of such columns to the corresponding array vector data type.

containHeader is a Boolean value indicating whether the file contains a header row. The default value is NULL.
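For example, a minimal sketch of loading a headerless file (the file "D:/quotes_no_header.csv" is hypothetical); with containHeader=false the system names the columns col0, col1, …:

$ t=loadText(filename="D:/quotes_no_header.csv", containHeader=false)
$ // the column names default to col0, col1, ...
$ schema(t).colDefs;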

Details

Load a text file into memory as a table. loadText loads data in a single thread. To load data in multiple threads, use ploadText.
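For example, a one-line sketch that loads the stock.csv file generated in the Examples section with ploadText instead:

$ // ploadText takes the same arguments as loadText but loads the file in parallel
$ tt=ploadText("C:/DolphinDB/Data/stock.csv");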

  • How a header row is determined:

    • When containHeader is NULL, the first row of the file is read as strings and the column names are determined from it. Note that the first row may be at most 256 KB. If none of the columns in the first row starts with a number, the first row is treated as the header containing the column names of the text file; if at least one column in the first row starts with a number, the system uses col0, col1, … as the column names;

    • When containHeader is true, the first row is treated as the header row, and the column names are parsed from it;

    • When containHeader is false, the system uses col0, col1, … as the column names.

  • How the column types are determined:

When loading a text file, the system determines the data type of each column based on a random sample of rows. This convenient feature may not always determine the data type of every column correctly. We therefore recommend checking the data types determined by the system with the extractTextSchema function before loading. If a column does not have the expected data type, enter the correct data type for that column in the schema table (see Ex 2); you can also specify the data types of all columns. For a temporal column that does not have the expected data type, also specify a format such as "MM/dd/yyyy" in the schema table (see Ex 5). For details about temporal formats, refer to Parsing and Format of Temporal Variables.

As strings in DolphinDB are encoded in UTF-8, input text files must also be encoded in UTF-8.

Column names in DolphinDB must contain only letters, numbers, or underscores and must start with a letter. If a column name in the text file does not meet these requirements, the system automatically adjusts it:

  • If the column name contains characters other than letters, numbers or underscores, these characters are converted into underscores.

  • If the column name does not start with a letter, the letter "c" is prepended to it.

A few examples:

Column name in data files   Adjusted column name
-------------------------   --------------------
1_test                      c1_test
test-a!                     test_a_
[test]                      c_test_
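As a sketch, loading a hypothetical file "D:/bad_header.csv" whose first line is 1_test,test-a!,[test] would yield the adjusted column names shown above (containHeader=true is needed because "1_test" starts with a number, so the first row would otherwise not be treated as a header):

$ t=loadText("D:/bad_header.csv", containHeader=true)
$ // the name column is expected to show c1_test, test_a_ and c_test_
$ schema(t).colDefs;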

Examples

Use the following script to generate the data file to be used for the examples:

$ n=10
$ sym=rand(`AAPL`ORCL`MS`SUN,n)
$ permno=take(10001,n)
$ date=rand(2019.06.01..2019.06.10,n)
$ open=rand(100.0,n)
$ high=rand(200.0,n)
$ close=rand(200.0,n)
$ pre_close=rand(200.0,n)
$ change=rand(100.0,n)
$ vol=rand(10000,n)
$ amount=rand(100000.0,n)
$ t=table(sym,permno,date,open,high,close,pre_close,change,vol,amount)
$ saveText(t,"C:/DolphinDB/Data/stock.csv");

Ex 1. Use loadText without specifying any optional parameters:

$ tt=loadText("C:/DolphinDB/Data/stock.csv");

$ tt;

sym    permno   date         open        high         close        pre_close    change      vol    amount
----   ------   ----------   ---------   ----------   ----------   ----------   ---------   ----   ------------
MS     10001    2019.06.06   90.346594   80.530542    96.474428    146.305659   0.720236    1045   90494.568297
AAPL   10001    2019.06.07   91.165315   8.482074     85.514922    16.259077    76.797829   7646   91623.485996
AAPL   10001    2019.06.03   45.361885   14.077451    149.848419   89.110375    45.499145   9555   98171.601654
MS     10001    2019.06.04   8.98688     0.591778     155.54643    132.423187   69.95799    1202   3512.927634
MS     10001    2019.06.07   62.866173   33.465237    174.20712    102.695818   74.580523   3524   61943.64517
MS     10001    2019.06.09   32.819915   13.319577    136.729618   63.980405    60.66375    7078   85138.216568
MS     10001    2019.06.07   90.210866   22.728777    150.212291   59.454705    73.916303   5306   19883.845607
AAPL   10001    2019.06.06   83.752686   71.3501      98.211979    145.60098    94.428343   8852   9236.020781
ORCL   10001    2019.06.01   81.64719    129.702202   182.784373   117.575967   74.84595    2942   43394.871242
AAPL   10001    2019.06.02   10.068382   80.875383    181.674585   138.783821   25.298267   1088   82981.043775

$ schema(tt).colDefs;

name        typeString   typeInt   comment
---------   ----------   -------   -------
sym         SYMBOL       17
permno      INT          4
date        DATE         6
open        DOUBLE       16
high        DOUBLE       16
close       DOUBLE       16
pre_close   DOUBLE       16
change      DOUBLE       16
vol         INT          4
amount      DOUBLE       16

Ex 2. Specify the data type of a column before loading the file.

Suppose we want to change the data type of column "permno" to SYMBOL. To do so, we use function extractTextSchema to get the schema table, update it, and then load the text file with the revised schema table.

$ schema=extractTextSchema("C:/DolphinDB/Data/stock.csv");
$ update schema set type=`SYMBOL where name=`permno;
$ tt=loadText("C:/DolphinDB/Data/stock.csv",,schema);
$ schema(tt).colDefs;

name        typeString   typeInt   comment
---------   ----------   -------   -------
sym         SYMBOL       17
permno      SYMBOL       17
date        DATE         6
open        DOUBLE       16
high        DOUBLE       16
close       DOUBLE       16
pre_close   DOUBLE       16
change      DOUBLE       16
vol         INT          4
amount      DOUBLE       16

You can also specify the data types of all columns:

$ schematable=table(`sym`permno`date`open`high`close`pre_close`change`vol`amount as name,`SYMBOL`SYMBOL`DATE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`INT`DOUBLE as type)
$ tt=loadText("C:/DolphinDB/Data/stock.csv",,schematable)
$ schema(tt).colDefs;

name        typeString   typeInt   comment
---------   ----------   -------   -------
sym         SYMBOL       17
permno      SYMBOL       17
date        DATE         6
open        DOUBLE       16
high        DOUBLE       16
close       DOUBLE       16
pre_close   DOUBLE       16
change      DOUBLE       16
vol         INT          4
amount      DOUBLE       16

Ex 3. Load only a subset of columns.

For example, we may only need to load the following 7 columns: sym, date, open, high, close, vol, amount. Please note that we cannot change the order of columns when loading data. To change the order of columns in the loaded table, use function reorderColumns! (see the sketch after the output below).

$ schema=extractTextSchema("C:/DolphinDB/Data/stock.csv");
$ schema=select * from schema where name in `sym`date`open`high`close`vol`amount
$ schema[`col]=[0,2,3,4,5,8,9]

$ tt=loadText("C:/DolphinDB/Data/stock.csv",,schema);
$ tt;

sym    date         open        high         close        vol    amount
----   ----------   ---------   ----------   ----------   ----   ------------
SUN    2019.06.10   18.675316   72.754005    136.463909   1376   31371.319038
AAPL   2019.06.05   42.098717   196.873587   41.513899    3632   9950.864129
ORCL   2019.06.05   62.223474   197.099027   123.785675   3069   38035.800937
SUN    2019.06.03   0.18163     50.669866    4.652098     6213   1842.198893
SUN    2019.06.06   32.54134    67.012502    130.312294   4891   55744.156823
SUN    2019.06.07   56.899091   81.709825    61.786176    1133   69057.849515
AAPL   2019.06.08   77.026838   38.504431    22.68496     3672   34420.187073
ORCL   2019.06.07   62.752656   39.33621     48.483091    4382   41601.601639
AAPL   2019.06.02   8.5487      17.623418    141.88325    8092   15449.159988
AAPL   2019.06.02   26.178685   197.320455   110.52407    5541   14616.820449
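As mentioned above, reorderColumns! can rearrange the columns of the loaded table. A minimal sketch (the column order below is just an illustration):

$ // move the date column to the front of the table loaded in this example
$ reorderColumns!(tt, `date`sym`open`high`close`vol`amount)
$ tt;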

Ex 4. Skip the first 2 rows when loading.

Please note that as the header is the first line of the text file, it is also skipped.

$ re=loadText(filename="C:/DolphinDB/Data/stock.csv",skipRows=2)
$ select count(*) from re;

count
-----
9

Ex 5. Specify the temporal format when loading the file.

Generate the text file to be used:

$ time=["20190623145457","20190623155423","20190623163025"]
$ sym=`AAPL`MS`IBM
$ qty=2200 5400 8670
$ price=54.78 59.64 65.23
$ t=table(time,sym,qty,price)
$ saveText(t,"C:/DolphinDB/Data/t2.csv");

Obtain the text schema with extractTextSchema before loading the file:

$ extractTextSchema("C:/DolphinDB/Data/t2.csv");

name    type
-----   ------
time    LONG
sym     SYMBOL
qty     INT
price   DOUBLE

From the example above, if we load this text file without specifying the format of column "time", the column will be empty because the system cannot parse the raw data correctly. In this scenario we must specify the format of the column.

$ schema=extractTextSchema("C:/DolphinDB/Data/t2.csv")
$ update schema set type = "DATETIME" where name = "time"
$ schema[`format]=["yyyyMMddHHmmss",,,];

$ loadText("C:/DolphinDB/Data/t2.csv",,schema);

time                  sym    qty    price
-------------------   ----   ----   -----
2019.06.23T14:54:57   AAPL   2200   54.78
2019.06.23T15:54:23   MS     5400   59.64
2019.06.23T16:30:25   IBM    8670   65.23

Ex 6. Load array vectors.

Create a CSV file:

$ bid = array(DOUBLE[], 0, 20).append!([1.4799 1.479 1.4787, 1.4796 1.479 1.4784, 1.4791 1.479 1.4784])
$ ask = array(DOUBLE[], 0, 20).append!([1.4821 1.4825 1.4828, 1.4818 1.482 1.4821, 1.4814 1.4818 1.482])
$ TradeDate = 2022.01.01 + 1..3
$ SecurityID = rand(`APPL`AMZN`IBM, 3)
$ t = table(SecurityID as `sid, TradeDate as `date, bid as `bid, ask as `ask)
$ t;
$ saveText(t,filename="D:/t.csv",delimiter=',',append=true)

Load with loadText:

$ path = "D:/t.csv"
$ schema=extractTextSchema(path);
$ update schema set type = "DOUBLE[]" where name="bid" or name ="ask"
$ t = loadText(path, schema=schema, arrayDelimiter=",")
$ t;

sid    date         bid                     ask
----   ----------   ---------------------   ----------------------
AMZN   2022.01.02   [1.4799,1.479,1.4787]   [1.4821,1.4825,1.4828]
AMZN   2022.01.03   [1.4796,1.479,1.4784]   [1.4818,1.482,1.4821]
IBM    2022.01.04   [1.4791,1.479,1.4784]   [1.4814,1.4818,1.482]