Example 1. Load large text files




Very often we need to import data in large text files into a database. This could be the bottleneck of the entire data analysis process. To speed up the process, we can load a file in parallel.  



To load USstocks.csv in parallel:



USstocks = ploadText("C:/DolphinDB/Data/USstocks.csv");



In comparison, to load the file in a non-parallel fashion:



USstocks = loadText("C:/DolphinDB/Data/USstocks.csv");



Loading a file in parallel is generally faster than in a non-parallel fashion.



>timer USstocks = ploadText("C:/DolphinDB/Data/USstocks.csv");

Time elapsed: 22415.377 ms


>timer USstocks = loadText("C:/DolphinDB/Data/USstocks.csv");

Time elapsed: 59841.574 ms



Loading a text file in parallel requires a minimum size of available memory that is twice the size of the table. In comparison, loading a text file in a non-parallel fashion requires that the size of the available memory is larger than the size of the table. If we need to import a data file that will produce a table larger than available memory, we can save the data as a partitioned table in a distributed database. Please see Example 2.


When the system loads an external text file, it makes a random sampling of each column and infers its likely data type.


We recommend users check the data types of all the columns in the text file with function extractTextSchema. In this example, as some of the values of RET are 'C', the system determines that the data type of RET is STRING.



schema = extractTextSchema("c:/DolphinDB/Data/USstocks.csv");

schema;


name    type

------- ------

PERMNO  INT

date    DATE

SHRCD   INT

TICKER  SYMBOL

TRDSTAT CHAR

PERMCO  INT

HSICCD  INT

CUSIP   SYMBOL

DLSTCD  SYMBOL

DLPRC   SYMBOL

DLRET   SYMBOL

BIDLO   DOUBLE

ASKHI   DOUBLE

PRC     DOUBLE

VOL     INT

RET     STRING

BID     DOUBLE

ASK     DOUBLE

SHROUT  INT

CFACPR  DOUBLE

CFACSHR DOUBLE

OPENPRC DOUBLE



The following script changes the data type of RET to DOUBLE, and loads the text file into DolphinDB with the correct schema.



update schema set type=`DOUBLE where name=`RET;

USstocks = ploadText("c:/DolphinDB/Data/USstocks.csv", , schema);

 


The first 20 observations of the table USstocks in the Data Browser window:




Download source code here.