Importing large text files into a database is a common task, and it can be the bottleneck of an entire data analysis workflow. To speed up the process, we can load a file in parallel.
To load USstocks.csv in parallel:
USstocks = ploadText("C:/DolphinDB/Data/USstocks.csv");
In comparison, to load the file in a non-parallel fashion:
USstocks = loadText("C:/DolphinDB/Data/USstocks.csv");
Loading a file in parallel is generally faster than in a non-parallel fashion.
>timer USstocks = ploadText("C:/DolphinDB/Data/USstocks.csv");
Time elapsed: 22415.377 ms
>timer USstocks = loadText("C:/DolphinDB/Data/USstocks.csv");
Time elapsed: 59841.574 ms
Loading a text file in parallel requires available memory of at least twice the size of the resulting table, whereas loading it in a non-parallel fashion only requires available memory larger than the size of the table. If we need to import a data file that would produce a table larger than available memory, we can save the data as a partitioned table in a distributed database. Please see Example 2.
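For files too large to fit in memory, the function loadTextEx can import a text file directly into a partitioned table, writing data chunk by chunk so that memory usage stays bounded. The sketch below is illustrative only: the database path dfs://USstocks, the date range in the partition scheme, and the partitioning column date are assumptions that must be adjusted to the actual data.

```
// a minimal sketch, assuming the file has a "date" column covering 2010
// create a distributed database partitioned by date (hypothetical scheme)
db = database("dfs://USstocks", VALUE, 2010.01.01..2010.12.31)
// load the CSV into the partitioned table "stocks", partitioned on "date"
USstocks = loadTextEx(db, "stocks", `date, "C:/DolphinDB/Data/USstocks.csv")
```

Unlike ploadText, which builds the entire table in memory, loadTextEx appends each chunk to the partitioned table as it is read, so the full table never needs to fit in memory at once.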
When the system loads an external text file, it randomly samples each column and infers its likely data type.
We recommend checking the inferred data types of all columns with the function extractTextSchema before loading. In this example, because some values of the RET column are 'C', the system infers the data type of RET as STRING.
schema = extractTextSchema("c:/DolphinDB/Data/USstocks.csv");
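The result returned by extractTextSchema is an in-memory table with a name column and a type column, one row per column of the text file. It can be queried like any other table to spot misinferred types (a sketch, assuming the file contains the column RET):

```
// list the inferred type of the RET column
select name, type from schema where name=`RET
```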
The following script changes the data type of RET to DOUBLE, and loads the text file into DolphinDB with the correct schema.
update schema set type=`DOUBLE where name=`RET;
USstocks = ploadText("c:/DolphinDB/Data/USstocks.csv", , schema);
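To confirm that the override took effect, we can inspect the loaded table's column definitions with the built-in schema function (a sketch; colDefs is the table-metadata entry returned by schema):

```
// RET should now appear with type DOUBLE
schema(USstocks).colDefs
```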
The first 20 observations of the table USstocks in the Data Browser window: