Reading a large data file (efficiently)

From LPTMS Wiki
Revision as of 19:23, 16 February 2014 by Landes (I explained how to use "with open" to read a whole data matrix, but in a fast way. Also a word on using the "subprocess" module)


Reading large data files can quickly become a problem.

np.loadtxt('filename') converts a file to an array in a single call, but it is impractical for large files, in particular if the size of your data file exceeds your RAM.
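A rough way to judge whether the data will fit in memory is to estimate the size of the resulting array before reading anything. The sketch below assumes the values are stored as 64-bit floats (the NumPy default); the helper name array_size_bytes is made up for this illustration.

```python
import numpy as np

def array_size_bytes(n_rows, n_cols, dtype=np.float64):
    """Hypothetical helper: memory needed by an (n_rows, n_cols) array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize

# Example: 10 million rows of 12 columns, 8 bytes per value.
size = array_size_bytes(10_000_000, 12)
print(size / 1e9, "GB")
```

With these numbers the array needs a little under 1 GB, which should be compared with the free RAM of the machine rather than with the size of the text file on disk.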

Another, much more efficient way (with the appropriate buffers, etc. handled by Python) is to use "with open(...) as file" and fill a preallocated array line by line.

import numpy as np

filename = 'my_data_file_name.dat'  # path to your data file
N = 10000000
bigMatrix = np.zeros((N, 12))      # same shape as the expected data. Here, we have 12 columns.
                                   # With this N, "bigMatrix" takes about 1 GB (N * 12 * 8 bytes).
iteration = 0
with open(filename, 'r') as f:     # the file is read one line at a time and closed automatically.
    for line in f:
        bigMatrix[iteration] = np.fromstring(line, sep=' ')  # if the column separator is a space " ". Adapt otherwise.
        iteration += 1
        if iteration >= N:         # in order not to exceed the matrix size, if the data is longer than N.
            break
bigMatrix = bigMatrix[:iteration, :]   # in order not to have leftover zeros, if the data is shorter than N.

The only limitation is that you need to specify the shape (in particular the number of columns) in advance. If you analyze many files in a format that you designed yourself, this is usually not a problem.
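If the number of columns is not known beforehand, one option is to take it from the first line of the file. The following is a minimal sketch, assuming whitespace-separated numeric columns and the same number of columns on every line; the function name read_matrix and its parameters are hypothetical.

```python
import numpy as np

def read_matrix(path, max_rows):
    """Hypothetical helper: read at most max_rows whitespace-separated rows into an array."""
    with open(path, 'r') as f:
        first = f.readline().split()
        if not first or max_rows < 1:
            return np.zeros((0, len(first)))     # nothing to read
        data = np.zeros((max_rows, len(first)))  # column count taken from the first line
        data[0] = np.array(first, dtype=float)
        rows = 1
        for line in f:
            if rows >= max_rows:                 # do not exceed the preallocated size
                break
            fields = line.split()
            if not fields:                       # skip blank lines
                continue
            data[rows] = np.array(fields, dtype=float)
            rows += 1
    return data[:rows]                           # drop the unused rows
```

A call such as read_matrix('my_data_file_name.dat', 10000000) then returns an array with as many columns as the first line contains. The number of rows still has to be bounded in advance, since the array is allocated before the file is read.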

A possible way to circumvent the problem of choosing N in advance is to run something like

import subprocess

# "wc -l" prints "<line count> <file name>"; this requires a Unix-like system.
output = subprocess.check_output(['wc', '-l', 'my_data_file_name.dat'])
number_of_lines_in_file = int(output.split()[0])  # first field, converted to an integer usable as N

and then use the resulting line count as N.
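On systems where wc is not available, the line count can also be obtained in pure Python. This is a sketch of one possible approach, not the method used above: it reads the file in binary chunks and counts newline characters. The name count_lines and the chunk size are arbitrary choices made for this example.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Hypothetical helper: count newline characters in a file, chunk by chunk."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)   # at most chunk_size bytes are held in memory
            if not chunk:                # empty bytes object: end of file
                break
            count += chunk.count(b'\n')
    return count
```

Like wc -l, this counts newline characters, so a last line without a trailing newline is not included. Since the reading loop above truncates the matrix to the rows actually read, using count_lines(path) + 1 as N is a safe upper bound in that case.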