Building a dataset from a large number of CSV files

Hi there!

I am trying to build a dataset for further NN training.

I want to prepare a training dataset from files that are stored separately, and my question, for now, is how to do this efficiently.

Data: I have a lot of data, more than one million files. Each CSV file represents one case and contains the following columns:

y1 | x_val | Z1 | Z2 | … | y2

where:
y1 - input array, about 1000 elements per file,
x_val - another input array (the same size),
Z1, Z2, … - additional columns that can be used as features for training the NN in the future,
y2 - output for this case (the same size as the first input array).

The main idea is to train the NN using y1 and x_val as inputs to predict y2 as the output.

I use the following code to build a dataset from the separate files:

import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Define the path to the input data
input_path = 'data/marked/'

# Get the list of input files
input_files = os.listdir(input_path)

# Lists for dataset handling
input_data = []
output_data = []

for file in input_files:
    # Read the input and output data from the CSV file
    df = pd.read_csv(os.path.join(input_path, file))
    # Select the needed columns and transpose them so the array
    # shape fits the neural network input: (2, n_points)
    input_data_from_csv = df[['y1', 'x_val']].values.T
    input_data.append(input_data_from_csv)
    # Append the output data to the list
    output_data.append(df['y2'].values)

# Convert the lists to numpy arrays
X = np.array(input_data)
y = np.array(output_data)

print("Memory size of numpy array X in bytes:", X.nbytes)
print("Memory size of numpy array y in bytes:", y.nbytes)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

As the number of files is large, the whole dataset can't be loaded into memory (this approach works fine with relatively small amounts of data).
For example, for a reduced dataset of 1000 files:

Memory size of numpy array X in bytes: 11936000
Memory size of numpy array y in bytes: 5968000

How can I construct and save the dataset for training in an efficient way?

And how can I save it in a form that is convenient to load and process in the future?
(I know about tf.data.Dataset, but how do I use it in this case?)

It would be great if you could provide some code examples.

There’s a good solution here: Load CSV data  |  TensorFlow Core

The idea is that you create a tf.data dataset from your data. It is loaded lazily, and you can apply preprocessing to it.
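For your case, where each CSV file is one sample rather than one row, a minimal sketch (assuming TensorFlow 2.x, and reusing the input_path and column names from your post; load_case, tf_load_case, and the batch size are just placeholder choices) could build the pipeline from the list of file paths and parse each file on the fly with pandas wrapped in tf.numpy_function:

import tensorflow as tf
import pandas as pd
import numpy as np

input_path = 'data/marked/'

def load_case(path):
    # Runs as plain Python: `path` arrives as a bytes object here.
    df = pd.read_csv(path.decode('utf-8'))
    x = df[['y1', 'x_val']].values.T.astype(np.float32)  # shape (2, n_points)
    y = df['y2'].values.astype(np.float32)                # shape (n_points,)
    return x, y

def tf_load_case(path):
    x, y = tf.numpy_function(load_case, [path], (tf.float32, tf.float32))
    # numpy_function loses static shape information, so set it explicitly.
    x.set_shape((2, None))
    y.set_shape((None,))
    return x, y

dataset = (
    tf.data.Dataset.list_files(input_path + '*.csv', shuffle=True)
    .map(tf_load_case, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=...)

Each file is then read only when its batch is actually requested, so the million-file dataset never has to fit in memory, and the same dataset object can be passed directly to model.fit.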