Load CSV Data using tf.data and Data Normalization in Tensorflow

In this tutorial, we will learn how to load CSV data using tf.data in TensorFlow with Python. We will load the Titanic dataset, which is hosted in the tf-datasets storage bucket, and then see why normalization is needed and how to normalize a numeric column.
First, let’s look at what CSV data is and why it matters.

What is CSV data?

CSV is a plain text format where the values are separated by commas. The full form is Comma Separated Values. For example,

Belinda Jameson,2017,Cushing House,148,3.52

In the above example, there are 5 values separated by 4 commas. Every row of a CSV file follows this pattern. Many datasets are distributed as CSV files, so loading CSV data is often the first step in analyzing a dataset.
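As a minimal sketch, Python’s built-in csv module can split such a line into its values. The in-memory string below stands in for a file and is used only for illustration.

```python
import csv
import io

# The sample line from above, held in memory instead of a file.
line = "Belinda Jameson,2017,Cushing House,148,3.52"

# csv.reader yields one list of strings per line of input.
fields = next(csv.reader(io.StringIO(line)))

print(len(fields))  # number of values in the row
print(fields)
```

Note that every value comes back as a string; converting "2017" or "3.52" to a number is a separate step.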

Install Tensorflow

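TensorFlow can be installed from two pip packages –

  • tensorflow (the stable release)
  • tf-nightly (the nightly build)

Here we will install the tf-nightly package. On TensorFlow 1.x, where eager execution is not enabled by default, iterating over a dataset fails with the error "dataset.__iter__() is only supported when eager execution is enabled". Any TensorFlow 2.x release, where eager execution is the default, should also work.

Install the package and import the libraries we need –

!pip install tf-nightly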
import numpy as np
import tensorflow as tf

Get the Dataset from URL using tf.keras in Tensorflow

The Titanic dataset is hosted in the tf-datasets storage bucket. The training data is downloaded from https://storage.googleapis.com/tf-datasets/titanic/train.csv and the evaluation data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv.
We download both files using tf.keras.utils.get_file, which saves each file locally and returns its path.

The code for this is:

train_url = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
test_url = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file = tf.keras.utils.get_file("train.csv", train_url)
test_file = tf.keras.utils.get_file("eval.csv", test_url)

np.set_printoptions(precision=3, suppress=True)  # print at most 3 decimal places and no scientific notation, so numeric output is easier to read

Load CSV Data in Tensorflow – Python

Before loading a CSV file, it helps to look at its contents. Let’s look at the last few lines of the file (the ! prefix runs a shell command from a Jupyter or Colab notebook) –

!tail {train_file}

Output-

1,female,15.0,0,0,7.225,Third,unknown,Cherbourg,y
0,male,20.0,0,0,9.8458,Third,unknown,Southampton,y
0,male,19.0,0,0,7.8958,Third,unknown,Southampton,y
0,male,28.0,0,0,7.8958,Third,unknown,Southampton,y
0,female,22.0,0,0,10.5167,Third,unknown,Southampton,y
0,male,28.0,0,0,10.5,Second,unknown,Southampton,y
0,male,25.0,0,0,7.05,Third,unknown,Southampton,y
1,female,19.0,0,0,30.0,First,B,Southampton,y
0,female,28.0,1,2,23.45,Third,unknown,Southampton,n
0,male,32.0,0,0,7.75,Third,unknown,Queenstown,y

The first value is either 0 or 1, which denotes whether the passenger died or survived respectively; this is the value we need to predict. The second value is the sex of the passenger, and each of the remaining values is another feature.

You can see the names of all the features by looking at the head of the CSV file.

!head {train_file}

Output-

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

The first row contains the column names of the Titanic dataset.

In this dataset, we have to predict whether a passenger survived or not. So our label column is survived.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]
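To illustrate what separating the label from the features means, here is a plain-Python sketch that is not part of the TensorFlow pipeline. It reads two rows copied from the head output above and pops the survived value out of each row. The names SAMPLE and split_label are hypothetical and exist only for this example.

```python
import csv
import io

# Header plus two rows copied from the head output above.
SAMPLE = """survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
"""

def split_label(row, label_column="survived"):
    """Return (features, label) for one row dictionary."""
    features = dict(row)                     # copy so the input is not modified
    label = int(features.pop(label_column))  # the value we want to predict
    return features, label

pairs = [split_label(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for features, label in pairs:
    print(label, features["sex"], features["age"])
```

Each pair holds the remaining nine features and the label, which is the same (features, label) shape that the dataset built later in this tutorial produces per batch.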

 

Now we have to create a dataset from the given CSV file. To do this we will use tf.data.experimental.make_csv_dataset. We could also read the file into a pandas DataFrame, convert it to NumPy arrays and pass those arrays to TensorFlow, but that approach loads the whole file into memory, so it does not scale well to large datasets.

Now the code for creating our dataset is –

def get_dataset(file_path,**kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True,**kwargs)
  return dataset

Let’s go through the code line by line –

  1. The get_dataset function takes file_path, the path to a CSV file. For our Titanic dataset, we can pass either train_file or test_file.
  2. **kwargs forwards any extra keyword arguments to make_csv_dataset. For example, if the CSV file has no header row with the column names, we can supply them through the column_names argument, as shown later.
  3. To create the dataset we pass file_path and the label name (the column to be predicted) to tf.data.experimental.make_csv_dataset.
  4. We set batch_size to 5 so that the output is easy to read (batch_size=5 means each batch contains 5 rows).
  5. na_value="?" tells the parser which additional string to treat as a missing value, and ignore_errors=True skips rows that fail to parse instead of raising an error.
  6. num_epochs is the number of times the dataset is repeated; here we set it to 1, so each row is read once.
  7. Finally, we return the dataset created by tf.data.experimental.make_csv_dataset.
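The batching described in the list can be pictured with a plain-Python sketch. This is an illustration only and not how make_csv_dataset is implemented: it skips shuffling, type inference and tensors. The helper name batches and the trimmed SAMPLE text are hypothetical.

```python
import csv
import io

# A few rows copied from the head of train.csv (columns trimmed for brevity).
SAMPLE = """survived,sex,age
0,male,22.0
1,female,38.0
1,female,26.0
1,female,35.0
0,male,28.0
0,male,2.0
1,female,27.0
"""

def batches(text, batch_size, label_name):
    """Yield (features, labels); features maps column name -> list of values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        labels = [int(row[label_name]) for row in chunk]
        names = [name for name in chunk[0] if name != label_name]
        features = {name: [row[name] for row in chunk] for name in names}
        yield features, labels

result = list(batches(SAMPLE, batch_size=5, label_name="survived"))
for features, labels in result:
    print(labels, features)
```

With 7 rows and a batch size of 5, the sketch yields one full batch and one smaller final batch. Each batch groups values by column, which is the same layout the printed output later in this tutorial shows.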

Using the get_dataset function we get datasets that TensorFlow can work with. The code to create the training and test datasets is –

train_data = get_dataset(train_file)
test_data = get_dataset(test_file)

Now to view the dataset generated by get_dataset we can write a function that will take train_data as input and show the data as output.

def show(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))
show(train_data)

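Here key is the name of a feature, and value.numpy() converts the tensor for that feature into a NumPy array holding one value per row of the batch (5 values here). Because make_csv_dataset shuffles the rows by default, the values you see will differ from run to run.
The output for the above code is –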

sex                 : [b'male' b'female' b'male' b'female' b'female']
age                 : [28. 28. 34. 28. 37.]
n_siblings_spouses  : [0 2 1 0 0]
parch               : [0 0 0 2 0]
fare                : [ 7.796 23.25  21.    22.358  9.587]
class               : [b'Third' b'Third' b'Second' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Queenstown' b'Southampton' b'Cherbourg' b'Southampton']
alone               : [b'y' b'n' b'n' b'n' b'y']

NOTE: On TensorFlow 1.x without eager execution enabled, this code raises the error mentioned in the installation section.

How to make changes in your dataset in Tensorflow

Suppose our CSV file does not contain a header row with the column names. In that case we can create a list of strings with the feature names and pass it to get_dataset through the column_names argument, which is forwarded to make_csv_dataset. For a file that truly has no header row, also pass header=False so that the first data row is not skipped as a header.

The code for the above explanation is:

# Each string in the list names one column, in file order.
FEATURE_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']
# Pass the CSV file to get_dataset along with the column names.
temp = get_dataset(train_file, column_names=FEATURE_COLUMNS)
show(temp)  # display one batch

The output will look like –

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [28. 34. 18. 24. 11.]
n_siblings_spouses  : [0 0 0 0 0]
parch               : [0 0 0 0 0]
fare                : [ 7.75  10.5   73.5   83.158 18.788]
class               : [b'Third' b'Second' b'Second' b'First' b'Third']
deck                : [b'unknown' b'F' b'unknown' b'C' b'unknown']
embark_town         : [b'Queenstown' b'Southampton' b'Southampton' b'Cherbourg' b'Cherbourg']
alone               : [b'y' b'y' b'y' b'y' b'y']
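The idea of supplying column names for a file that lacks them can also be sketched with Python’s built-in csv module. This is an analogy rather than the TensorFlow code path; HEADERLESS is a hypothetical two-row file held in memory.

```python
import csv
import io

# Two data rows without a header line (hypothetical headerless file).
HEADERLESS = """0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
"""

COLUMN_NAMES = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch',
                'fare', 'class', 'deck', 'embark_town', 'alone']

# fieldnames= supplies the names that the file itself does not contain,
# so the first line is treated as data rather than as a header.
rows = list(csv.DictReader(io.StringIO(HEADERLESS), fieldnames=COLUMN_NAMES))
print(rows[0]["embark_town"])
```

The names are matched to values by position, so the list must follow the column order of the file.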

Now suppose you want to train on only some of the columns. You can select them by passing a list of their names to get_dataset through the select_columns argument and then display the result. Note that the list below includes the label column, survived.

The code for selecting columns and visualizing them is –

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp = get_dataset(train_file, select_columns=SELECT_COLUMNS)

show(temp)

The output for the above code is –

age                 : [27. 28. 31. 45. 66.]
n_siblings_spouses  : [0 1 0 0 0]
class               : [b'Third' b'First' b'Second' b'Third' b'Second']
deck                : [b'unknown' b'D' b'unknown' b'unknown' b'unknown']
alone               : [b'y' b'n' b'y' b'n' b'y']

 

Data Normalization in Tensorflow

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.

Suppose we want to normalize the ‘age’ column of the above dataset. We will use numpy to calculate the mean and standard deviation of the column.

To normalize a column, we subtract the column mean from each value and divide the result by the column’s standard deviation: normalized = (value - mean) / std.

To implement the above details we will create a function –

import numpy as np
def normalize(data, mean, std):
  return (data-mean)/std

To select the age column we will use pandas to read the CSV file, and we will look at the first five values before normalization.

 

import pandas as pd
NUMERIC_FEATURES = ['age']
# train_file is the path returned by tf.keras.utils.get_file earlier
x = pd.read_csv(train_file)[NUMERIC_FEATURES].head()
x

Output for the above code –

    age
0  22.0
1  38.0
2  26.0
3  35.0
4  28.0

Now we will find the mean and standard deviation of the column using numpy, pass the age column to our normalize function, and compare the normalized column with the original one.

MEAN = np.mean(x)
STD = np.std(x)  # population standard deviation (ddof=0)
x = normalize(x, MEAN, STD)  # apply the normalize function defined above
x

The output for the above code is –

        age
0 -1.326807
1  1.394848
2 -0.646393
3  0.884538
4 -0.306186

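After normalization the values are centered around 0 and expressed in units of standard deviation, so columns with very different ranges end up on a comparable scale. The same steps can be applied to any numeric feature column. Note that here the mean and standard deviation were computed from the first five rows only; in practice they would be computed over the whole training column.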
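The normalized values can be cross-checked without numpy or pandas. The sketch below uses Python’s statistics module on the five ages shown earlier; it assumes the population standard deviation (dividing by N), which is what np.std computes by default.

```python
from statistics import fmean, pstdev

def normalize(value, mean, std):
    # Same formula as the normalize function above, applied to one number.
    return (value - mean) / std

ages = [22.0, 38.0, 26.0, 35.0, 28.0]  # the five ages shown earlier

mean = fmean(ages)   # arithmetic mean
std = pstdev(ages)   # population standard deviation (divides by N)

z_scores = [normalize(a, mean, std) for a in ages]
print([round(z, 6) for z in z_scores])
```

The normalized values themselves have a mean of 0 and a standard deviation of 1, which is a quick way to confirm that the formula was applied correctly.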
