What is tf.data.Dataset.from_generator in TensorFlow?

Hotshot TensorFlow is here! In this article, we look at exactly what the from_generator API does in TensorFlow for Python. 🙂

The Star of the day: from_generator in TensorFlow

tf.data.Dataset.from_generator lets you build a dataset at runtime from a Python generator, so the data never has to be written to storage first. It’s also helpful when the features in your dataset have different lengths, as with sequences. But please don’t use it to increase the size of your dataset!

You can create a dataset whose elements are defined by a generator function. What’s a generator function? It is a function that yields values one at a time instead of returning them all at once. Calling it returns a generator object, and in Python 3 we fetch each value by passing that object to the built-in next function.

The parameters of tf.data.Dataset.from_generator are:

  1. generator: a callable that returns an iterable, typically a generator function. Its arguments can be supplied separately through args.
  2. output_types: the tf.DType of each element yielded by the generator function, e.g. tf.string, tf.bool, tf.float32, tf.int32.
  3. output_shapes (optional): the tf.TensorShape of each element yielded by the generator function.
  4. args (optional): a tuple of values that are passed to the generator function as NumPy array arguments.
import tensorflow as tf
import numpy as np

def sample_gen(sample):
    if sample == 1:
        for i in range(5):
            yield 2*i
    elif sample == 2:
        for i in range(5):
            yield (10 * i, 20 * i)
    elif sample == 3:
        for i in range(1, 4):
            yield (i, ['The Lion wants food'] * i)

sample_iter = sample_gen(1)
next(sample_iter)
# Output = 0
next(sample_iter)
# Output = 2
sample_iter = sample_gen(3)
next(sample_iter)
# Output = (1, ['The Lion wants food'])
next(sample_iter)
# Output = (2, ['The Lion wants food', 'The Lion wants food'])

Here I have defined a generator function sample_gen() whose output depends on its argument, and called next to read its values one after another.
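
One detail the calls above do not show is what happens when a generator runs out of values. A minimal sketch in plain Python, using a made-up `count_by_two` generator that is not part of the article's code: `next` raises `StopIteration` once the generator is exhausted, and calling the generator function again gives a fresh iterator.

```python
def count_by_two(n):
    """Hypothetical helper for illustration: yields 0, 2, 4, ... (n values)."""
    for i in range(n):
        yield 2 * i

gen = count_by_two(3)
first = next(gen)
second = next(gen)
third = next(gen)

# A fourth call has nothing left to yield, so Python raises StopIteration.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# Calling the function again creates a new, independent generator object.
fresh_values = list(count_by_two(3))

print(first, second, third, exhausted, fresh_values)
```

This is also why from_generator is given the generator function itself rather than an already created generator object: it can call the function again whenever the dataset is iterated from the start.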

Let’s create our first dataset which will look like this:

data1 = tf.data.Dataset.from_generator(sample_gen, tf.int32, args=(1,))

# Output type is tf.int32 because sample_gen yields integers when sample == 1, the value passed through args

# To read from this dataset we need make_initializable_iterator() (TensorFlow 1.x graph mode)

iterator = data1.make_initializable_iterator()

element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(element))
    print(sess.run(element))
    print(sess.run(element))

# Output Dataset =
0
2
4
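
The iterator and session calls above belong to the TensorFlow 1.x graph API. As a rough sketch, assuming TensorFlow 2.4 or later (where eager execution is the default), the same dataset can be read with a plain Python loop. The `output_signature` argument and `tf.TensorSpec` used here do not appear in the original example, and the generator is cut down to the single branch needed.

```python
import tensorflow as tf

def sample_gen(sample):
    # Cut-down copy of the article's generator: only the sample == 1 branch.
    if sample == 1:
        for i in range(5):
            yield 2 * i

# output_signature describes each element: a scalar tf.int32 tensor.
data1 = tf.data.Dataset.from_generator(
    sample_gen,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
    args=(1,),
)

# In eager mode the dataset is a normal Python iterable; take(3) keeps
# the first three elements, matching the three sess.run calls above.
first_three = [int(x.numpy()) for x in data1.take(3)]
print(first_three)
```

No iterator initializer or session is needed in this mode, since each element arrives as an eager tensor.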

When the generator yields several values per element, or values of different lengths:

data2 = tf.data.Dataset.from_generator(sample_gen, (tf.int32, tf.int32), args=(2,))

# args passes sample == 2; both values of each tuple are tf.int32

# Output Dataset =
(0, 0)
(10, 20)
(20, 40)

data3 = tf.data.Dataset.from_generator(sample_gen, (tf.int32, tf.string), args=(3,))

# args passes sample == 3; each tuple holds a tf.int32 value and a tf.string value

# Output Dataset =
(1, array([b'The Lion wants food'], dtype=object))
(2, array([b'The Lion wants food', b'The Lion wants food'], dtype=object))
(3, array([b'The Lion wants food', b'The Lion wants food',
       b'The Lion wants food'], dtype=object))
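
The third dataset yields string lists whose length changes from element to element. One way to state that explicitly is the optional output_shapes parameter, with `None` standing for the dimension that varies. A hedged sketch, assuming TensorFlow 2.x with eager execution so the elements can be read in a loop; the generator name `lion_gen` is made up for illustration. Newer TensorFlow releases mark output_types and output_shapes as deprecated in favour of output_signature, so this form may print a warning.

```python
import tensorflow as tf

def lion_gen():
    # Hypothetical stand-alone version of the sample == 3 branch.
    for i in range(1, 4):
        yield (i, ['The Lion wants food'] * i)

data3 = tf.data.Dataset.from_generator(
    lion_gen,
    output_types=(tf.int32, tf.string),
    # Scalar count, then a 1-D string tensor of unknown length.
    output_shapes=(tf.TensorShape([]), tf.TensorShape([None])),
)

# Each pass over the dataset calls lion_gen again.
counts = [int(count.numpy()) for count, _ in data3]
lengths = [int(text.shape[0]) for _, text in data3]
print(counts, lengths)
```

Declaring the shape does not change the values produced. It only records that the second component is one-dimensional with a length that is not known in advance.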


That’s all for today!

