
Using tensorflow.keras (2.0-alpha0 with GPU support), I get extremely long initialization times with tf.keras.Model.fit(), both on newly compiled models and on models previously saved and reloaded.

I believe this happens after the tf.data.Dataset objects have already been loaded and preprocessed, so I don't understand what is taking so long, and there is no output from TF/Keras during this period:

2019-04-19 23:29:18.109067: tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device
Resizing images and creating data sets with num_parallel_calls=8
Loading existing model to continue training.
Starting model.fit()
Epoch 1/100
2019-04-19 23:32:22.934394: tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Shuffle buffer filled.
2019-04-19 23:38:52.374924: tensorflow/core/common_runtime/bfc_allocator.cc:230] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.62GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

3 minutes to load the model and fill the shuffle buffer, and 6 minutes for ... what? And how can this mysterious work be optimized? (5 GHz 8700K, 32 GB RAM, NVMe SSD, 1080 Ti with 11 GB GDDR5X; Task Manager shows 100% single-thread CPU use, moderate disk access, RAM usage slowly growing to ~28 GB, and zero GPU usage during this period.)

Is there any way to serialize or store the models in a more efficient way such that they can be started and stopped regularly without the 10 minutes of overhead?
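For reference, the save/reload workflow I'm describing uses the two standard Keras serialization options. This is only a sketch with a placeholder model, not my actual code:

import tensorflow as tf

# Placeholder model; the real architecture isn't shown here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Full model: architecture + weights + optimizer state.
model.save("model.h5")
model = tf.keras.models.load_model("model.h5")

# Weights only, which is lighter if the architecture is rebuilt in code each run.
model.save_weights("weights.h5")
model.load_weights("weights.h5")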

Is TF/Keras somehow lazy-loading the data sets and preprocessing them in this period?

1 Answer


It looks like an issue with using multiple workers for the tf.data.Dataset pipeline rather than with the model itself. The log messages show that you're preprocessing with num_parallel_calls=8, i.e. eight parallel calls, which would explain the high CPU and RAM usage you're seeing.

To my knowledge, the first pass over a Dataset is expected to be fairly slow; it gets faster once the data has been cached.

If the model.fit() call still starts very slowly, you can turn the number of parallel calls down to 4 or 2. That might increase your training time, since the input pipeline will read data from the SSD more slowly. A sketch of a pipeline with these knobs is shown below.
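For illustration, here is a minimal sketch of an image pipeline with the two knobs discussed here: caching and a reduced num_parallel_calls. The file pattern, image size, and batch/buffer sizes are placeholders rather than values from the question.

import tensorflow as tf

IMG_SIZE = 224  # placeholder target size

def load_and_resize(path):
    # Decode one JPEG and resize it to the target shape.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    return tf.image.resize(image, [IMG_SIZE, IMG_SIZE])

dataset = (
    tf.data.Dataset.list_files("images/*.jpg")    # placeholder glob
    .map(load_and_resize, num_parallel_calls=2)   # try 2 or 4 instead of 8
    .cache()                                      # reuse decoded images after the first pass
    .shuffle(1000)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)      # overlap preprocessing with training
)

# model.fit(dataset, epochs=100)

With .cache() placed after the expensive map, only the first epoch pays the full preprocessing cost; later epochs read from the in-memory cache.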
