Build TensorFlow to get a FREE performance increase from your CPU

Minyang Chen
6 min read · Mar 5, 2019
Tensorflow logo

After spending some time learning TensorFlow, I set up my PC and laptop with a stock version of TensorFlow (built without optimization flags) along with Keras and Jupyter Notebook. One thing I noticed when running my notebook models was a bunch of warnings in the console output, something like: “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2”. This is only a warning-level message.

Initially I ignored it, because it doesn’t stop any training work. Then I read an article explaining that modern CPUs provide a lot of low-level arithmetic instructions, also known as extensions, such as SSE2, SSE4, AVX, etc. Check your CPU vendor’s spec for the exact extensions supported; some older CPUs may support AVX, but not AVX2 or FMA.
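To see which of these extensions your own CPU supports, you can read the flags line from /proc/cpuinfo. A minimal sketch (Linux-only; on other platforms consult your CPU vendor's spec instead):

```python
import os

# SIMD-related extensions relevant to TensorFlow builds.
INTERESTING = {"sse2", "sse4_1", "sse4_2", "avx", "avx2", "fma"}

def simd_flags(cpuinfo_text):
    """Return the supported extensions listed in a /proc/cpuinfo dump."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return sorted(INTERESTING & set(line.split(":", 1)[1].split()))
    return []

# On a Linux machine, inspect the real CPU:
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(simd_flags(f.read()))
```

If `avx2` or `fma` is missing from the output, the warning above is telling you about extensions your CPU simply does not have.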

How do these extensions actually help? The warning message states that your CPU does support AVX, FMA, etc., which is a good thing. Two in particular are related to the warning message: AVX (Advanced Vector Extensions) and the closely related FMA extension, which provides fused multiply-accumulate operations. Together they speed up linear algebra computations such as dot products, matrix multiplication, convolution, etc.

Almost every machine-learning training workload is built on these operations, so a TensorFlow binary compiled for a CPU that supports AVX and FMA can speed up arithmetic operations by up to 300%.
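To make “fused multiply-accumulate” concrete: a dot product is just a long chain of acc = acc + x * y steps. A plain-Python sketch of the operation being accelerated:

```python
def dot(xs, ys):
    # Each loop iteration is one multiply-accumulate: acc = acc + x * y.
    # With FMA this is a single CPU instruction, and with AVX the CPU
    # applies it to several packed elements per cycle.
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

An optimized TensorFlow build lets the compiled kernels use those instructions instead of scalar multiply-then-add sequences.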

A TensorFlow GPU build will still outperform any CPU build many times over, so this is not a direct comparison. Rather, it is about getting a FREE performance increase from your CPU when no GPU is present, which is particularly useful when running on a laptop without an Nvidia GPU installed.

Therefore, I think it is worthwhile to build TensorFlow from source to get this extra FREE performance boost from the CPU. So, let’s dive into the details and discuss the few available options.

Option-1 Download a pre-built version from Intel

Intel builds TensorFlow using Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) primitives to deliver maximum performance on your CPU. To use it, just run pip install:

$ pip install intel-tensorflow

See this link for more details: https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide
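After installing, you can check whether the MKL-DNN backend is actually active. TensorFlow 1.x exposes a helper for this, though it lives in an internal module, so treat its availability as an assumption that may change between versions:

```python
def mkl_status():
    try:
        # test_util.IsMklEnabled() reports whether this TensorFlow binary
        # was built against MKL-DNN (internal API, TF 1.x).
        from tensorflow.python.framework import test_util
        return "MKL enabled" if test_util.IsMklEnabled() else "MKL not enabled"
    except (ImportError, AttributeError):
        return "cannot determine (tensorflow missing or API changed)"

print(mkl_status())
```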

Option-2 Download pre-built TensorFlow binaries with extension flags

Let’s start with an easy and time-saving approach: download a pre-built TensorFlow binary that supports FMA, AVX, AVX2, SSE4.1, and SSE4.2.

Check out https://github.com/lakshayg/tensorflow-build. This repository contains custom builds of TensorFlow. To install one on your system, download the correct file for your versions of Python and GCC and install it with pip.

If one of these pre-built packages works for you, you should see no warning messages and a boost in model training speed. How do you measure the performance gain from an optimized build? I will show how to test the performance increase later on.

However, if your CPU doesn’t support all of these extensions, you still need to build from source with only the extensions your particular CPU supports.

Option-3 Build TensorFlow from source

Detailed instructions are provided in Google’s guide to building TensorFlow from source: https://www.tensorflow.org/install/source

At a high level, the main steps can be grouped as follows:

  • Install python dependencies
  • Install Bazel build tool
  • Configure environment and options
  • Run Bazel build
  • Build the PIP package
  • Install the package

Here’s the short version of the build steps I used for my CPU build.

0. pre-requisites

pip install -U --user pip six numpy wheel mock

pip install -U --user keras_applications==1.0.6 --no-deps

pip install -U --user keras_preprocessing==1.0.5 --no-deps

1. install bazel

Install Bazel, the build tool used to compile TensorFlow.

https://docs.bazel.build/versions/master/install.html

2. clone source

git clone https://github.com/tensorflow/tensorflow.git

cd tensorflow

./configure

3. Build source

# use these build flags if your CPU supports all of them

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 -k //tensorflow/tools/pip_package:build_pip_package

# for my older PC with an i5-3x CPU, choose only the supported flags

bazel build -c opt --copt=-mavx --copt=-mfpmath=both --copt=-msse4.2 -k //tensorflow/tools/pip_package:build_pip_package
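If you are unsure which flags to pass, a small helper can translate the flags reported in /proc/cpuinfo into the matching --copt options. The mapping below mirrors the bazel commands above; it is a sketch, so adapt it to your own CPU:

```python
# Map /proc/cpuinfo flag names to the corresponding gcc/bazel options.
FLAG_TO_COPT = {
    "avx": "--copt=-mavx",
    "avx2": "--copt=-mavx2",
    "fma": "--copt=-mfma",
    "sse4_2": "--copt=-msse4.2",
}

def bazel_copts(cpu_flags):
    """Given the set of flags from /proc/cpuinfo, return safe --copt options."""
    return [copt for flag, copt in FLAG_TO_COPT.items() if flag in cpu_flags]

# Example: an older CPU with AVX but no AVX2/FMA.
print(" ".join(bazel_copts({"sse2", "sse4_2", "avx"})))
# --copt=-mavx --copt=-msse4.2
```

Passing a flag for an extension the CPU lacks produces a binary that crashes with an illegal-instruction error, so it is worth getting this list right before the long build.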

4. Build wheel package

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

5. Install

pip install /tmp/tensorflow_pkg/tensorflow-version-tags.whl

WARNING:

Please note that building TensorFlow takes a long time, possibly a couple of hours or more. It’s the longest build process I have ever seen, so don’t turn off your PC or laptop; let the build finish completely.

Performance Testing

Now that we have a CPU-optimized TensorFlow build deployed, how do we test the performance gain? Re-running your training model and comparing the training time is one way to do it. Alternatively, we can use some of the training samples that ship with TensorFlow.

The quickest test I tried is the following:

  1. git clone https://github.com/tensorflow/models.git
  2. python3 models/tutorials/image/cifar10/cifar10_train.py

The output shows, among other things, how many examples can be processed per second.
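The two throughput numbers in each log line are consistent with each other: cifar10_train uses a batch size of 128 by default, so examples/sec is simply the batch size divided by sec/batch. A quick sanity check:

```python
def examples_per_sec(batch_size, sec_per_batch):
    # Throughput as reported by cifar10_train: examples processed per second.
    return batch_size / sec_per_batch

# 0.178 sec/batch at batch size 128, as in the first stock-build log line below.
print(round(examples_per_sec(128, 0.178), 1))  # ~719, close to the logged 717.7
```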

Test Result: Stock Tensorflow v1.12 (CPU i7-4770)

To construct input pipelines, use the `tf.data` module.

2019-03-05 10:35:53.387879: step 0, loss = 4.68 (717.7 examples/sec; 0.178 sec/batch)

2019-03-05 10:35:55.529421: step 10, loss = 4.62 (597.7 examples/sec; 0.214 sec/batch)

2019-03-05 10:35:57.593186: step 20, loss = 4.48 (620.2 examples/sec; 0.206 sec/batch)

2019-03-05 10:35:59.714303: step 30, loss = 4.38 (603.5 examples/sec; 0.212 sec/batch)

2019-03-05 10:36:01.819969: step 40, loss = 4.35 (607.9 examples/sec; 0.211 sec/batch)

2019-03-05 10:36:03.911619: step 50, loss = 4.34 (612.0 examples/sec; 0.209 sec/batch)

2019-03-05 10:36:05.985446: step 60, loss = 4.21 (617.2 examples/sec; 0.207 sec/batch)

Note: the stock build result is not bad at 600+ samples per second.

Test Result-2: Using the Intel build of Tensorflow

WARNING:tensorflow:From /workdisk/ts_intel/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.

Instructions for updating:

To construct input pipelines, use the `tf.data` module.

2019-03-05 10:38:30.533268: step 0, loss = 4.67 (96.4 examples/sec; 1.328 sec/batch)

2019-03-05 10:38:34.676802: step 10, loss = 4.62 (308.9 examples/sec; 0.414 sec/batch)

2019-03-05 10:38:38.104557: step 20, loss = 4.49 (373.4 examples/sec; 0.343 sec/batch)

2019-03-05 10:38:41.844844: step 30, loss = 4.37 (342.2 examples/sec; 0.374 sec/batch)

2019-03-05 10:38:45.081309: step 40, loss = 4.38 (395.5 examples/sec; 0.324 sec/batch)

2019-03-05 10:38:48.811644: step 50, loss = 4.25 (343.1 examples/sec; 0.373 sec/batch)

2019-03-05 10:38:52.501662: step 60, loss = 4.29 (346.9 examples/sec; 0.369 sec/batch)

2019-03-05 10:38:55.862010: step 70, loss = 4.22 (380.9 examples/sec; 0.336 sec/batch)

2019-03-05 10:38:59.404764: step 80, loss = 4.19 (361.3 examples/sec; 0.354 sec/batch)

Note: this result is pretty bad at 350+ samples per second, much lower than the stock TensorFlow build. One possible reason could be that I am not using Anaconda.

Test Result-3: Using local build from source

Instructions for updating:

To construct input pipelines, use the `tf.data` module.

I0305 10:43:01.926581 140641691903808 basic_session_run_hooks.py:594] Saving checkpoints for 0 into /tmp/cifar10_train/model.ckpt.

2019-03-05 10:43:03.043505: step 0, loss = 4.68 (708.6 examples/sec; 0.181 sec/batch)

2019-03-05 10:43:04.535182: step 10, loss = 4.66 (858.1 examples/sec; 0.149 sec/batch)

2019-03-05 10:43:05.947437: step 20, loss = 4.52 (906.4 examples/sec; 0.141 sec/batch)

2019-03-05 10:43:07.332444: step 30, loss = 4.60 (924.2 examples/sec; 0.139 sec/batch)

2019-03-05 10:43:08.704526: step 40, loss = 4.42 (932.9 examples/sec; 0.137 sec/batch)

2019-03-05 10:43:10.092247: step 50, loss = 4.29 (922.4 examples/sec; 0.139 sec/batch)

2019-03-05 10:43:11.520808: step 60, loss = 4.23 (896.0 examples/sec; 0.143 sec/batch)

2019-03-05 10:43:12.937116: step 70, loss = 4.30 (903.8 examples/sec; 0.142 sec/batch)

2019-03-05 10:43:14.332398: step 80, loss = 4.26 (917.4 examples/sec; 0.140 sec/batch)

Note: wow, what an improvement in performance: 900+ samples per second.

Lastly, for the sake of comparing TensorFlow performance between CPU and GPU, it’s worthwhile to run the same test with a GPU-enabled TensorFlow.

Here’s the result on the same CPU (i7-4770, with an RTX 2070) running TensorFlow 1.12. As you can see below, performance is incredible on the GPU… but at a cost, of course.


2019-03-05 10:50:21.313014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6742 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)

WARNING:tensorflow:From /workdisk/gpupy3env/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.

Instructions for updating:

To construct input pipelines, use the `tf.data` module.

2019-03-05 10:50:24.073767: step 0, loss = 4.68 (393.0 examples/sec; 0.326 sec/batch)

2019-03-05 10:50:24.257177: step 10, loss = 4.63 (6978.9 examples/sec; 0.018 sec/batch)

2019-03-05 10:50:24.359443: step 20, loss = 4.47 (12516.3 examples/sec; 0.010 sec/batch)

2019-03-05 10:50:24.453675: step 30, loss = 4.61 (13583.4 examples/sec; 0.009 sec/batch)

2019-03-05 10:50:24.548656: step 40, loss = 4.37 (13476.4 examples/sec; 0.009 sec/batch)

2019-03-05 10:50:24.644636: step 50, loss = 4.32 (13336.2 examples/sec; 0.010 sec/batch)

2019-03-05 10:50:24.738154: step 60, loss = 4.19 (13687.0 examples/sec; 0.009 sec/batch)

Conclusion:

As the results show, it’s very clear that the local build from source boosts performance quite a bit compared with the stock build: where the stock build gets around 600+ samples per second, the local build hits 900+ samples per second, roughly a 50% free performance increase on the same hardware. And the GPU is still the best option in terms of raw performance.
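For the record, the relative gain can be computed directly from the measured throughputs (using the figures from the test runs above):

```python
def speedup_pct(baseline, optimized):
    # Relative throughput improvement, in percent.
    return (optimized - baseline) / baseline * 100

# Stock build ~603 examples/sec vs. source build ~900 examples/sec.
print(f"{speedup_pct(603, 900):.0f}%")  # 49%
```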

I hope you find this article useful. If you liked it, clap, or leave me a comment or suggestion. Have a beautiful day.

Thanks — Min yang

