TensorFlow 2.3 from source on Fedora 32

Actually worked

The drive for building from source was that :

  • my Nvidia Titan X (Maxwell) GPU has a Compute Capability of 5.2, which is no longer supported by the prebuilt TensorFlow binaries (a quick check for this is sketched below);
  • Negativo (my preferred Nvidia driver repo) had moved its CUDA version on from 10.0 to 10.2.
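
If you're unsure of your card's Compute Capability, one way to check (assuming the nvidia driver is already installed) is to list the GPU model and look it up :

# The model name reported here can be looked up on
# https://en.wikipedia.org/wiki/CUDA to find its Compute Capability
nvidia-smi -L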

Install / Check the nvidia packages

The following rpm packages are required (your versions may differ, but the packages should be there). As root :

dnf config-manager --add-repo=https://negativo17.org/repos/fedora-nvidia.repo
# Get the cuda drivers
dnf install cuda cuda-devel cuda-cudnn-devel nvidia-driver-cuda
# Get the appropriate gcc version (e.g. ~gcc 8.3)
dnf install cuda-gcc cuda-gcc-c++ cuda-gcc-gfortran
# Install this - will be made use of if detected
dnf install blas-devel

Check on the installed versions :

# https://github.com/negativo17/cuda
rpm -qa | grep nvidia-driver-devel
# "" == Not present, since I don't need use the Nvidia card to drive the display
rpm -qa | grep cuda-devel
# cuda-devel-10.2.89-2.fc32.x86_64
rpm -qa | grep cuda-cudnn-devel
# cuda-cudnn-devel-7.6.5.32-1.fc32.x86_64
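
As an extra sanity check (assuming the negativo cuda packages put nvcc on the PATH), the toolkit itself can confirm its release :

# Expect this to report release 10.2, matching cuda-devel above
nvcc --version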

Prepare user-land set-up

Building TensorFlow needs several preparatory steps :

  • Create a virtualenv so that Python knows which version it's building for
  • Set up the defaults correctly (using export rather than tedious CLI interaction)
  • Build a pip package with bazel (iterate to fix the problems...)
  • Install the pip package

Set up Python : python-3.8 virtualenv

NB: Do this outside the eventual tensorflow source tree:

virtualenv-3.8 --system-site-packages ~/env38
. ~/env38/bin/activate

virtualenv set-up

# https://www.tensorflow.org/install/source
pip install -U pip six 'numpy<1.19.0' wheel setuptools mock 'future>=0.17.1'
pip install -U keras_applications --no-deps
pip install -U keras_preprocessing --no-deps
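
A quick sanity check that the virtualenv is actually the one in use :

# Both of these should point inside ~/env38
which python
# Given the pins above, expect a numpy below 1.19.0
python -c "import numpy; print(numpy.__version__)"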

Download tensorflow at a specific release

As a regular user :

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow/

# https://github.com/tensorflow/tensorflow/releases
git checkout v2.3.0
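
A quick check that the working tree really is at the expected release :

# Should print the tag checked out above
git describe --tags
# v2.3.0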

bazel installation via bazelisk

As a regular user, make bazelisk available and invocable as bazel :

# https://github.com/bazelbuild/bazelisk
#   5.9MB download
wget https://github.com/bazelbuild/bazelisk/releases/download/v1.1.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 ~/env38/bin/bazel
chmod 754 ~/env38/bin/bazel

# On first invocation, bazelisk downloads and unpacks the requested bazel release

# Check the version required:
grep _BAZEL_VERSION tensorflow/configure.py

#USE_BAZEL_VERSION=0.29.1  # Used for TFv2.1
USE_BAZEL_VERSION=3.4.1
export USE_BAZEL_VERSION

bazel version
#Build label: 3.4.1
#Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
#Build time: Tue Jul 14 06:27:53 2020 (1594708073)
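
As an aside, bazelisk also honours a .bazelversion file in the workspace root, though the USE_BAZEL_VERSION export used here takes precedence over it (useful to know, since the tensorflow checkout may pin its own version) :

# Optional alternative : pin the version in the workspace instead
# (USE_BAZEL_VERSION, if set, still wins over this file)
echo "3.4.1" > tensorflow/.bazelversion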

./configure machine compilation defaults

The following exports are the command-line equivalent of answering ./configure's long series of interactive questions (which otherwise resist automation) :

# https://github.com/tensorflow/tensorflow/issues/7542#issue-207940753
export PYTHON_BIN_PATH=`which python`
export PYTHON_LIB_PATH=`dirname \`which python\``/../lib/python3.8/site-packages

export TF_ENABLE_XLA=1
export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=0
export TF_NEED_OPENCL=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
export TF_NEED_HDFS=0

#https://en.wikipedia.org/wiki/CUDA
#  5.2 ~ Titan X (Maxwell)
#  6.1 ~ 1060
#  7.5 ~ 2070S
export TF_CUDA_COMPUTE_CAPABILITIES=5.2  # TitanX
#export TF_CUDA_COMPUTE_CAPABILITIES=5.2,6.1,7.5  # TitanX, 1060 and 2070

export TF_CUDA_CLANG=0
export GCC_HOST_COMPILER_PATH=`which gcc`
export CC_OPT_FLAGS="-march=native -Wno-sign-compare"
export TF_SET_ANDROID_WORKSPACE=0

./configure

#You have bazel 3.4.1 installed.
#Found CUDA 10.2 in:
#    /usr/lib64
#    /usr/include/cuda
#Found cuDNN 7 in:
#    /usr/lib64
#    /usr/include/cuda
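
Since these exports evaporate with the shell, it can save time to keep them in a small script (tf-build-env.sh is just a suggested name) and source it before any re-run of ./configure :

# Save the export block above into a re-usable file
cat > ~/tf-build-env.sh <<'EOF'
export TF_NEED_CUDA=1
export TF_CUDA_COMPUTE_CAPABILITIES=5.2
# ... plus the rest of the exports above ...
EOF

. ~/tf-build-env.sh && ./configure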

( Fixes required to successfully compile )

Fix an apparent mistake (surely not??) in the bazel cuda configuration, which looks for a libcuda.so stub in a stubs/ directory that the negativo packages don't provide. Either patch the config in place, or (reproducibly) create the symlink it expects :

# Manual method :
#joe ./third_party/gpus/cuda_configure.bzl  L621 :: stub="" # always

# Reproducible method :
mkdir -p /usr/lib64/stubs/
ln -s /usr/lib64/libcuda.so /usr/lib64/stubs/libcuda.so
ls -l  /usr/lib64/stubs/

Also, fix the gcc version to make it compatible with cuda (the host_config.h check rejects any gcc newer than 8, which rules out Fedora 32's default gcc) :

# Bad method :
#joe /usr/include/cuda/crt/host_config.h  L138 ::  >8 -> >18

# Better method (requires the `dnf install cuda-gcc ...` from above)
pushd /usr/local/bin/
ln -s /usr/bin/cuda-gcc gcc
ln -s /usr/bin/cuda-g++ g++
ln -s /usr/bin/cuda-gcc-gfortran gcc-gfortran
popd

The latter fix needs to be 'undone' after a successful build (see the cool-down section at the end).
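
Since /usr/local/bin normally precedes /usr/bin on a Fedora PATH, a quick check that the override has taken effect :

# Should resolve to the /usr/local/bin symlink, i.e. cuda-gcc
which gcc
gcc --version | head -1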

bazel build the pip package (builds tensorflow too)

This took over 4 hours (when it worked cleanly) on an 8-thread i7 machine (with 16GB memory and codebase on SSD):

bazel build //tensorflow/tools/pip_package:build_pip_package
#INFO: Elapsed time: 15012.545s, Critical Path: 358.50s
#INFO: 24110 processes: 24110 local.
#INFO: Build completed successfully, 35202 total actions
# 15012sec ~= 4h10m
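
If the build OOMs or starves the rest of the machine, bazel's resource flags can rein it in (the values here are purely illustrative) :

# Throttle the build on a smaller machine
bazel build --jobs=6 --local_ram_resources=HOST_RAM*.5 \
  //tensorflow/tools/pip_package:build_pip_package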

Build the pip whl package itself

This creates the 'wheel' in ./tensorflow_pkg, and then installs it into the env38 virtualenv :

./bazel-bin/tensorflow/tools/pip_package/build_pip_package ./tensorflow_pkg
# Takes ~1 minute, creates a 205MB whl file in ./tensorflow_pkg

pip install -U ./tensorflow_pkg/tensorflow-*.whl
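
A one-liner confirms that pip picked up the freshly built wheel :

# Still inside the env38 virtualenv
python -c "import tensorflow as tf; print(tf.__version__)"
# 2.3.0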

Test the install

Run python within the env38 environment to get a python prompt, and :

# See : https://www.tensorflow.org/beta/guide/using_gpu
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], shape=[2, 3], name='a')
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], shape=[3, 2], name='b')
c = tf.matmul(a, b)

# Eager mode FTW!
print(c)

print(c.device)  # Hope for : /job:localhost/replica:0/task:0/device:GPU:0
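
For a more direct check that the GPU is visible (tf.config.list_physical_devices is standard TF 2.x API), from the shell :

# Expect a single PhysicalDevice entry with device_type='GPU'
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"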

Post-install cool-down

Un-fix the gcc version that made it compatible with cuda (unless you have other code that also needs to be compiled against cuda) :

pushd /usr/local/bin/
rm gcc g++ gcc-gfortran
popd
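
And confirm that the stock Fedora compiler is back in charge :

# Should now resolve to /usr/bin/gcc again
which gcc
gcc --version | head -1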

All Done!