Hot questions about using neural networks with NVIDIA GPUs

Question:

I'm trying to install Tensorflow with CUDA support. Here are my specs:

  • NVIDIA GTX 1070
  • CUDA 7.5
  • cuDNN v5.0

I have installed Tensorflow via the pip installation -- so I'm picturing your answer being to install from source, but I want to make sure there isn't a quick fix.

The error is:

volcart@volcart-Precision-Tower-7910:~$ python
Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 98, in <module>
    from tensorflow.python.platform import test
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/test.py", line 77, in <module>
    import mock                # pylint: disable=g-import-not-at-top,unused-import
  File "/usr/local/lib/python2.7/dist-packages/mock/__init__.py", line 2, in <module>
    import mock.mock as _mock
  File "/usr/local/lib/python2.7/dist-packages/mock/mock.py", line 71, in <module>
    _v = VersionInfo('mock').semantic_version()
  File "/usr/local/lib/python2.7/dist-packages/pbr/version.py", line 460, in semantic_version
    self._semantic = self._get_version_from_pkg_resources()
  File "/usr/local/lib/python2.7/dist-packages/pbr/version.py", line 447, in _get_version_from_pkg_resources
    result_string = packaging.get_version(self.package)
  File "/usr/local/lib/python2.7/dist-packages/pbr/packaging.py", line 725, in get_version
    raise Exception("Versioning for this project requires either an sdist"
Exception: Versioning for this project requires either an sdist tarball, or access to an upstream git repository. Are you sure that git is installed?

I am running the python console from the home directory -- not in the Tensorflow directory.

Git and CUDA are both installed:

volcart@volcart-Precision-Tower-7910:~$ git --version
git version 2.5.0
volcart@volcart-Precision-Tower-7910:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

I verified CUDA is functional via this test (found here):

/usr/local/cuda/bin/cuda-install-samples-7.5.sh ~/cuda-samples
cd ~/cuda-samples/NVIDIA*Samples
make -j $(($(nproc) + 1))

Tensorflow installs successfully:

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl
sudo -H pip install --upgrade $TF_BINARY_URL

My GPU seems to be fine:

volcart@volcart-Precision-Tower-7910:~$ nvidia-smi
Thu Aug  4 17:31:47 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:03:00.0      On |                  N/A |
|  0%   41C    P8    12W / 185W |    499MiB /  8104MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       900    G   /usr/bin/X                                     272MiB |
|    0      1679    G   compiz                                         154MiB |
|    0      2287    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    69MiB |
+-----------------------------------------------------------------------------+

Answer:

It's a bug in pbr. The bug report describes a workaround: export the pbr version as an environment variable:

export PBR_VERSION=X.Y.Z

The pbr version can be obtained with pbr -v.
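
If the pbr command-line tool is not on your PATH, one hedged alternative (assuming setuptools is available, which it is in the traceback above) is to query the installed pbr version from Python and then export that value before launching Python again:

# Sketch: print the installed pbr version so you know what value to export.
import pkg_resources
print(pkg_resources.get_distribution("pbr").version)   # e.g. "1.10.0"

Then run export PBR_VERSION=<that value> in the shell that starts Python and retry import tensorflow.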

Question:

I am using an AWS p3.2xlarge instance with the Deep Learning AMI (DLAMI). This instance has a single Tesla V100 (640 Tensor Cores and 5,120 CUDA Cores). When I run the PyTorch Seq2Seq Jupyter Notebook, I notice that only 25% of the GPU is used. I monitor the GPU usage with the command watch -n 1 nvidia-smi.

My question is: what determines GPU usage? Or, why is the GPU usage not 100%? The reason behind this question is related not only to inefficiency that may result from the code, but also to cost ($3.06/hour). I am wondering if there is anything more I can do to maximize GPU usage.

Of course, this is a deep learning model that is being trained, and the training code sends one sample at a time through the network. I am thinking that mini-batch learning (e.g. sending a couple of samples through before backpropagating) may not be appropriate here. I am also wondering whether the network architecture (the number of layers, their parameters, their input tensor dimensions, etc.) constrains how the GPU is used. For example, if I add more layers or add more hidden nodes, should I expect GPU usage to go up?


Answer:

The power of GPUs over CPUs is that they run many operations at the same time. However, achieving this high level of parallelization is not always easy. Frameworks like TensorFlow or PyTorch do their best to optimise everything for the GPU and for parallelisation, but this is not possible in every case.

Computations in LSTMs, and RNNs in general, can only be parallelized to a very limited degree. The problem lies in their sequential structure: LSTMs and RNNs process only one input at a time, and they need to process everything in chronological order (to compute step n+1 you always need to have computed step n first); otherwise the computation wouldn't make sense.

So the natural way of processing data in RNNs is the complete opposite of parallelization. Mini-batching does help a lot, but it does not solve the fundamental problem of LSTMs.
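
As a hedged illustration of how mini-batching raises utilization, the sketch below feeds 128 placeholder sequences at once through an LSTM in PyTorch (the dataset shapes, batch size and hidden size are made-up values, not taken from the notebook in the question):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10,000 sequences of length 50 with 32 features each (placeholder shapes).
data = TensorDataset(torch.randn(10000, 50, 32),
                     torch.randint(0, 2, (10000,)))
loader = DataLoader(data, batch_size=128, shuffle=True)

lstm = torch.nn.LSTM(input_size=32, hidden_size=256, batch_first=True).cuda()

for x, _ in loader:
    out, _ = lstm(x.cuda())   # one forward pass handles 128 sequences in parallel
    break

The time steps inside each sequence are still processed one after another, but at every step the GPU now works on 128 sequences' worth of matrix operations instead of one.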

If you want a high degree of parallelization, you need to use architectures like the Transformer, proposed in the paper "Attention Is All You Need" by Google.
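
For contrast, here is a minimal sketch of a Transformer encoder (assuming a reasonably recent PyTorch that ships nn.TransformerEncoder; all sizes are illustrative). Every position of the sequence is processed in a single batched pass, with no step-by-step dependency in the forward direction:

import torch

layer = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8)
encoder = torch.nn.TransformerEncoder(layer, num_layers=4).cuda()

x = torch.randn(50, 128, 256, device="cuda")   # (seq_len, batch, d_model)
out = encoder(x)                               # all 50 positions are computed together
print(out.shape)                               # torch.Size([50, 128, 256])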

Summary

The degree of parallelization, and hence the GPU acceleration of your model, depends to a large extent on the architecture of the model itself. With some architectures, like RNNs, parallelization is only possible to a limited degree.

Edit:

For example, if I add more layers or add more hidden nodes, should I expect GPU usage to go up?

When increasing the number of units within a layer, you should expect GPU usage to go up: matrix operations like passing an input through a hidden layer can be parallelized well.

Adding layers is different: there you run into the same problem that makes RNNs slow on the GPU. To compute the next layer you already need the result of the previous one, so the layers have to be computed one after another; it is not possible to compute them all at the same time.

That is the theory. In practice you might see some minor differences in GPU usage, depending on the actual implementation in the framework.
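
A rough, hedged way to see this yourself (placeholder sizes; the exact numbers depend heavily on your hardware) is to time a wide, shallow stack of linear layers against a narrow, deep one that does roughly the same amount of arithmetic, and watch nvidia-smi while each runs:

import time
import torch

def time_forward(width, depth, iters=100):
    # Stack `depth` fully connected layers of size `width` and time the forward passes.
    model = torch.nn.Sequential(*[torch.nn.Linear(width, width) for _ in range(depth)]).cuda()
    x = torch.randn(256, width, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    torch.cuda.synchronize()
    return time.time() - start

# Both configurations perform roughly the same number of multiply-accumulates,
# but the first does them in 2 large matrix multiplies per pass while the
# second does them in 32 small, strictly sequential ones.
print("wide and shallow:", time_forward(width=4096, depth=2))
print("narrow and deep :", time_forward(width=1024, depth=32))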

Question:

So I am running a PyTorch deep learning job on the GPU, but the job is pretty light.

My GPU has 8 GB of memory, but the job only uses about 2 GB. Also, GPU-Util is close to 0%.

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   36C    P2    45W / 210W |   1155MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Based on GPU-Util and memory, I might be able to fit in another 3 jobs.

However, I am not sure if that will affect the overall runtime.

If I run multiple jobs on the same GPU, does that affect the overall runtime?

I think I tried it once, and there was some delay.


Answer:

Yes, you can. One option is to use NVIDIA's Multi-Process Service (MPS) to run four copies of your model on the same card.

This is the best description I have found of how to do it: How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?
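
As a hedged sketch of the "several copies on one card" idea (the model and sizes are placeholders, and whether the kernels genuinely overlap rather than being time-sliced by the driver depends on MPS being enabled), you can launch independent worker processes that all target the same GPU:

import torch
import torch.multiprocessing as mp

def worker(rank):
    # Stand-in for one of your real jobs; the Linear layer is just a placeholder model.
    device = torch.device("cuda:0")
    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(64, 1024, device=device)
    with torch.no_grad():
        for _ in range(1000):
            model(x)
    print("worker %d done" % rank)

if __name__ == "__main__":
    mp.set_start_method("spawn")   # required when child processes use CUDA
    procs = [mp.Process(target=worker, args=(r,)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()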

If you are using your card for inference only, then you can host several models (either copies, or different models) on the same card using NVIDIA's TensorRT Inference Server.