Hot questions for Using Neural networks in adversarial machines

Top 10 Python Open Source / Neural networks / adversarial machines


I am reading an article that explains how to trick neural networks into predicting any image you want. I am using the mnist dataset.

The article provides a relatively detailed walk through but the person who wrote it is using Caffe.

Anyways, my first step was to create a logistic regression function using TensorFlow that is trained on the mnist dataset. So, if I were to restore the logistic regression model I can use it to predict any image. For example, I feed the number 7 to the following model...

with tf.Session() as sess:  
    saver.restore(sess, "/tmp/model.ckpt")
    # number 7
    x_in = np.expand_dims(mnist.test.images[0], axis=0)
    classification =, 1), feed_dict={x:x_in})


This prints out the number [7] which is correct.

Now the article explains that in order to break a neural network we need to calculate the gradient of the neural network. This is the derivative of the neural network.

The article states that to calculate the gradient, we first need to pick an intended outcome to move towards, and set the output probability list to be 0 everywhere, and 1 for the intended outcome. Backpropagation is an algorithm for calculating the gradient.

Then there's code provided in Caffe as to how to calculate the gradient...

def compute_gradient(image, intended_outcome):
    # Put the image into the network and make the prediction
    # Get an empty set of probabilities
    probs = np.zeros_like(net.blobs['prob'].data)
    # Set the probability for our intended outcome to 1
    probs[0][intended_outcome] = 1
    # Do backpropagation to calculate the gradient for that outcome
    # and the image we put in
    gradient = net.backward(prob=probs)
    return gradient['data'].copy()

Now, my issue is, I'm having a hard time understanding how this function is able to get the gradient just by feeding just the image and the probabilities to the function. Because I do not fully understand this code, I am having a hard time translating this logic to TensorFlow.

I think I am confused as to how the Caffe framework works because I've never seen/used it before. If someone could explain how this logic works step-by-step that would be great.

I already know the basics of Backpropagation so you may assume I already know how it works.

Here is a link to the article itself...


I'm going to show you how to do the basics of generating an adversarial image in TF, to apply that to an already learned model you might need some adaptations.

The code blocks work well as blocks in a Jupyter notebook if you want to try this out interactively. If you don't use a notebook, you'll need to add calls for the plots to show and remove the matplotlib inline statement. The code is basically the simple MNIST tutorial from the TF documentation, I'll point out the important differences.

First block is just setup, nothing special ...

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# if you're not using jupyter notebooks then comment this out
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

Get MNIST data (it is down from time to time so you might need to download it from manually and put it into that directory). We're not using one hot encoding like in the tutorial because by now TF has nicer functions to calculate the loss that don't need the one hot encoding anymore.

mnist = input_data.read_data_sets('/tmp/tensorflow/mnist/input_data')

In the next block we are doing something "special". The input image tensor is defined as a variable because later we want to optimize with regard to the input image. Usually you would have a placeholder here. It does limit us a bit here because we need a definite shape so we only feed in one example at a time. Not something you want to do in production, but for teaching purposes it's fine (and you can get around it with a little more code). Labels are placeholders like normal.

input_images = tf.get_variable("input_image", shape=[1,784], dtype=tf.float32)
input_labels = tf.placeholder(shape=[1], name='input_label', dtype=tf.int32)

Our model is a standard logistic regression model like in the tutorial. We only use the softmax for visualization of results, the loss function takes plain logits.

W = tf.get_variable("weights", shape=[784, 10], dtype=tf.float32, initializer=tf.random_normal_initializer())
b = tf.get_variable("biases", shape=[1, 10], dtype=tf.float32, initializer=tf.zeros_initializer())

logits = tf.matmul(input_images, W) + b
softmax = tf.nn.softmax(logits)

The loss is standard cross entropy. What's to note in the training step is that there is an explicit list of variables passed in - we have defined the input image as a training variable but we don't want to try optimizing the image while training the logistic regression, just weights and biases - so we explicitly state that.

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=input_labels,name='xentropy')

mean_loss = tf.reduce_mean(loss)

train_step = tf.train.AdamOptimizer(learning_rate=0.1).minimize(mean_loss, var_list=[W,b])

Start the session ...

sess = tf.Session()

Training is slower than it should be because of batch size 1. Like I said, not something you want to do in production, but this is just for teaching the basics ...

for step in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(1)
    loss_v, _ =[mean_loss, train_step], feed_dict={input_images: batch_xs, input_labels: batch_ys})

At this point we should have a model that is good enough to demonstrate how to generate an adversarial image. First, we get an image that has label '2' because these are easy so even our suboptimal classifier should get them right (if it doesn't, run this cell again ;) this step is random so I can't guarantee that it'll work).

We're setting our input image variable to that example.

sample_label = -1
while sample_label != 2:
    sample_image, sample_label = mnist.test.next_batch(1)
plt.imshow(sample_image.reshape(28, 28),cmap='gray')

# assign image to var, sample_image)); # now using the variable as input, no feed dict

# should show something like
# array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)
# With the third entry being the highest by far.

Now we are going to "break" the classification. We want to change the image to make it look more like another number, in the eyes of the network, without changing the network itself. To do that, the code looks basically identical to what we had before. We define a "fake" label, the same loss as before (cross entropy) and we get an optimizer to minimize the fake loss, but this time with a var_list consisting of only the input image - so we won't change the logistic regression weights:

fake_label = tf.placeholder(tf.int32, shape=[1])
fake_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=fake_label)
adversarial_step = tf.train.GradientDescentOptimizer(learning_rate=1e-3).minimize(fake_loss, var_list=[input_images])

The next block is intended to be run interactively multiple times, while you see the image and the scores changing (here moving towards a label of 8):, feed_dict={fake_label:np.array([8])})

The first time you run this block, the scores will probably still heavily point towards 2, but it will change over time and after a couple runs you should see something like the following image - note that the image still looks like a 2 with some noise in the background, but the score for "2" is at around 3% while the score for "8" is at over 96%.

Note that we never actually computed the gradient explicitly - we don't need to, the TF optimizer takes care of computing gradients and applying updates to the variables. If you want to get the gradient, you can do so by using tf.gradients(fake_loss, input_images).

The same pattern works for more complicated models, but what you'll want to do is to train your model as normal - using placeholders with bigger batches, or using a pipeline with TF readers, and when you want to do the adversarial image you'd recreate the network with the input image variable as an input. As long as all the variable names remain the same (which they should if you use the same functions to build the network) you can restore using your network checkpoint, and then apply the steps from this post to get to an adversarial image. You might need to play around with learning rates and such.


I've been experimenting with adversarial images and I read up on the fast gradient sign method from the following link

The instructions explain that the necessary gradient can be calculated using backpropagation...

I've been successful at generating adversarial images but I have failed at attempting to extract the gradient necessary to create an adversarial image. I will demonstrate what I mean.

Let us assume that I have already trained my algorithm using logistic regression. I restore the model and I extract the number I wish to change into a adversarial image. In this case it is the number 2...

# construct model
logits = tf.matmul(x, W) + b
pred = tf.nn.softmax(logits)
# assign the images of number 2 to the variable, labels_of_2))
# setup softmax

# placeholder for target label
fake_label = tf.placeholder(tf.int32, shape=[1])
# setup the fake loss
fake_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,labels=fake_label)

# minimize fake loss using gradient descent,
# calculating the derivatives of the weight of the fake image will give the direction of weights necessary to change the prediction
adversarial_step = tf.train.GradientDescentOptimizer(learning_rate=FLAGS.learning_rate).minimize(fake_loss, var_list=[x])

# continue calculating the derivative until the prediction changes for all 10 images
for i in range(FLAGS.training_epochs):
    # fake label tells the training algorithm to use the weights calculated for number 6, feed_dict={fake_label:np.array([6])})

This is my approach, and it works perfectly. It takes my image of number 2 and changes it only slightly so that when I run the following...

x_in = np.expand_dims(x[0], axis=0)
classification =, 1))

it will predict the number 2 as a number 6.

The issue is, I need to extract the gradient necessary to trick the neural network into thinking number 2 is 6. I need to use this gradient to create the nematode mentioned above.

I am not sure how can I extract the gradient value. I tried looking at tf.gradients but I was unable to figure out how to produce an adversarial image using this function. I implemented the following after the fake_loss variable above...

tf.gradients(fake_loss, x)

for i in range(FLAGS.training_epochs):
    # calculate gradient with weight of number 6
    gradient_value =, feed_dict={fake_label:np.array([6])})
    # update the image of number 2
    gradient_update = x+0.007*gradient_value[0], gradient_update))

Unfortunately the prediction did not change in the way I wanted, and moreover this logic resulted in a rather blurry image.

I would appreciate an explanation as to what I need to do in order calculate and extract the gradient that will trick the neural network, so that if I were to take this gradient and apply it to my image as a nematode, it will result in a different prediction.


Why not let the Tensorflow optimizer add the gradients to your image? You can still evaluate the nematode to get the resulting gradients that were added.

I created a bit of sample code to demonstrate this with a panda image. It uses the VGG16 neural network to transform your own panda image into a "goldfish" image. Every 100 iterations it saves the image as PDF so you can print it losslessly to check if your image is still a goldfish.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipyd
from libs import vgg16 # Download here!

pandaimage = plt.imread('panda.jpg')
pandaimage = vgg16.preprocess(pandaimage)

img_4d = np.array([pandaimage])

g = tf.get_default_graph()
input_placeholder = tf.Variable(img_4d,trainable=False)
to_add_image = tf.Variable(tf.random_normal([224,224,3], mean=0.0, stddev=0.1, dtype=tf.float32))
combined_images_not_clamped = input_placeholder+to_add_image

filledmax = tf.fill(tf.shape(combined_images_not_clamped), 1.0)
filledmin = tf.fill(tf.shape(combined_images_not_clamped), 0.0)
greater_than_one = tf.greater(combined_images_not_clamped, filledmax)

combined_images_with_max = tf.where(greater_than_one, filledmax, combined_images_not_clamped)
lower_than_zero =tf.less(combined_images_with_max, filledmin)
combined_images = tf.where(lower_than_zero, filledmin, combined_images_with_max)

net = vgg16.get_vgg_model()
tf.import_graph_def(net['graph_def'], name='vgg')
names = [ for op in g.get_operations()]

style_layer = 'prob:0'
the_prediction = tf.import_graph_def(
    input_map={'images:0': combined_images},return_elements=[style_layer])

goldfish_expected_np = np.zeros(1000)
goldfish_expected_tf = tf.Variable(goldfish_expected_np,dtype=tf.float32,trainable=False)
loss = tf.reduce_sum(tf.square(the_prediction[0]-goldfish_expected_tf))
optimizer = tf.train.AdamOptimizer().minimize(loss)

sess = tf.InteractiveSession()

def show_many_images(*images):
    fig = plt.figure()
    for i in range(len(images)):
        subplot_number = 100+10*len(images)+(i+1)

for i in range(1000):
    _, loss_val =[optimizer,loss])

    if i%100==1:
        print("Loss at iteration %d: %f" % (i,loss_val))
        _, loss_val,adversarial_image,pred,nematode =[optimizer,loss,combined_images,the_prediction,to_add_image])
        res = np.squeeze(pred)
        average = np.mean(res, 0)
        res = res / np.sum(average)
        print([(res[idx], net['labels'][idx]) for idx in res.argsort()[-5:][::-1]])
        plt.imsave('adversarial_goldfish.pdf',adversarial_image[0],format='pdf') # save for printing

Let me know if this helps you!


I am reading about adversarial images and breaking neural networks. I am trying to work through the article step-by-step but do to my inexperience I am having a hard time trying to understand the following instructions.

At the moment, I have a logistic regression model for the MNIST data set. If you give an image, it will predict the number that it most likely is...

saver.restore(sess, "/tmp/model.ckpt")
# image of number 7
x_in = np.expand_dims(mnist.test.images[0], axis=0)
classification =, 1), feed_dict={x:x_in})

Now, the article states that in order to break this image, the first thing we need to do is get the gradient of the neural network. In other words, this will tell me the direction needed to make the image look more like a number 2 or 3, even though it is a 7.

The article states that this is relatively simple to do using back propagation. So you may define a function...

compute_gradient(image, intended_label)

...and this basically tells us what kind of shape the neural network is looking for at that point.

This may seem easy to implement to those more experienced but the logic evades me.

From the parameters of the function compute_gradient, I can see that you feed it an image and an array of labels where the value of the intended label is set to 1.

But I do not see how this is supposed to return the shape of the neural network.

Anyways, I want to understand how I should implement this back propagation algorithm to return the gradient of the neural network. If the answer is not very straightforward, I would like some step-by-step instructions as to how I may get my back propagation to work as the article suggests it should.

In other words, I do not need someone to just give me some code that I can copy but I want to understand how I may implement it as well.


Back propagation involves calculating the error in the network's output (the cost function) as a function of the inputs and the parameters of the network, then computing the partial derivative of the cost function with respect to each parameter. It's too complicated to explain in detail here, but this chapter from a free online book explains back propagation in its usual application as the process for training deep neural networks.

Generating images that fool a neural network simply involves extending this process one step further, beyond the input layer, to the image itself. Instead of adjusting the weights in the network slightly to reduce the error, we adjust the pixel values slightly to increase the error, or to reduce the error for the wrong class.

There's an easy (though computationally intensive) way to approximate the gradient with a technique from Calc 101: for a small enough e, df/dx is approximately (f(x + e) - f(x)) / e.

Similarly, to calculate the gradient with respect to an image with this technique, calculate how much the loss/cost changes after adding a small change to a single pixel, save that value as the approximate partial derivative with respect to that pixel, and repeat for each pixel.

Then the gradient with respect to the image is approximately:

    (cost(x1+e, x2, ... xn) - cost(x1, x2, ... xn)) / e,
    (cost(x1, x2+e, ... xn) - cost(x1, x2, ... xn)) / e,
    (cost(x1, x2, ... xn+e) - cost(x1, x2, ... xn)) / e