Improving Adversarial Robustness of Ensembles with Diversity Training
Abstract
Deep Neural Networks are vulnerable to adversarial attacks even in settings where the attacker has no direct access to the model being attacked. Such attacks usually rely on the principle of transferability, whereby an attack crafted on a surrogate model tends to transfer to the target model. We show that an ensemble of models with misaligned loss gradients can provide an effective defense against transfer-based attacks. Our key insight is that an adversarial example is less likely to fool multiple models in the ensemble if their loss functions do not increase in a correlated fashion. To this end, we propose Diversity Training, a novel method to train an ensemble of models with uncorrelated loss functions. We show that our method significantly improves the adversarial robustness of ensembles and can also be combined with existing methods to create a stronger defense.
1 Introduction
Despite achieving state-of-the-art classification accuracies on a wide variety of tasks, deep neural networks can be fooled into misclassifying an input that has been adversarially perturbed (Szegedy et al., 2013; Goodfellow et al., 2015). These adversarial perturbations are small enough to go unnoticed by humans but can reliably fool deep neural networks. The existence of adversarial inputs presents a security vulnerability in the deployment of deep neural networks for real-world applications such as self-driving cars, online content moderation and malware detection. It is important to ensure that the models used in these applications are robust to adversarial inputs, as failure to do so can have severe consequences ranging from loss in revenue to loss of lives.
A number of attacks have been proposed which use the gradient information of the model to compute the perturbations that make a benign input adversarial. These attacks require access to the model parameters and are termed white-box attacks. Fortunately, several real-world applications of deep learning do not expose the model parameters to the end user, making it harder for an adversary to attack the model. However, adversarial examples have been shown to transfer across different models (Papernot et al., 2016a), enabling adversarial attacks without knowledge of the model architecture or parameters. Attacks that work under these constraints are termed black-box attacks.
1.1 Transferability
Black-box attacks rely on the principle of transferability. In the absence of access to the target model, the adversary trains a surrogate model and crafts adversarial attacks on this model using white-box attacks. Adversarial examples generated this way can be used to fool the target model with a high probability of success (Liu et al., 2016). Adversarial examples have been known to span a large contiguous subspace of the input (Goodfellow et al., 2015). Furthermore, recent work explaining transferability (Tramèr et al., 2017) has shown that models with a high-dimensional adversarial subspace (AdvSS) are more likely to have intersecting subspaces, causing adversarial examples to transfer between models. Our goal is to find an effective way to reduce the dimensionality of the AdvSS and hence reduce the transferability of adversarial examples. We show that this can be done by using an ensemble of diversely trained models.
1.2 Ensemble as an Effective Defense
An ensemble uses a collection of different models and aggregates their outputs for classification. Adversarial robustness can potentially be improved using an ensemble, as the attack vector must now fool multiple models in the ensemble instead of just a single model. In other words, the attack vector must now lie in the shared AdvSS of the models in the ensemble. We explain this using the Venn diagrams in Figure 1. The area enclosed by the rectangle represents a set of orthogonal perturbations spanning the space of the input and the circle represents the subset of these orthogonal perturbations that are adversarial (i.e. cause the model to misclassify the input). The shaded region represents the AdvSS. For a single model (Figure 1a), any perturbation that lies in the subspace defined by the circle would cause the model to misclassify the input. In contrast, for an ensemble of 3 models, the adversarial input must fool more than one model in the ensemble for the attack to be successful. This implies that the attack vector must lie in the shared AdvSS of the ensemble as shown in Figure 1b.
While prior works have used ensembles of models in various forms to defend against adversarial attacks (Liu et al., 2017; Strauss et al., 2018), we show that the efficacy of ensembles can be improved significantly by explicitly forcing a reduction in the shared AdvSS of the models, as shown in Figure 1c. Reducing this overlap translates to a reduction in the overall dimensionality of the AdvSS of the ensemble. Thus, there are fewer directions of adversarial perturbation that can cause multiple models in the ensemble to misclassify, resulting in reduced transferability and improved adversarial robustness.
1.3 Contributions
We study the use of ensembles in the context of black-box attacks and propose a technique to improve adversarial robustness. Overall, we make the following key contributions:

- We identify that the adversarial robustness of an ensemble can be improved by reducing the dimensionality of the shared adversarial subspace.

- We propose Gradient Alignment Loss (GAL), a metric to measure the shared adversarial subspace of the models in an ensemble.

- We show that GAL can be used as a regularizer to train an ensemble of diverse models with misaligned loss gradients. We call this Diversity Training.

- We show empirically that Diversity Training makes ensembles more robust to transfer-based attacks.
2 Background
In this section we formally define the attack model and describe the various attacks considered in our work.
2.1 Adversarial Examples
A benign input $x$ can be transformed into an adversarial input $x_{adv}$ by adding a carefully crafted perturbation $\delta$:

$x_{adv} = x + \delta$   (1)
For an untargeted attack, the adversary's objective is to cause the model $f$ to misclassify the perturbed input such that $f(x_{adv}) \neq y$, where $y$ is the ground-truth label of $x$. However, this perturbation must not result in a perceivable change to the input for a human observer. Following prior work on adversarial machine learning for image classification (Goodfellow et al., 2015; Madry et al., 2017), we enforce this constraint by restricting the $\ell_\infty$ norm of the perturbation to be below a threshold $\epsilon$, i.e. $\|\delta\|_\infty \leq \epsilon$.
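The $\ell_\infty$ constraint can be enforced by elementwise clipping of the perturbation. A minimal numpy sketch (function and parameter names are ours, for illustration only):

```python
import numpy as np

def project_linf(x_adv, x, eps):
    """Project x_adv onto the L-infinity ball of radius eps around x,
    i.e. clip each coordinate of the perturbation to [-eps, eps]."""
    return x + np.clip(x_adv - x, -eps, eps)
```

Any iterative attack can apply this projection after each step to keep the perturbation imperceptible.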
2.2 BlackBox Attack Model
This work considers the black-box attack model, in which the attacker does not know the parameters of the target model. We assume, however, that the adversary has access to the dataset used for training and knows the architecture of the model being attacked. To attack a target model, the adversary trains a surrogate model on the same dataset. Adversarial examples crafted on the surrogate model using white-box attacks can then be used to attack the target model via the principle of transferability. We briefly describe the attack algorithms considered in our evaluations that can be used for this purpose.
2.3 Attack Algorithms
Given full access to the model parameters, the adversary can craft an adversarial example by considering the loss function. Let $L(\theta, x, y)$ denote the loss function of the model, where $\theta$ represents the model parameters, $x$ is the benign input and $y$ is the label. The attacker's goal is to generate an adversarial example $x_{adv}$ that maximizes the model's loss function, such that $L(\theta, x_{adv}, y) > L(\theta, x, y)$, while adhering to the constraint $\|x_{adv} - x\|_\infty \leq \epsilon$. Several techniques have been proposed in the literature to solve this constrained optimization problem. We discuss the ones used in the evaluation of our defense.
Fast Gradient Sign Method (FGSM): FGSM (Goodfellow et al., 2015) uses a linear approximation of the loss function to find an adversarial perturbation that causes the loss function to increase. Let $\nabla_x L(\theta, x, y)$ denote the gradient of the loss function with respect to the input $x$. The input is modified by adding a perturbation of size $\epsilon$ in the direction of the sign of the gradient vector.

$x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x L(\theta, x, y))$   (2)
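The single-step update in Eqn. (2) can be sketched in a few lines of numpy; here grad_fn is a placeholder standing in for the model's input-gradient computation (names are ours, not from the paper):

```python
import numpy as np

def fgsm(x, y, grad_fn, eps):
    """Eqn. (2): one step of size eps along the sign of the input gradient.
    grad_fn(x, y) returns dL/dx for the attacked model (placeholder)."""
    return x + eps * np.sign(grad_fn(x, y))
```

With a real model, grad_fn would be supplied by the framework's automatic differentiation.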
Several variants of FGSM have been proposed in recent literature with the goal of improving the effectiveness of the perturbation in increasing the loss function.
Random Step-FGSM (R-FGSM): This method takes a single random step followed by the application of FGSM. (Tramèr et al., 2017) hypothesized that the loss function tends to be non-smooth near data points, so taking a random step before applying FGSM should improve the quality of the gradient.
Iterative FGSM (I-FGSM): Instead of taking just one step in the direction of the gradient, (Kurakin et al., 2016) proposed taking $k$ smaller steps of size $\alpha = \epsilon / k$, with the gradient recomputed after each step.

$x_{adv}^{0} = x$   (3)

$x_{adv}^{t+1} = x_{adv}^{t} + \alpha \cdot \text{sign}(\nabla_x L(\theta, x_{adv}^{t}, y))$   (4)
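The iteration of Eqns. (3)-(4) can be sketched as follows, again with grad_fn as a placeholder for the model's input gradient and with the iterate projected back into the $\epsilon$-ball after each step (the projection is our addition for safety; names are ours):

```python
import numpy as np

def ifgsm(x, y, grad_fn, eps, k):
    """Eqns. (3)-(4): k sign-gradient steps of size eps/k, recomputing
    the gradient after every step and staying inside the eps-ball."""
    x_adv = x.copy()          # Eqn. (3): start from the clean input
    alpha = eps / k
    for _ in range(k):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y))      # Eqn. (4)
        x_adv = x + np.clip(x_adv - x, -eps, eps)               # projection
    return x_adv
```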
Momentum Iterative FGSM (MI-FGSM): (Dong et al., 2017) observed that Iterative FGSM can be improved to avoid poor local minima by considering the momentum of the gradient. They propose MI-FGSM, which uses an exponential moving average of the gradient (i.e. momentum) to compute the direction of the perturbation in each iteration.

$g_{t+1} = \mu \cdot g_t + \dfrac{\nabla_x L(\theta, x_{adv}^{t}, y)}{\|\nabla_x L(\theta, x_{adv}^{t}, y)\|_1}$   (5)

$\mu$ is the decay factor used to compute the moving average of the gradients, and $x_{adv}^{t+1}$ is computed as shown in Eqn. (4), with $\text{sign}(g_{t+1})$ in place of the sign of the gradient. This attack won first place in the NIPS 2017 adversarial attack competition.
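The momentum accumulation can be sketched by extending the iterative loop above; grad_fn is again a placeholder for the model's input gradient (names are ours):

```python
import numpy as np

def mifgsm(x, y, grad_fn, eps, k, mu):
    """Eqn. (5): accumulate an L1-normalized moving average g of the
    gradient with decay mu, then step along sign(g) as in Eqn. (4)."""
    x_adv, g, alpha = x.copy(), np.zeros_like(x), eps / k
    for _ in range(k):
        grad = grad_fn(x_adv, y)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)   # Eqn. (5)
        x_adv = x + np.clip(x_adv + alpha * np.sign(g) - x, -eps, eps)
    return x_adv
```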
PGD-CW: This is a variant of the I-FGSM attack that uses the hinge loss function suggested by (Carlini & Wagner, 2016). Similar to (Madry et al., 2017), we use Projected Gradient Descent (PGD) to maximize the loss function, which ensures that the attacked image lies within the $\epsilon$-ball around the natural image.
3 Diversity Training
3.1 Approach
The goal of our work is to improve the robustness of the model to black-box attacks by reducing the transferability of adversarial examples. This can be achieved by reducing the dimensionality of the AdvSS of the target model using an ensemble of diverse models. Our approach to training an ensemble of diverse models is as follows:

- Find a way to measure the dimensionality of the AdvSS of the ensemble.

- Use this measure as a regularization term to train an ensemble of diverse models.
The rest of this section describes the two parts of our solution in greater detail.
3.2 Measuring Adversarial Subspaces
Adversarial examples have been known to span a contiguous subspace of the input. Several methods, such as Gradient Aligned Adversarial Subspace (Tramèr et al., 2017) and Local Intrinsic Dimensionality (Ma et al., 2018), have been proposed in recent literature to measure the dimensionality of this subspace. Unfortunately, these methods cannot be used in the cost function during training, as computing them is either expensive or involves the use of non-differentiable functions. For our purposes, we want a computationally inexpensive way of measuring the dimensionality of AdvSS using a differentiable function. This would allow us to run backpropagation through the function and use it as a regularization term during training.
3.2.1 AdvSS of an Ensemble
Our proposal is to use an ensemble of models and thus we are interested in measuring the AdvSS of the ensemble instead of a single model. For an input to be adversarial, it has to fool multiple models in the ensemble, requiring the example to lie in the shared AdvSS of multiple models as shown in Figure 1b. Thus, the overall dimensionality of the AdvSS of an ensemble is proportional to the amount of overlap in the AdvSS of the individual models. We propose a novel method to measure this overlap by considering the alignment of the loss gradient vectors of the models.
3.2.2 Gradient Alignment
We first describe how the overlap of AdvSS can be measured between two models and then show that our idea generalizes to an ensemble of models. Consider two models $F_a$ and $F_b$ with loss functions $L_a$ and $L_b$. Let $g_a$ and $g_b$ denote the gradients of the two loss functions with respect to the input $x$.

The gradient describes the direction in which the input has to be perturbed to maximally increase the loss function (locally around $x$). If the gradients of the two models are aligned, their loss functions increase in a correlated fashion, which means that a perturbation that causes $L_a$ to increase would likely also cause $L_b$ to increase. This indicates that the two models have similar adversarial directions and hence a large shared AdvSS, as shown in Figure 2a. In other words, adversarial examples that fool $F_a$ are also likely to fool $F_b$. Conversely, if the gradients of the two models are misaligned, the perturbations that cause $L_a$ to increase do not cause $L_b$ to increase. Thus $F_a$ and $F_b$ are unlikely to be fooled by the same perturbations, implying a reduction in the dimensionality of the shared AdvSS of the two models, as illustrated in Figure 2b.
Thus, alignment of gradients can be used as a proxy to measure the amount of overlap in the AdvSS of two models. A straightforward way to measure the alignment of gradients is to compute the cosine similarity (CS) between them.
$CS(g_a, g_b) = \dfrac{\langle g_a, g_b \rangle}{\|g_a\|_2 \, \|g_b\|_2}$   (6)
Cosine similarity takes values in the range $[-1, 1]$. For two models, we would ideally like the cosine similarity to be $-1$, so that the gradients are completely misaligned and there is no overlap in the adversarial subspaces of the two models. For an ensemble of $N$ models, we want the set of $N$ gradient vectors $G = \{g_1, \ldots, g_N\}$ to be maximally misaligned. One way of measuring the amount of alignment for a set of vectors is their coherence value (Tropp, 2006). Coherence measures the maximum cosine similarity between unique pairs of vectors in the set. We define coherence as shown in Eqn. 7.
$\text{coherence}(G) = \max_{1 \le a < b \le N} CS(g_a, g_b)$   (7)
Coherence can be computed by taking the pairwise cosine similarities between the vectors in $G$ and taking the $\max$ over all the cosine similarity terms. Since $\max$ is a non-smooth function, minimizing Eqn. 7 using first-order methods like gradient descent would be slow. The rate of convergence can be improved by using a smooth approximation of this function (Nesterov, 2005). We replace the $\max$ in Eqn. 7 with its smooth LogSumExp approximation, as shown in Eqn. 8.
$GAL = \log \left( \sum_{1 \le a < b \le N} \exp\big(CS(g_a, g_b)\big) \right)$   (8)
We call this term the Gradient Alignment Loss (GAL). GAL can be used to approximate coherence and hence provides a way to measure the degree of overlap in the AdvSS of the models in the ensemble.
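The two quantities can be sketched directly from Eqns. 6-8. We assume the standard LogSumExp smooth max here, and the helper names are ours; note that LogSumExp upper-bounds the max, so GAL never underestimates the coherence:

```python
import numpy as np

def cosine_sim(a, b):
    """Eqn. (6): cosine similarity between two gradient vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pairwise_sims(grads):
    """Cosine similarity for every unique pair of gradients in the set."""
    return [cosine_sim(grads[i], grads[j])
            for i in range(len(grads)) for j in range(i + 1, len(grads))]

def coherence(grads):
    """Eqn. (7): maximum pairwise cosine similarity (non-smooth)."""
    return max(pairwise_sims(grads))

def gal(grads):
    """Eqn. (8): smooth LogSumExp surrogate for the coherence."""
    return np.log(np.sum(np.exp(pairwise_sims(grads))))
```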
3.3 Diversity Training
If the models in an ensemble have a low GAL value for an input $x$, it becomes harder to generate an adversarial example that fools multiple models in the ensemble. We call such a collection of models with a low GAL value a diverse ensemble. In order to encourage ensembles to have low GAL, we propose using it as a regularization term in the cost function during training. We term this training procedure Diversity Training (DivTrain). Eqn. 9 shows the modified loss function.
$Loss = \dfrac{1}{N} \sum_{i=1}^{N} CE\big(f_i(x), y\big) + \lambda \cdot GAL$   (9)
The first term is the average cross-entropy (CE) loss of the models in the ensemble and the second term is the GAL. $\lambda$ is a hyperparameter that controls the importance given to GAL during training, i.e. a large value of $\lambda$ improves adversarial robustness at the cost of clean accuracy. DivTrain lowers the dimensionality of the AdvSS of the ensemble by reducing the overlap in the AdvSS of the individual models. This reduces the transferability of adversarial examples and improves the adversarial robustness of the target model.
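A numpy sketch of the combined objective in Eqn. 9 (names are ours; in an actual training loop this term must be differentiated through the member gradients, which deep learning frameworks support via double backpropagation):

```python
import numpy as np

def divtrain_loss(model_losses, model_grads, lam):
    """Eqn. (9): mean cross-entropy across ensemble members plus
    lam * GAL, where GAL is the LogSumExp of pairwise cosine
    similarities of the members' input gradients."""
    g = [v / (np.linalg.norm(v) + 1e-12) for v in model_grads]
    sims = [g[i] @ g[j]
            for i in range(len(g)) for j in range(i + 1, len(g))]
    gal_term = np.log(np.sum(np.exp(sims)))   # smooth coherence penalty
    return np.mean(model_losses) + lam * gal_term
```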
3.4 Problem of Sparse Gradients
The selection of the activation function is an important design choice for the effective training of the network with GAL regularization. Using the ReLU non-linearity in our networks, for example, causes the loss-gradient vector to have a large number of zero values. This is because the derivative of the ReLU activation is zero in the saturating regime of the input (input < 0), as shown in Figure 3a. Since computing the GAL involves taking inner products of gradient terms, we end up with a large number of zero-valued product terms. This poses a problem for backpropagation, since gradients do not flow through zero-valued products, causing the gradients to vanish and preventing the network from being trained. Our solution is to instead use an activation function that does not suffer from this problem, such as Leaky ReLU. Leaky ReLU (Figure 3b) has the desirable property of a non-zero derivative regardless of the value of the input. This reduces the sparsity of the loss-gradient vector and improves the effectiveness of GAL regularization.
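The sparsity argument can be seen directly from the two derivatives; a small sketch (the 0.01 slope is a common default, not a value from the paper):

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU: exactly zero for every negative pre-activation,
    which sparsifies the loss gradient and stalls GAL backpropagation."""
    return (z > 0).astype(float)

def leaky_relu_grad(z, slope=0.01):
    """Derivative of Leaky ReLU: never exactly zero, so gradients keep
    flowing through the GAL inner products."""
    return np.where(z > 0, 1.0, slope)
```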
4 Experiments
We conduct experiments using the MNIST and CIFAR10 datasets to validate our claims of improved robustness to transfer-based black-box attacks with DivTrain. We start by describing our experimental setup in Section 4.1. Experimental results, including evaluations of combining DivTrain with an existing state-of-the-art black-box defense, are presented in Section 4.2. We compare the distributions of coherence values between diverse and baseline ensembles in Section 4.3. Finally, we provide results of the Gradient Aligned Adversarial Subspace analysis in Section 4.4 and show that DivTrain reduces the dimensionality of the adversarial subspace of the ensemble.
4.1 Setup
Our experiments evaluate the robustness of a target model $T$ to black-box attacks. The attack is carried out by conducting white-box attacks on a surrogate model $S$ and checking the accuracy of the adversarial examples generated in this way on $T$. We assume that the adversary has knowledge of the network architecture and the dataset used to train $T$, but not its model parameters. Hence, in our evaluations, we use models trained separately on the same dataset and with the same network architecture for $T$ and $S$. We evaluate the effectiveness of DivTrain by comparing the adversarial robustness of a baseline ensemble ($T_B$), trained with CE loss and without any regularization, against a diverse ensemble ($T_D$) trained with GAL regularization. An ensemble of 5 models is used for both the target and the surrogate model. We found that using an ensemble as a surrogate produced more transferable adversarial examples than a single model, in line with the observations made by (Dong et al., 2017; Liu et al., 2016).
Table 1: Structure of the models used in our experiments.

Model     | Structure
Conv3     | C32 - C64 - M - C128 - M - FC1024 - FC10
Conv4     | C32 - C64 - C128 - M - C128 - M - FC1024 - FC10
ResNet20  | C16 - 3x{RES16 - RES32 - RES64} - FC10
ResNet26  | C16 - 4x{RES16 - RES32 - RES64} - FC10
Table 2: Classification accuracy (%) of the target ensembles on clean examples and under black-box attacks. Each attack entry reports accuracy for the two perturbation sizes (smaller / larger). For each model, the four rows correspond to the baseline ensemble ($T_B$), DivTrain ($T_D$), EnsAdvTrain ($T_E$) and the combined defense ($T_C$).

Model              | Target | Clean | FGSM        | R-FGSM      | I-FGSM      | MI-FGSM     | PGD-CW
Conv3 (MNIST)      | T_B    | 99.4  | 91.4 / 9.7  | 92.0 / 9.7  | 86.1 / 0.7  | 85.7 / 2.6  | 92.3 / 9.7
                   | T_D    | 99.2  | 97.1 / 34.3 | 97.8 / 30.6 | 97.6 / 20.4 | 96.9 / 16.9 | 97.1 / 35.9
                   | T_E    | 99.4  | 98.9 / 61.3 | 99.0 / 42.5 | 99.0 / 56.3 | 98.8 / 45.9 | 98.8 / 44.8
                   | T_C    | 99.3  | 98.9 / 73.7 | 99.0 / 79.3 | 99.0 / 87.0 | 98.8 / 61.4 | 98.2 / 71.3
Conv4 (CIFAR10)    | T_B    | 85.1  | 14.1 / 7.8  | 16.8 / 3.2  | 9.5 / 2.8   | 9.0 / 7.4   | 8.8 / 5.9
                   | T_D    | 82.4  | 45.3 / 14.7 | 56.0 / 15.1 | 51.4 / 5.6  | 35.0 / 7.5  | 43.9 / 11.6
                   | T_E    | 82.9  | 64.6 / 43.2 | 70.5 / 54.9 | 69.4 / 54.3 | 59.9 / 38.6 | 62.1 / 42.8
                   | T_C    | 80.5  | 68.5 / 54.2 | 72.0 / 66.7 | 72.4 / 66.3 | 66.9 / 55.4 | 66.9 / 54.3
ResNet20 (CIFAR10) | T_B    | 88.9  | 28.8 / 13.1 | 25.7 / 7.1  | 8.6 / 3.2   | 10.2 / 6.3  | 18.7 / 10.2
                   | T_D    | 84.0  | 58.4 / 32.4 | 64.3 / 23.9 | 67.7 / 44.2 | 50.0 / 11.7 | 53.2 / 25.4
                   | T_E    | 87.9  | 70.9 / 44.3 | 77.2 / 50.9 | 79.6 / 65.5 | 66.5 / 30.5 | 65.9 / 37.8
                   | T_C    | 84.7  | 74.9 / 50.7 | 78.1 / 57.6 | 79.7 / 71.5 | 74.1 / 47.4 | 71.3 / 46.3
Mix (CIFAR10)      | T_B    | 89.7  | 27.9 / 9.6  | 30.9 / 6.1  | 13.9 / 3.1  | 10.6 / 5.9  | 26.0 / 7.6
                   | T_D    | 88.2  | 55.8 / 23.2 | 65.1 / 25.0 | 61.8 / 19.7 | 42.4 / 7.1  | 55.9 / 22.2
                   | T_E    | 87.4  | 72.6 / 49.6 | 76.9 / 58.4 | 77.1 / 61.4 | 66.9 / 27.9 | 70.0 / 47.1
                   | T_C    | 86.4  | 73.4 / 52.2 | 77.9 / 64.1 | 77.2 / 67.5 | 69.2 / 39.7 | 71.8 / 50.4
Network Configuration: Table 1 lists the structure of the models used in our experiments. We use neural networks consisting of Convolutional (C), Max-pooling (M) and Fully Connected (FC) layers. The Conv networks consist of convolutional layers interleaved with max-pooling layers. The ResNet structure is similar to the one described in (He et al., 2016), consisting of Residual blocks (RES) with skip connections after every two layers. Leaky ReLU is used as the non-linearity after each convolutional layer in all the networks.
The models are trained using the Adam optimizer (Kingma & Ba, 2014). From the training dataset, we dynamically generate an augmented dataset using random shifts and crops for MNIST, and random shifts, flips and crops for CIFAR10. In addition, we generate a noisy dataset by adding perturbations drawn from a truncated normal distribution to the augmented images, using different standard deviations for MNIST and CIFAR10. The combined dataset of augmented and noisy images is used to train diverse ensembles. We care about reducing the coherence of the loss gradients primarily around natural images; adding Gaussian noise to the training images allows us to sample from a distribution that covers this region of interest in the input space.
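The noise-augmentation step can be sketched as follows. The truncation bounds and function names are our assumptions for illustration (we approximate the truncated normal by clipping a normal sample to two standard deviations; the paper's exact bounds are not stated here):

```python
import numpy as np

def noisy_copy(images, sigma, seed=None, lo=0.0, hi=1.0):
    """Return a noisy copy of a batch of images: add truncated-normal
    pixel noise (normal sample clipped to +/- 2*sigma) and keep the
    result in the valid pixel range [lo, hi]."""
    rng = np.random.default_rng(seed)
    noise = np.clip(rng.normal(0.0, sigma, images.shape),
                    -2 * sigma, 2 * sigma)
    return np.clip(images + noise, lo, hi)
```

The clean, augmented and noisy batches are then concatenated to form the training set for the diverse ensemble.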
Conv3 is trained for 10 epochs on the MNIST dataset. Conv4 and ResNet20 are trained for 20 and 40 epochs, respectively, on CIFAR10. In addition, we also evaluate our defense with an ensemble consisting of a mixture of different model architectures. The Mix ensemble is made up of {ResNet26, ResNet20, Conv4, Conv3, Conv3} and is trained on CIFAR10. The regularization weight $\lambda$ for DivTrain (Eqn. 9) is chosen empirically to offer a good trade-off between clean accuracy and adversarial robustness.
Attack Configuration: We evaluate the accuracy of the target models on clean examples as well as on adversarial examples crafted from black-box attacks, using the attacks described in Section 2.3 on the surrogate $S$. Since $S$ is an ensemble of 5 models, we use the average CE loss of all the models in $S$ as the objective function to be maximized by the attacks. We briefly describe the parameters for each attack: FGSM takes a single step of magnitude $\epsilon$ in the direction of the gradient. R-FGSM applies FGSM after taking a single random step sampled from a uniform distribution. For I-FGSM and MI-FGSM, we take multiple smaller steps, with the gradient recomputed after each step and a fixed decay factor for MI-FGSM. We evaluate PGD-CW with multiple iterations of optimization using the hinge loss function from (Carlini & Wagner, 2016). Results are reported for two different perturbation sizes per dataset for all the attacks. We use the Cleverhans library's (Papernot et al., 2018) implementation of the attacks to evaluate our defense.
4.2 Results
Our results comparing the baseline ensemble ($T_B$) and the diverse ensemble ($T_D$) are shown in Table 2. $T_D$ has significantly higher classification accuracy on adversarial examples compared to $T_B$ for all the attacks considered, showing that DivTrain improves the adversarial robustness of ensembles against black-box attacks. The classification accuracy on clean examples drops slightly as a consequence of adding the regularization term (GAL) to the cost function.
Several defenses have been proposed in recent literature to protect against black-box attacks. Since DivTrain is a defense that is generally applicable to any ensemble of models, it can be combined with existing proposals to create a stronger defense. We show that this is possible by evaluating a combination of our method with Ensemble Adversarial Training (Tramèr et al., 2017), a state-of-the-art black-box defense.
Combined Defense: Ensemble Adversarial Training (EnsAdvTrain) improves adversarial robustness by augmenting the training dataset with adversarial examples generated from a static pre-trained model. We use a pre-trained model with the same architecture as the target model. By attacking the pre-trained model with FGSM, we generate an adversarial dataset. The target models are trained on the combined dataset consisting of both clean and adversarial examples. The perturbation sizes used to generate the adversarial examples are drawn randomly from a truncated normal distribution to ensure adversarial robustness against various perturbation sizes, as suggested by (Kurakin et al., 2016).
We denote the ensemble trained with EnsAdvTrain as $T_E$ and the ensemble trained with the combination of EnsAdvTrain and DivTrain as $T_C$. Results showing the classification accuracy of $T_E$ and $T_C$ under various attacks are provided in Table 2. The combined defense offers higher classification accuracy under attack than either of the two defenses ($T_D$ or $T_E$) used alone. This shows that DivTrain can be combined with existing methods such as EnsAdvTrain to create a stronger defense.
4.3 Distribution of Coherence
Diversity Training encourages models to have uncorrelated loss functions by reducing the coherence of their gradient vectors. Figure 4 compares the distributions of coherence values (see Eqn. 7) for the different target models used in our evaluations. The histograms show that $T_D$ and $T_C$ have lower coherence values than $T_B$ and $T_E$. Thus, our proposed GAL regularization is an effective way of training models with misaligned gradient vectors, which can be used to create ensembles with improved adversarial robustness to black-box attacks.
4.4 Gradient Aligned Adversarial Subspace
We provide further evidence that DivTrain lowers the dimensionality of the AdvSS of the ensemble using the Gradient Aligned Adversarial Subspace (GAAS) analysis (Tramèr et al., 2017). GAAS estimates the dimensionality of the AdvSS by aligning a set of orthogonal vectors with the gradient of the loss function in order to find a maximal set of orthogonal adversarial perturbations. The orthogonal vectors are constructed by multiplying the row vectors of a Regular Hadamard matrix componentwise with $\text{sign}(g)$, where $g$ is the loss gradient. For a Regular Hadamard matrix of order $k$, this yields a set of $k$ orthogonal vectors aligned with the gradient. We run the GAAS analysis on the Conv4 model. Figure 5 compares the probability of finding successful orthogonal directions for $T_B$ and $T_D$, with the analysis repeated for several perturbation sizes $\epsilon$. Our results show that $T_D$ has fewer orthogonal adversarial directions than $T_B$, demonstrating that DivTrain can effectively lower the dimensionality of the AdvSS of ensembles.
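The construction of gradient-aligned orthogonal directions can be sketched as follows. Note the paper uses a Regular Hadamard matrix; the Sylvester construction below is a stand-in for illustration (it requires $k$ to be a power of two and to divide the input size evenly), and the function name is ours:

```python
import numpy as np

def gaas_directions(g, k, eps):
    """Build k mutually orthogonal perturbations of magnitude eps aligned
    with the gradient g: take the rows of a Hadamard matrix of order k
    (Sylvester construction here, as a stand-in for the Regular Hadamard
    matrix used in the paper) and multiply them componentwise with sign(g)."""
    h = np.array([[1.0]])
    while h.shape[0] < k:
        h = np.block([[h, h], [h, -h]])   # Sylvester doubling
    reps = len(g) // k
    rows = np.repeat(h, reps, axis=1)     # stretch rows to the input size
    return eps * rows * np.sign(g)        # k orthogonal perturbations
```

Each returned row is then tested as a candidate perturbation; the number of rows that successfully fool the model estimates the dimensionality of the AdvSS.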
5 Related Work
The susceptibility of deep neural networks to adversarial inputs has sparked considerable research interest in making deep learning models robust to adversarial attacks. As a result, a number of methods have been proposed to defend against white-box attacks (Papernot et al., 2015; Goodfellow et al., 2015; Na et al., 2017; Buckman et al., 2018; Guo et al., 2017; Ma et al., 2018; Dhillon et al., 2018; Xie et al., 2018; Song et al., 2017; Samangouei et al., 2018). However, a majority of these defenses have been shown to be ineffective against adaptive attacks that are tailor-made to work well against specific defenses. A recent work (Athalye et al., 2018) showed that a number of these defenses rely on some form of gradient masking (Papernot et al., 2016b), whereby the defense makes gradient information unavailable to the attacker, making it harder to craft adversarial examples. It also proposes techniques to tackle the problem of obfuscated gradients, which break defenses that use gradient masking.
Prior works have considered the use of ensembles to improve adversarial robustness. (Strauss et al., 2018) use ensembles with the intuition that adversarial examples that lead to misclassification in one model may not fool other models in the ensemble. (Liu et al., 2017) inject noise into the layers of the neural network to prevent gradient-based attacks and ensemble predictions over random noise. While both of these works benefit from the adversarial robustness offered by ensembles, as discussed in Section 1.2, the contribution of our work is to further improve this robustness by explicitly encouraging the models in the ensemble to have uncorrelated loss functions.
The idea of using the cosine similarity of gradients to measure the correlation between loss functions has been explored in (Du et al., 2018), in the context of adapting auxiliary tasks to improve data efficiency. In contrast, we develop a cosine-similarity-based metric that measures the alignment among a set of gradient vectors, with the objective of measuring the overlap of adversarial subspaces between the models in an ensemble.
6 Discussion
Our paper explores the use of diverse ensembles with uncorrelated loss functions to better defend against transfer-based attacks. We briefly discuss other problem settings where our idea can potentially be used.
Adversarial Attack Detection: The objective here is to detect inputs that have been adversarially tampered with and flag such inputs. A recent work (Bagnall et al., 2017) uses ensembles to detect adversarial inputs by training them to have high disagreement for inputs outside the training distribution. Since GAL minimizes the coherence of gradient vectors, it can potentially be used for the same purpose. One possible approach would be to use a modified version of DivTrain that uses GAL as the cost function (without the cross-entropy loss) for examples outside the training distribution, so that the consensus among the members of the ensemble would be low for out-of-distribution data.
Better Black-Box Attacks: Ensembles can also be used to generate better black-box attacks. (Dong et al., 2017; Liu et al., 2016) use an ensemble of models as the surrogate, with the intuition that adversarial examples that fool multiple models tend to be more transferable. It would be interesting to study the transferability of adversarial examples generated on diverse ensembles to see if they can enable better black-box attacks. We leave the evaluation of both of these ideas for future work.
7 Conclusion
Transfer-based attacks present an important challenge for the secure deployment of deep neural networks in real-world applications. We explore the use of ensembles to defend against this class of attacks, with the intuition that it is harder to fool multiple models in an ensemble if they have uncorrelated loss functions. We propose a novel regularization term called Gradient Alignment Loss that helps us train models with uncorrelated loss functions by minimizing the coherence of their gradient vectors. We show that this can be used to train a diverse ensemble with improved adversarial robustness by reducing the amount of overlap in the shared adversarial subspace of the models. Furthermore, our proposal can be combined with existing methods to create a stronger defense. We believe that reducing the dimensionality of the adversarial subspace is important to creating a strong defense, and Diversity Training is a technique that can help achieve this goal.
Acknowledgements
We thank our colleagues from the Memory Systems Lab for their feedback. This work was supported by a gift from Intel. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.
References
 Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018. URL http://arxiv.org/abs/1802.00420.
 Bagnall et al. (2017) Bagnall, A., Bunescu, R. C., and Stewart, G. Training ensembles to detect adversarial examples. CoRR, abs/1712.04006, 2017. URL http://arxiv.org/abs/1712.04006.
 Buckman et al. (2018) Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S18SuCW.
 Carlini & Wagner (2016) Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.
 Dhillon et al. (2018) Dhillon, G. S., Azizzadenesheli, K., Lipton, Z. C., Bernstein, J., Kossaifi, J., Khanna, A., and Anandkumar, A. Stochastic activation pruning for robust adversarial defense. CoRR, abs/1803.01442, 2018. URL http://arxiv.org/abs/1803.01442.
 Dong et al. (2017) Dong, Y., Liao, F., Pang, T., Hu, X., and Zhu, J. Discovering adversarial examples with momentum. CoRR, abs/1710.06081, 2017. URL http://arxiv.org/abs/1710.06081.
 Du et al. (2018) Du, Y., Czarnecki, W. M., Jayakumar, S. M., Pascanu, R., and Lakshminarayanan, B. Adapting Auxiliary Losses Using Gradient Similarity. arXiv e-prints, art. arXiv:1812.02224, December 2018.
 Goodfellow et al. (2015) Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
 Guo et al. (2017) Guo, C., Rana, M., Cissé, M., and van der Maaten, L. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017. URL http://arxiv.org/abs/1711.00117.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. CoRR, abs/1611.01236, 2016. URL http://arxiv.org/abs/1611.01236.
 Liu et al. (2017) Liu, X., Cheng, M., Zhang, H., and Hsieh, C. Towards robust neural networks via random self-ensemble. CoRR, abs/1712.00673, 2017. URL http://arxiv.org/abs/1712.00673.
 Liu et al. (2016) Liu, Y., Chen, X., Liu, C., and Song, D. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016. URL http://arxiv.org/abs/1611.02770.
 Ma et al. (2018) Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S. N. R., Houle, M. E., Schoenebeck, G., Song, D., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. CoRR, abs/1801.02613, 2018. URL http://arxiv.org/abs/1801.02613.
 Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017. URL http://arxiv.org/abs/1706.06083.
 Na et al. (2017) Na, T., Ko, J. H., and Mukhopadhyay, S. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv e-prints, art. arXiv:1708.02582, August 2017.
 Nesterov (2005) Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, May 2005. ISSN 1436-4646. doi: 10.1007/s10107-004-0552-5. URL https://doi.org/10.1007/s10107-004-0552-5.
 Papernot et al. (2015) Papernot, N., McDaniel, P. D., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. CoRR, abs/1511.04508, 2015. URL http://arxiv.org/abs/1511.04508.
 Papernot et al. (2016a) Papernot, N., McDaniel, P. D., and Goodfellow, I. J. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016a. URL http://arxiv.org/abs/1605.07277.
 Papernot et al. (2016b) Papernot, N., McDaniel, P. D., Goodfellow, I. J., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against deep learning systems using adversarial examples. CoRR, abs/1602.02697, 2016b. URL http://arxiv.org/abs/1602.02697.
 Papernot et al. (2018) Papernot, N., Faghri, F., Carlini, N., Goodfellow, I., Feinman, R., Kurakin, A., Xie, C., Sharma, Y., Brown, T., Roy, A., Matyasko, A., Behzadan, V., Hambardzumyan, K., Zhang, Z., Juang, Y.L., Li, Z., Sheatsley, R., Garg, A., Uesato, J., Gierke, W., Dong, Y., Berthelot, D., Hendricks, P., Rauber, J., and Long, R. Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768, 2018.
 Samangouei et al. (2018) Samangouei, P., Kabkab, M., and Chellappa, R. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. CoRR, abs/1805.06605, 2018. URL http://arxiv.org/abs/1805.06605.
 Song et al. (2017) Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. CoRR, abs/1710.10766, 2017. URL http://arxiv.org/abs/1710.10766.
 Strauss et al. (2018) Strauss, T., Hanselmann, M., Junginger, A., and Ulmer, H. Ensemble methods as a defense to adversarial perturbations against deep neural networks, 2018. URL https://openreview.net/forum?id=rkA1f3NpZ.
 Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013. URL http://arxiv.org/abs/1312.6199.
 Tramèr et al. (2017a) Tramèr, F., Kurakin, A., Papernot, N., Boneh, D., and McDaniel, P. D. Ensemble adversarial training: Attacks and defenses. CoRR, abs/1705.07204, 2017a. URL http://arxiv.org/abs/1705.07204.
 Tramèr et al. (2017b) Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. The space of transferable adversarial examples. arXiv, 2017b. URL https://arxiv.org/abs/1704.03453.
 Tropp (2006) Tropp, J. A. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans. Information Theory, 52(3):1030–1051, 2006. doi: 10.1109/TIT.2005.864420. URL https://doi.org/10.1109/TIT.2005.864420.
 Xie et al. (2018) Xie, C., Wang, J., Zhang, Z., Ren, Z., and Yuille, A. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk9yuql0Z.