An In-Depth Understanding of StyleGAN and StyleGAN2

Original post: StyleGAN 和 StyleGAN2 的深度理解 - 知乎 (zhihu.com)

Paper: A Style-Based Generator Architecture for Generative Adversarial Networks

Source code: https://github.com/NVlabs/stylegan

Results:

Generated faces:

Generated cars:

Generated bedrooms:

Result video (worth watching closely):

Algorithm overview:

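The "style" in StyleGAN refers to the main attributes of the faces in the dataset, such as pose, rather than to image style in the style-transfer sense. Here style covers coarse attributes such as facial expression, face orientation, and hairstyle, as well as texture-level attributes such as skin tone and lighting. The result video linked above shows how different parameters control different "styles" of a face.

StyleGAN uses styles to control attributes such as pose and identity, and uses noise to control fine details such as hair strands, wrinkles, and skin texture.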

the style have global effects (changing pose, identity, etc.), the noise affects only inconsequential stochastic variation (differently combed hair, beard, etc.).

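The StyleGAN architecture has two parts. The first is the mapping network, the left part of figure (b) below, which turns the latent variable z into an intermediate latent variable w. This w controls the style of the generated image; why z needs to be turned into w at all is discussed in detail later. The second is the synthesis network, which generates the image. Its novelty is that every layer receives A and B: A is a learned affine transform that converts w into a style and controls the style of the generated image, and B applies learned scaling to random noise, which enriches the details of the generated image. In other words, every convolutional layer can adjust the "style" according to its input A. The overall network keeps the structure of PG-GAN (progressive growing GAN). Finally, the paper also provides FFHQ, a high-resolution face dataset.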

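In addition, a traditional GAN feeds a random (latent) variable z directly into the generator, whereas StyleGAN first transforms z into w with a separate mapping network and then feeds w to every layer of the synthesis network. The initial input of the synthesis network therefore becomes a learned constant tensor; see Const 4x4x512 in figure (b) below.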
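The data flow described above can be sketched in NumPy. This is only an illustration: the weights are random placeholders rather than trained values, and the layer width of 512, the leaky ReLU slope of 0.2, the input normalization, and the count of 18 synthesis layers are assumptions chosen to resemble the 1024x1024 configuration, not code taken from the official repository.

```python
import numpy as np

NUM_SYNTHESIS_LAYERS = 18  # assumption: 2 layers per resolution, 4x4 up to 1024x1024

def pixel_norm(z, eps=1e-8):
    # Rescale each latent vector so its mean squared value is about 1.
    return z / np.sqrt(np.mean(z ** 2, axis=1, keepdims=True) + eps)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def init_mapping(rng, dim=512, num_layers=8):
    # Random placeholder weights; a trained model would load learned values.
    scale = np.sqrt(2.0 / dim)
    return [(rng.standard_normal((dim, dim)) * scale, np.zeros(dim))
            for _ in range(num_layers)]

def mapping_network(z, layers):
    # z: (batch, dim) -> w: (batch, dim), through a stack of fully connected layers.
    x = pixel_norm(z)
    for weight, bias in layers:
        x = leaky_relu(x @ weight + bias)
    return x

rng = np.random.default_rng(0)
layers = init_mapping(rng)
z = rng.standard_normal((4, 512))
w = mapping_network(z, layers)

# The same w is handed to every layer of the synthesis network.
w_per_layer = np.repeat(w[:, None, :], NUM_SYNTHESIS_LAYERS, axis=1)

# The synthesis network starts from a constant 4x4x512 tensor, not from z.
# In a real model this constant is a learned parameter; ones are a placeholder.
const_input = np.ones((1, 512, 4, 4))

print(w.shape, w_per_layer.shape, const_input.shape)
```

Keeping one copy of w per layer may look redundant, but it is what later makes per-layer manipulations such as style mixing straightforward.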

Mapping network --- latent code

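The job of the mapping network is to disentangle the latent space. For an explanation of latent space, see:

https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d

Put simply, to classify or generate data well we need a representation of its features. Data has many features, however, and these features are correlated and highly entangled, which makes it hard for a model to work out how they relate and makes learning inefficient. We therefore look for the deeper relationships hidden beneath the surface features and disentangle them; the resulting hidden features are the latent code. The space formed by latent codes is the latent space.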

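The mapping network consists of 8 fully connected layers and maps z to w. A learned affine transform then converts w into a style $y = (y_s, y_b)$, which is applied through AdaIN (adaptive instance normalization):

$$\mathrm{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $x_i$ denotes each feature map, and $\mu(x_i)$ and $\sigma(x_i)$ are its mean and standard deviation.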
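A minimal NumPy sketch of this step, assuming the standard AdaIN form in which each feature map is normalized to zero mean and unit variance and then scaled by $y_s$ and shifted by $y_b$. The helper `style_from_w`, the tensor sizes, and the random affine weights are hypothetical and exist only for illustration.

```python
import numpy as np

def style_from_w(w, weight, bias):
    # Affine transform "A": maps w to one scale and one bias per channel.
    y = w @ weight + bias
    y_s, y_b = np.split(y, 2, axis=1)
    return y_s, y_b

def adain(x, y_s, y_b, eps=1e-8):
    # x: (batch, channels, height, width); y_s, y_b: (batch, channels)
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)          # normalize each feature map separately
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 16))        # stand-in for a convolution output
w = rng.standard_normal((2, 32))               # stand-in for the intermediate latent
a_weight = rng.standard_normal((32, 16)) * 0.1
a_bias = np.concatenate([np.ones(8), np.zeros(8)])  # scales start near 1, biases near 0

y_s, y_b = style_from_w(w, a_weight, a_bias)
out = adain(x, y_s, y_b)
print(out.shape)
```

After AdaIN, the per-channel mean of `out` equals `y_b` and its per-channel standard deviation equals `|y_s|`, so the statistics of every feature map are dictated entirely by the style.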

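As mentioned above, the mapping network turns the latent code z into w. Why? In general z is a random vector sampled from a uniform or Gaussian distribution, but the features of real data are not distributed that way. Take two features, hair length and masculinity, whose combinations are shown in figure (a) below: the missing region in the upper left corresponds to a combination (long hair together with strong masculinity) that is absent from the training set. Sampling the two features directly from a uniform or Gaussian distribution would produce such invalid combinations, so in (b) the mapping from Z to the features is forced to warp so that the missing combination cannot be sampled. Figure (c) shows the distribution of w: the learned mapping f(z) is continuous, its sampling density is non-uniform, and it can undo much of this warping.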

the intermediate latent space W does not have to support sampling according to any fixed distribution; its sampling density is induced by the learned piecewise continuous mapping f(z). This mapping can be adapted to “unwarp” W so that the factors of variation become more linear. As such, we expect the training to yield a less entangled W in an unsupervised setting, i.e., when the factors of variation are not known in advance.

latent space interpolations

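Latent space interpolation was not introduced by StyleGAN, but it appears in many papers. In the chair example below, the chair on the left is wide, the chair on the right is narrow, and the chairs in between are linear combinations of the features of the two.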

interpolation of latent-space vectors may yield surprisingly non-linear changes in the image. For example, features that are absent in either endpoint may appear in the middle of a linear interpolation path. This is a sign that the latent space is entangled and the factors of variation are not properly separated.
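Linear interpolation between two latent codes is simple to write down. The sketch below uses random vectors as stand-ins for the latent codes of the two endpoint images; the names `w_wide` and `w_narrow` are hypothetical, and in practice each point on the path would be fed to a trained generator to render the intermediate image.

```python
import numpy as np

def lerp(a, b, t):
    # Linear interpolation: t = 0 returns a, t = 1 returns b.
    return a + (b - a) * t

rng = np.random.default_rng(0)
w_wide = rng.standard_normal(512)     # stand-in for the latent code of the wide chair
w_narrow = rng.standard_normal(512)   # stand-in for the latent code of the narrow chair

# Five evenly spaced latent codes from one endpoint to the other.
path = np.stack([lerp(w_wide, w_narrow, t) for t in np.linspace(0.0, 1.0, 5)])
print(path.shape)
```

If the latent space is well disentangled, the images rendered along `path` should change gradually; features that show up only in the middle of the path are the sign of entanglement that the quote describes.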

Style mixing

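In my view, style mixing is one of the highlights of StyleGAN. First, what is style mixing? In the figure below, the first row is source B and the first column is source A; each image in source A and source B is generated from its own latent code, and the remaining images combine the styles of source A and source B. The purpose of style mixing is to localize which part of the network controls which style. Concretely, two different latent codes $z_1$ and $z_2$ are fed into the mapping network to obtain $w_1$ and $w_2$, which represent two different styles. A crossover point is then chosen at random in the synthesis network: the layers before the crossover point use $w_1$ and the layers after it use $w_2$. The generated image should then have features of both source A and source B, which is called style mixing.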

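The result of mixing depends on where the crossover point is placed. The figure below has three parts. In the first, coarse styles from source B, the layers at resolutions 4x4 to 8x8 use the styles of B and the rest use the styles of A: high-level aspects such as pose, general hairstyle, face shape, and eyeglasses follow source B, while colors (eyes, hair, lighting) and finer facial features follow source A. In the second, middle styles from source B, the layers at resolutions 16x16 to 32x32 use the styles of B: smaller-scale facial features and hairstyle follow B, while pose, general face shape, and skin tone still follow A. In the third, fine styles from source B, the layers at resolutions 64x64 to 1024x1024 use the styles of B: identity follows A, while the color scheme, including skin tone, follows B. Roughly speaking, low-resolution styles control pose, face shape, accessories such as eyeglasses, and hairstyle, while high-resolution styles control skin tone, hair color, and background color.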
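Both variants of mixing, the random crossover and the copying of a resolution group from source B, reduce to choosing which latent each layer receives. The sketch below assumes 18 style inputs (two per resolution from 4x4 to 1024x1024); the function names and the layer index ranges are illustrative assumptions, not identifiers from the official code.

```python
import numpy as np

NUM_LAYERS = 18  # assumption: 2 style inputs per resolution, 4x4 up to 1024x1024

# Layer index ranges for the three groups discussed above.
COARSE = range(0, 4)    # 4x4 and 8x8
MIDDLE = range(4, 8)    # 16x16 and 32x32
FINE = range(8, 18)     # 64x64 up to 1024x1024

def mix_at_crossover(w1, w2, crossover):
    # Layers before the crossover point use w1, the remaining layers use w2.
    styles = np.tile(w1, (NUM_LAYERS, 1))
    styles[crossover:] = w2
    return styles

def copy_styles_from_b(w_a, w_b, b_layers):
    # Start from source A everywhere, then overwrite the chosen layers with source B.
    styles = np.tile(w_a, (NUM_LAYERS, 1))
    styles[list(b_layers)] = w_b
    return styles

rng = np.random.default_rng(0)
w_a = rng.standard_normal(512)
w_b = rng.standard_normal(512)

crossover = int(rng.integers(1, NUM_LAYERS))    # random crossover point in 1..17
mixed = mix_at_crossover(w_a, w_b, crossover)
coarse_from_b = copy_styles_from_b(w_a, w_b, COARSE)
print(mixed.shape, coarse_from_b.shape)
```

Each row of the returned array is the latent that the corresponding synthesis layer would receive, so swapping `COARSE` for `MIDDLE` or `FINE` reproduces the other two parts of the figure.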

Stochastic variation

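Stochastic variation in the paper is meant to make the fine details of generated faces more random and more natural. These details are mainly hair strands, wrinkles, skin pores, and stubble, as in the figure below.

Stochastic variation is implemented by injecting noise: StyleGAN adds noise after every convolution. The figure below compares the effect of adding noise at different layers of the network.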
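A minimal sketch of the noise input, assuming one single-channel Gaussian noise image per sample that is broadcast to all feature maps with a per-channel scaling factor. The scaling factor would be learned in a real model; the constant 0.1 used here is a placeholder.

```python
import numpy as np

def add_noise(x, noise_strength, rng):
    # x: (batch, channels, height, width); noise_strength: (channels,)
    batch, _, height, width = x.shape
    # One noise image per sample, shared across channels.
    noise = rng.standard_normal((batch, 1, height, width))
    return x + noise_strength[None, :, None, None] * noise

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 8, 16, 16))   # stand-in for a convolution output
strength = np.full(8, 0.1)                       # placeholder for learned per-channel scales

# Same features, different noise draws: the outputs differ only in small details.
out_a = add_noise(features, strength, np.random.default_rng(1))
out_b = add_noise(features, strength, np.random.default_rng(2))
print(np.abs(out_a - out_b).mean())
```

Because the noise is resampled on every forward pass while the style stays fixed, repeated generation with the same latent code varies only in these small details.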

Perceptual path length

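Image generation is essentially learning a mapping from one distribution to a target distribution. In the figure below, the input latent code is z1, that is, the latent code of the white dog is z1; the target image is a black dog, whose latent code is z2. The blue dashed line is the shortest path from z1 to z2, and the green curve is a path we do not want. Intermediate images on the blue path should be combinations of z1 and z2. Assuming the combination is linear (which holds when the features are sufficiently disentangled), the intermediate images generated along the blue path are also dogs, consistent with latent space interpolation. The green curve strays too far from that path, so its intermediate images may be something else entirely, such as the bedroom in the figure, which is not what we want.

As a side note, a trained generative model can be used to recover the latent code of a given image. Suppose we have a StyleGAN model trained on some dataset and want the latent code of an image x in this model. Start from an initial latent code z whose generated image is p, define a loss on the difference between p and x, and iteratively update z to reduce that loss; the final z is the latent code of image x.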

Perceptual Path Length (PPL) is an indicator of whether the image changes smoothly in a “perceptual” sense. It uses the distance between images embedded by a trained model, similar to FID. Roughly speaking, it indicates whether the image changes along the shortest “perceptual” path in the latent space.

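Perceptual path length is a metric for judging whether the generator takes the shortest route (such as the blue dashed line in the figure above). It is measured by the distance between the two images generated at neighboring points on the interpolation path:

$$l_W = \mathbb{E}\left[\frac{1}{\varepsilon^2}\, d\big(g(\mathrm{lerp}(f(z_1), f(z_2); t)),\; g(\mathrm{lerp}(f(z_1), f(z_2); t + \varepsilon))\big)\right]$$

Here g is the generator (the synthesis network), d is the perceptual distance between two images, $f$ is the mapping network, $f(z_1)$ is the intermediate latent code $w$ obtained from the latent code $z_1$, with $w \in W$, $t$ is a position along the interpolation path, $t \in (0, 1)$, $t + \varepsilon$ is the neighboring position, and lerp is linear interpolation, performed here in latent space.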
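A Monte Carlo estimate of this quantity can be sketched as follows, assuming the expectation is taken over Gaussian latents $z_1, z_2$ and uniformly sampled $t$. Everything concrete here is a stand-in: the mapping is the identity, the generators are toy linear maps, and squared L2 distance replaces the perceptual distance that a real evaluation would compute with a pretrained network.

```python
import numpy as np

def lerp(a, b, t):
    return a + (b - a) * t

def squared_l2(a, b):
    # Stand-in for a perceptual distance between two batches of images.
    return np.sum((a - b) ** 2, axis=1)

def perceptual_path_length(f, g, d, rng, latent_dim, num_samples=1000, eps=1e-4):
    z1 = rng.standard_normal((num_samples, latent_dim))
    z2 = rng.standard_normal((num_samples, latent_dim))
    t = rng.uniform(0.0, 1.0, size=(num_samples, 1))
    w1, w2 = f(z1), f(z2)
    img_a = g(lerp(w1, w2, t))          # image at position t on the path
    img_b = g(lerp(w1, w2, t + eps))    # image at the neighboring position t + eps
    return np.mean(d(img_a, img_b) / eps ** 2)

def identity(z):
    return z

def g_small(w):
    return w

def g_large(w):
    return 2.0 * w

ppl_small = perceptual_path_length(identity, g_small, squared_l2,
                                   np.random.default_rng(0), latent_dim=4,
                                   num_samples=20000)
ppl_large = perceptual_path_length(identity, g_large, squared_l2,
                                   np.random.default_rng(0), latent_dim=4,
                                   num_samples=20000)
print(ppl_small, ppl_large)
```

In this toy setup the second generator changes its output twice as fast per unit step in latent space, so its path length comes out four times larger; a lower value means smoother changes along the interpolation path.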

Truncation Trick

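The truncation trick was not proposed by StyleGAN; it had been used for image generation in GANs well before, and interested readers can trace its origins. In terms of the data distribution, low-density regions are poorly represented in the network. Intuitively, low-density data appears rarely and therefore has few chances to influence the gradients, but that does not mean such data is unimportant. One remedy is to raise the overall density by pulling sparsely distributed points closer together (the idea is loosely similar to PCA). The procedure is simple: find the mean point of the data, compute the offset of every other point from this mean, and shrink all offsets by the same factor. This pulls the points together while preserving their relative arrangement, since all pairwise distances shrink by the same factor.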

If we consider the distribution of training data, it is clear that areas of low density are poorly represented and thus likely to be difficult for the generator to learn. This is a significant open problem in all generative modeling techniques.
However, it is known that drawing latent vectors from a truncated or otherwise shrunk sampling space tends to improve average image quality.

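$\bar{w}$ is the center of mass of $W$, and the truncated (shrunk) $w'$ is given by:

$$w' = \bar{w} + \psi (w - \bar{w})$$

$\psi$ is a real number that sets the scaling factor; the figure below shows the effect of truncation on styles.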
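The truncation step is a single line of arithmetic. The sketch below assumes the usual form $w' = \bar{w} + \psi (w - \bar{w})$ and estimates $\bar{w}$ as the mean of many sampled latents; the random vectors are placeholders for outputs of a trained mapping network, and the value 0.7 for $\psi$ is only an example.

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    # Pull w toward the average latent; psi = 1 leaves it unchanged, psi = 0 returns w_avg.
    return w_avg + psi * (w - w_avg)

rng = np.random.default_rng(0)
w_samples = rng.standard_normal((10000, 512)) + 3.0   # placeholder for mapped latents
w_avg = w_samples.mean(axis=0)                        # estimate of the center of W

w = w_samples[:5]
w_trunc = truncate(w, w_avg, psi=0.7)

# Distances to the center shrink by exactly the factor psi.
before = np.linalg.norm(w - w_avg, axis=1)
after = np.linalg.norm(w_trunc - w_avg, axis=1)
print(after / before)
```

Smaller values of $\psi$ trade diversity for average image quality, because every sample moves closer to the well-represented center of the distribution.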

The contribution of each improvement to the generator is shown in the table below:

That covers the details of StyleGAN; next we focus on StyleGAN2.

StyleGAN2

Paper:

Analyzing and Improving the Image Quality of StyleGAN

Code:

https://github.com/NVlabs/stylegan2

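StyleGAN2 exists because StyleGAN has a flaw: some generated images contain conspicuous water-droplet artifacts, and these droplets are also present in the feature maps, as shown below:

With the problem identified, the next step was to locate its cause, which the paper pins down as follows: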

We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other.

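The droplets are caused by the AdaIN operation. AdaIN normalizes each feature map separately, so it can destroy information carried by the magnitudes of the features relative to each other. The explanation may feel somewhat vague, but the decisive evidence is experimental: once the normalization in AdaIN is removed, the droplets disappear, so AdaIN takes the blame.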

With supports from the experimental results, the changes in StyleGAN2 made include:

Remove (simplify) how the constant is processed at the beginning.
The mean is not needed in normalizing the features.
Move the noise module outside the style module.

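The two figures above show how the network architecture changed from StyleGAN to StyleGAN2. Removing the normalization makes the droplets disappear, but style mixing is one of the highlights of StyleGAN, and changing only the architecture in this way removes the droplets at the cost of scale-specific control over style mixing. The paper says: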

In practice, style modulation may amplify certain feature maps by an order of magnitude or more. For style mixing to work, we must explicitly counteract this amplification on a per-sample basis—otherwise the subsequent layers would not be able to operate on the data in a meaningful way. If we were willing to sacrifice scale-specific controls (see video), we could simply remove the normalization, thus removing the artifacts and also improving FID slightly. We will now propose a better alternative that removes the artifacts while retaining full controllability.

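Besides the architectural changes there is weight demodulation, defined as follows:

$$w'_{ijk} = s_i \cdot w_{ijk}, \qquad w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^2 + \epsilon}}$$

Here $w_{ijk}$ are the original convolution weights (not the latent code), $s_i$ is the style scale for the $i$-th input feature map, $j$ and $k$ index the output feature maps and the spatial footprint of the convolution, and $\epsilon$ is a small constant that avoids numerical issues.

The result after the improvement is shown below; the droplets are gone.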
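Weight demodulation can be sketched directly on a convolution kernel, assuming the two-step form in which the style first scales the weights per input channel (modulation) and each output filter is then rescaled to unit L2 norm (demodulation). The kernel shape and the random values are placeholders.

```python
import numpy as np

def modulate_demodulate(weight, style, eps=1e-8):
    # weight: (out_channels, in_channels, kernel_h, kernel_w); style: (in_channels,)
    w_mod = weight * style[None, :, None, None]      # modulate: scale each input channel
    # Demodulate: normalize each output filter over input channels and kernel positions.
    norm = np.sqrt(np.sum(w_mod ** 2, axis=(1, 2, 3), keepdims=True) + eps)
    return w_mod / norm

rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 8, 3, 3))
style = rng.uniform(0.5, 2.0, size=8)

w_demod = modulate_demodulate(weight, style)
filter_norms = np.sqrt(np.sum(w_demod ** 2, axis=(1, 2, 3)))
print(filter_norms)
```

Because the normalization acts on the weights rather than on the activations, it relies on the expected statistics of the feature maps instead of their actual contents, and a style that is ten times larger overall yields the same demodulated kernel.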

StyleGAN2 network architecture:

In the figure, dlatents_out denotes the disentangled latent code, i.e. $w$.

Improvements in StyleGAN2:

Lazy regularization

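The objective consists of a loss function and regularization terms, which are normally optimized together at every step. Lazy regularization evaluates the regularization terms less often, for example only once every 16 minibatches. This reduces the computation with essentially no effect on the results.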
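The scheduling idea can be illustrated on a toy problem. In the sketch below the main loss and the regularizer are simple quadratics, and the regularization gradient is multiplied by the interval so that its average strength stays comparable; that scaling, the learning rate, and the toy losses are assumptions of this sketch rather than details stated in the text above.

```python
import numpy as np

def train(num_steps, reg_interval, lr=0.05, reg_weight=0.1):
    theta = np.array([3.0, -2.0])
    reg_evals = 0
    for step in range(num_steps):
        # Gradient of the main loss ||theta - 1||^2, computed at every step.
        grad = 2.0 * (theta - 1.0)
        if step % reg_interval == 0:
            # Gradient of reg_weight * ||theta||^2, computed only on some steps and
            # scaled by the interval to keep its average strength comparable.
            grad = grad + reg_interval * reg_weight * 2.0 * theta
            reg_evals += 1
        theta = theta - lr * grad
    return theta, reg_evals

theta_eager, evals_eager = train(num_steps=160, reg_interval=1)
theta_lazy, evals_lazy = train(num_steps=160, reg_interval=16)
print(theta_eager, evals_eager)
print(theta_lazy, evals_lazy)
```

The lazy run evaluates the regularizer 16 times less often and still ends close to the eager solution, which is the trade-off lazy regularization relies on.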

Path length regularization

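When generating faces we also want to control their attributes. Different latent codes give different faces, and once a direction of change in latent space is fixed, steps of different size along that direction should correspond to changes of proportional magnitude in one specific aspect of the image. Path length regularization is designed for this purpose. The idea is simple: the change in the image caused by a step in $w$ is measured by projecting a random image onto the generator's Jacobian:

$$\mathbb{E}_{w,\, y \sim \mathcal{N}(0, I)} \left( \left\| J_w^T y \right\|_2 - a \right)^2$$

Here $w$ is the disentangled latent code obtained from the latent code $z$, $y$ is a random image whose pixels are normally distributed, $J_w = \partial g(w) / \partial w$ is the Jacobian matrix of the generator $g$ with respect to $w$ and describes how the image changes with $w$, and $a$ is an exponential moving average of $||J^T_w y||_2$ that is updated dynamically during optimization, so that a suitable global scale is found automatically.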

path length can be used to measure GAN’s performance. Another possible sign of trouble is the path distance varies a lot between different segments along the interpolation path. In short, we prefer the linear interpolated points to have similar image distances between consecutive points. In another word, the same displacement in the latent space should yield the same magnitude change in the image space, regardless of the value of the latent factor. Cost is added when the change in the image space is different from the ideal expected displacement. The change in the image space is computed from the gradient which is available for the backpropagation and the expected displacement is approximated by the running average so far.
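For a toy linear generator the Jacobian is just a matrix, which makes the penalty easy to write out. The sketch below assumes the penalty is the mean squared deviation of $||J_w^T y||_2$ from a running average; the decay of 0.01 and the sample count are arbitrary choices, and a real implementation would obtain $J_w^T y$ through backpropagation instead of forming the Jacobian explicitly.

```python
import numpy as np

def path_length_penalty(jacobian, rng, pl_mean, decay=0.01, num_samples=64):
    # jacobian: (image_dim, latent_dim), the Jacobian J_w of the generator at w.
    y = rng.standard_normal((num_samples, jacobian.shape[0]))   # random "images"
    lengths = np.linalg.norm(y @ jacobian, axis=1)              # ||J_w^T y||_2 per sample
    # Running average of the observed lengths (the target value a).
    new_pl_mean = pl_mean + decay * (lengths.mean() - pl_mean)
    penalty = np.mean((lengths - new_pl_mean) ** 2)
    return penalty, new_pl_mean

jacobian = np.random.default_rng(1).standard_normal((256, 16)) / 16.0
rng = np.random.default_rng(0)

pl_mean = 0.0
for _ in range(5):
    penalty, pl_mean = path_length_penalty(jacobian, rng, pl_mean)
print(penalty, pl_mean)
```

The penalty is small when steps of a fixed size in $w$ change the image by about the same amount in every direction, which is the behavior the quoted passage asks for.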

For the implementation details, the paper is quoted directly:

No Progressive growth

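The progressive growing used by StyleGAN has some drawbacks. As the figure below shows, when the face turns left or right the teeth do not turn with it: details such as teeth and eyeballs stay at fairly fixed positions instead of following the face. This is caused by progressive growing, which trains at a low resolution first, adds the next higher resolution once training has stabilized, and repeats until the target resolution is reached. Each resolution therefore serves as the output resolution for a while, which pushes the network to produce high-frequency details, such as the teeth in the figure below, at the expense of shift invariance. The paper explains: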

We believe the problem is that in progressive growing each resolution serves momentarily as the output resolution, forcing it to generate maximal frequency details, which then leads to the trained network to have excessively high frequencies in the intermediate layers, compromising shift invariance.

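Progressive growing was used because generating high-resolution images needs a large, deep network, and very deep networks are hard to train. Skip connections, however, also make deep networks trainable, which leads to the three architectures in the figure below, all built on skip connections; their results are evaluated experimentally in the figure after that.

The experimental comparison of the three architectures is shown below: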

How to project image to latent code

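Why map an image to a latent code? StyleGAN can do many interesting things, such as style mixing, but mixing the styles of specific real images rather than random generated ones requires the latent codes of those images: a latent code that, when fed into the network, reproduces the given image.

So how do we obtain the latent code of a given image? The paper "Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?" covers this thoroughly and is worth reading in full. There are two ways to map an image to a latent code. (1) Train an encoder whose input is an image and whose output is a latent code. This is fast, but it generalizes poorly to images outside the training dataset. (2) Pick a random latent code, feed it into a pretrained network (such as StyleGAN), and update the latent code iteratively by backpropagating the loss between the generated image and the target image. This generalizes well but is slow, because it needs many iterations.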

In general, there are two existing approaches to embed instances from the image space to the latent space: i) learn an encoder that maps a given image to the latent space (e.g. the Variational Auto-Encoder); ii) select a random initial latent code and optimize it using gradient descent.
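The second approach, optimizing a random initial latent code by gradient descent, can be shown on a toy problem. The sketch below replaces the pretrained generator with a fixed linear map and the image loss with squared error, so the gradient can be written by hand; with a real StyleGAN the gradient would come from backpropagation and the loss would typically include a perceptual term. The function name `project` and all sizes are hypothetical.

```python
import numpy as np

def project(target, generator_matrix, num_steps=500, lr=0.1, seed=0):
    # Toy generator: image = generator_matrix @ z.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(generator_matrix.shape[1])     # random initial latent code
    for _ in range(num_steps):
        residual = generator_matrix @ z - target           # generated image minus target
        grad = 2.0 * generator_matrix.T @ residual         # gradient of ||g(z) - target||^2
        z = z - lr * grad
    return z

rng = np.random.default_rng(42)
generator_matrix = rng.standard_normal((64, 8)) / 8.0      # placeholder "pretrained" generator
z_true = rng.standard_normal(8)
target = generator_matrix @ z_true                         # image whose latent code we want

z_found = project(target, generator_matrix)
print(np.abs(z_found - z_true).max())
```

The loop needs hundreds of generator evaluations for a single image, which illustrates why this approach is slow compared with a single forward pass through an encoder.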