Deep interest network for click-through rate prediction

Feb 22 2019, 4 min read (about 604 words)

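In conventional deep-learning CTR prediction models, a user's interests are usually represented by a fixed vector: whatever the candidate item is, the user's interest vector is the same. This is not reasonable. For a given item, whether the user clicks depends on only part of the user's behavior history. A natural idea is to use attention to make a soft selection over the user's historical interests, and this is what Alibaba's Deep Interest Network (DIN) does.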

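The paper's main contribution is applying an attention mechanism (which the paper calls a local activation unit) to the representation of user interest, with one small difference. In standard attention, the weights are normalized with softmax; DIN drops this normalization. I did not find a comparison showing how this differs in effect from using standard attention directly, though.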
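A minimal NumPy sketch of the idea, under simplifying assumptions: the relevance score here is a plain dot product between each behavior embedding and the candidate embedding, standing in for the small feed-forward network the paper uses as its local activation unit, and all function names are illustrative. It contrasts weighted-sum pooling without normalization against the softmax-normalized version.

```python
import numpy as np

def activation_weights(behaviors, candidate):
    """Relevance of each historical behavior to the candidate item.

    Assumption: a dot product stands in for the paper's learned
    activation unit (a small feed-forward network).
    """
    return behaviors @ candidate  # shape: (num_behaviors,)

def din_pooling(behaviors, candidate):
    """Weighted sum of behavior embeddings WITHOUT softmax normalization."""
    w = activation_weights(behaviors, candidate)
    return w @ behaviors  # shape: (embedding_dim,)

def softmax_pooling(behaviors, candidate):
    """Standard attention: the same scores, normalized to sum to 1."""
    w = activation_weights(behaviors, candidate)
    w = np.exp(w - w.max())
    w = w / w.sum()
    return w @ behaviors

# Three historical behaviors with 2-dimensional embeddings.
behaviors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
candidate_a = np.array([1.0, 0.0])
candidate_b = np.array([0.0, 1.0])

# The user representation now depends on the candidate item.
interest_a = din_pooling(behaviors, candidate_a)
interest_b = din_pooling(behaviors, candidate_b)
```

The same behavior history yields a different interest vector for each candidate. Without normalization, the magnitude of the pooled vector also reflects how strongly the history matches the candidate, which softmax would flatten.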


The paper also proposes two improvements to model training:

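Mini-batch aware L2 regularization. With conventional regularization, every iteration involves updating all parameters, which is too expensive with hundreds of millions of sparse features. The paper restricts regularization to the weights of features that actually appear in the current mini-batch, which effectively alleviates overfitting.
A new activation function, called Dice, that adapts to the data distribution.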
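Both ideas can be sketched in a few lines of NumPy. This is a rough illustration, not the paper's implementation. The Dice formula is written as I understand it from the paper (a sigmoid gate on the batch-standardized input, blending `s` and `alpha * s`), with `alpha` fixed here although it is learned in the paper. The regularization helper is simplified: it applies a plain `lam * w` penalty to the touched rows, whereas the paper, as I recall, also scales by how often each feature occurs. All names are illustrative.

```python
import numpy as np

def dice(s, alpha=0.25, eps=1e-8):
    """Dice activation computed over a mini-batch (axis 0).

    p = sigmoid((s - mean) / sqrt(var + eps))
    output = p * s + (1 - p) * alpha * s
    Assumption: alpha is a fixed constant here; it is learned in the paper.
    """
    mean = s.mean(axis=0)
    var = s.var(axis=0)
    p = 1.0 / (1.0 + np.exp(-(s - mean) / np.sqrt(var + eps)))
    return p * s + (1.0 - p) * alpha * s

def minibatch_l2_grad(embeddings, batch_feature_ids, lam=0.01):
    """L2 penalty gradient restricted to rows used in this mini-batch.

    Simplification: every touched row gets lam * w, and untouched rows
    get zero, so the cost scales with the batch, not the vocabulary.
    """
    grad = np.zeros_like(embeddings)
    rows = np.unique(batch_feature_ids)
    grad[rows] = lam * embeddings[rows]
    return grad

embeddings = np.ones((5, 2))
reg_grad = minibatch_l2_grad(embeddings, np.array([1, 3, 3]))
activated = dice(np.array([[-2.0], [0.0], [2.0]]))
```

Only rows 1 and 3 of the embedding table receive a regularization gradient, which is the point of the mini-batch aware scheme.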


References

Deep Interest Network (DIN) · alibaba/x-deeplearning Wiki · GitHub
Deep Models — deepCTR 1.0.1 documentation
Paper: Zhou, Guorui, et al. Deep interest network for click-through rate prediction. arXiv preprint arXiv:1706.06978 (2017).

AutoEncoders

Jan 29 2019, 2 min read (about 276 words)

Autoencoders are a kind of neural network that tries to reconstruct its input, i.e., the target output is the same as the input. An autoencoder can be divided into an encoder and a decoder, as illustrated below:

Encoder-Decoder architecture

Autoencoders can be used for image or sound compression and for dimensionality reduction. In specific cases they can provide more interesting and efficient data projections than PCA or other dimensionality reduction techniques. Denoising autoencoders can also be used to remove noise from an image, and further for representation learning.
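To make the encoder/decoder split concrete, here is a minimal NumPy sketch. It is not one of the gists referenced in this post: it assumes a single tanh hidden layer, a linear decoder, mean squared error, plain gradient descent, and synthetic data instead of MNIST. Setting `noise_std` above zero corrupts the input while keeping the clean data as the target, which is the denoising variant.

```python
import numpy as np

def train_autoencoder(x, hidden=2, noise_std=0.0, lr=0.1, steps=1000, seed=0):
    """Train a tiny autoencoder and return (w_enc, w_dec, losses)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, hidden))
    w_dec = rng.normal(scale=0.1, size=(hidden, d))
    losses = []
    for _ in range(steps):
        # Optionally corrupt the input (denoising autoencoder).
        x_in = x + noise_std * rng.normal(size=x.shape)
        h = np.tanh(x_in @ w_enc)   # encoder: input -> hidden code
        x_hat = h @ w_dec           # decoder: hidden code -> reconstruction
        err = x_hat - x             # the target is always the clean input
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        grad_out = 2.0 * err / err.size
        grad_w_dec = h.T @ grad_out
        grad_pre = (grad_out @ w_dec.T) * (1.0 - h ** 2)
        grad_w_enc = x_in.T @ grad_pre
        w_enc -= lr * grad_w_enc
        w_dec -= lr * grad_w_dec
    return w_enc, w_dec, losses

rng = np.random.default_rng(0)
# Synthetic data lying in a 2-D subspace of an 8-D space.
data = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8)) / 4.0
_, _, plain_losses = train_autoencoder(data)
_, _, denoise_losses = train_autoencoder(data, noise_std=0.1)
```

Because the data has only two underlying degrees of freedom, a 2-unit hidden code is enough in principle to reconstruct it, which is the dimensionality-reduction use case in miniature.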

The following gist shows how to construct a naive autoencoder that tries to reconstruct the handwritten digits of the ever-present MNIST dataset:

Running results:

Autoencoders

For denoising autoencoders, the input is a randomly corrupted (noisy) image and the output is the corresponding clean image. The following gist should clear up any confusion:

Result:

Denoising Autoencoder

Using convolutional networks for the encoder and decoder should give better results. The gist is here:

Result:

CNN Denoising Autoencoder

Autoencoders can also be used for representation learning, and can encode inputs other than images, even categorical data. In practice, the hidden feature size is then often made larger than the input size to capture more information. A detailed explanation can be found at: http://dkopczyk.quantee.co.uk/dae-part3/

References:
[1] Denoising Autoencoder by Dawid Kopczyk

Atrous Convolution

Jan 21 2019, 2 min read (about 292 words)

Atrous convolution, or dilated convolution, was introduced in the paper Multi-Scale Context Aggregation by Dilated Convolutions. It aims to enlarge the receptive field without increasing the number of parameters, and is mainly used in semantic segmentation. By stacking a series of dilated convolutions with different rates (e.g., dilation rates [1, 3, 5] with a 3x3 kernel), the receptive field can fully cover the original input features.

Standard Conv:

Dilated Conv:

Receptive field size of an n-dilated conv:

A standard discrete conv is just a 1-dilated conv.
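As a quick sanity check on how dilation enlarges the receptive field, here is a small sketch. It assumes stride 1 throughout and uses the standard relation that a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions; the function names are illustrative.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input span covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convs along one axis."""
    field = 1
    for d in dilations:
        # Each layer widens the field by (effective kernel size - 1).
        field += effective_kernel_size(kernel_size, d) - 1
    return field

single = effective_kernel_size(3, 2)                  # 3x3 kernel, rate 2
doubling = stacked_receptive_field(3, [1, 2, 4])      # rates doubling per layer
example = stacked_receptive_field(3, [1, 3, 5])       # the rates quoted above
```

A 3x3 kernel with rate 2 covers a 5x5 window while still holding only 9 weights, and stacking rates [1, 2, 4] reaches a 15x15 field with three 3x3 layers.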

In TensorFlow, we can use either tf.nn.conv2d or tf.nn.atrous_conv2d to perform a dilated convolution. The following code demonstrates both:

import tensorflow as tf
import numpy as np

matrix = np.array([
    [1, 0, 1, 0, 1, 0, 1],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 1, 1, 1, 1, 1, 1],
    [2, 3, 2, 3, 2, 3, 2],
    [1, 3, 1, 3, 1, 3, 1],
    [1, 2, 3, 1, 2, 3, 1],
    [1, 2, 3, 4, 1, 2, 3]
])
kernel = np.array([
    [1, 1, 1],
    [1, 2, 1],
    [1, 1, 1]
])

# Reshape to the layouts TensorFlow expects:
#   input  -> [batch, height, width, in_channels]
#   filter -> [height, width, in_channels, out_channels]
matrix = matrix[np.newaxis, :, :, np.newaxis]
kernel = kernel[:, :, np.newaxis, np.newaxis]
tmatrix = tf.constant(matrix, dtype=tf.float32)
tkernel = tf.constant(kernel, dtype=tf.float32)
# TensorFlow 1.x graph-mode API.
with tf.Session() as sess:
    # The following two calls should produce the same result.
    _ret1 = sess.run(tf.nn.conv2d(tmatrix, tkernel, strides=[1, 1, 1, 1], padding="VALID", dilations=[1, 2, 2, 1]))
    _ret2 = sess.run(tf.nn.atrous_conv2d(tmatrix, tkernel, rate=2, padding="VALID"))

References

[1] https://www.zhihu.com/question/54149221
[2] Rethinking Atrous Convolution for Semantic Image Segmentation
[3] Understanding 2D Dilated Convolution Operation with Examples in Numpy and Tensorflow with Interactive Code

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Jan 20 2019, 7 min read (about 1021 words)

Arxiv: 1704.04861

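MobileNet is a lightweight neural network proposed by a Google team for mobile and embedded devices. In these scenarios, real-time latency requirements mean the model has to run on the device itself. This places high demands on inference speed and model size, while accuracy cannot be sacrificed too much. Previously, the usual approach was to compress an existing network or to train a smaller one directly. MobileNet takes a different route: depthwise separable convolution greatly reduces both the number of parameters and the amount of computation (Mult-Adds).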

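In the paper MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, the authors explain depthwise separable convolution in detail and introduce two hyperparameters for adjusting model size: the width multiplier and the resolution multiplier.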

Depthwise Separable Convolution

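Depthwise separable convolution was first proposed by L. Sifre in Rigid-motion scattering for image classification, and was later used by Google in the Inception and Xception networks. As we know, a standard convolution kernel acts on all channels at once, and can be viewed as a weighted sum of 2D convolutions over all channels. If the kernel size is d_f and the number of input channels is C, a single convolution costs d_f · d_f · C multiply-adds. Depthwise separable convolution splits this into a depthwise conv followed by a 1x1 conv (also called a pointwise conv). In a depthwise conv, each channel is convolved with its own independent kernel. The process is illustrated below: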

Another, somewhat more intuitive, illustration:

Comparison with standard convolution:

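As can be seen, depthwise separable convolution turns the (kernel size × number of channels) product of standard convolution into a sum. Its computational cost relative to standard convolution is: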

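In general the number of output channels is large while the kernel is small (usually no larger than 3), so the cost of depthwise separable convolution is roughly 1/D_K^2 of that of standard convolution, where D_K is the kernel size.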
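The ratio can be checked with a few lines of arithmetic. This sketch uses the cost model from the MobileNet paper as I understand it: a standard conv costs D_K · D_K · M · N · D_F · D_F multiply-adds, and a depthwise separable conv costs D_K · D_K · M · D_F · D_F (depthwise) plus M · N · D_F · D_F (pointwise), where M and N are the input and output channel counts and D_F is the feature map size. The layer dimensions below are an arbitrary example.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard conv layer (stride 1, same-size output)."""
    return dk * dk * m * n * df * df

def separable_conv_cost(dk, m, n, df):
    """Multiply-adds of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise

# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
dk, m, n, df = 3, 512, 512, 14
ratio = separable_conv_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)
# The ratio simplifies to 1/N + 1/D_K^2.
closed_form = 1.0 / n + 1.0 / (dk * dk)
```

With N = 512 the 1/N term is negligible, so the ratio is close to 1/9 for a 3x3 kernel.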

The full MobileNet architecture is as follows:

The vast majority of its parameters are concentrated in the 1x1 convolutions and the fully connected layer:

Width Multiplier and Resolution Multiplier

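These two hyperparameters control the size of the model. The width multiplier controls the number of channels: number of channels = baseline number of channels * width multiplier. The resolution multiplier is applied directly by changing the size of the input image. Their effect on computation and parameter count is as follows: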
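A small sketch of how the two multipliers act on a single depthwise separable layer. It assumes the same cost model as the paper (width multiplier `alpha` scales both channel counts, resolution multiplier `rho` scales the feature map side) and ignores biases and batch-norm parameters; the function name and the example dimensions are illustrative.

```python
def separable_layer_stats(dk, m, n, df, alpha=1.0, rho=1.0):
    """Return (mult_adds, params) of one depthwise separable layer."""
    m_a, n_a = alpha * m, alpha * n   # thinner layer
    df_r = rho * df                   # smaller feature map
    mult_adds = dk * dk * m_a * df_r * df_r + m_a * n_a * df_r * df_r
    params = dk * dk * m_a + m_a * n_a
    return mult_adds, params

base_cost, base_params = separable_layer_stats(3, 512, 512, 14)
thin_cost, thin_params = separable_layer_stats(3, 512, 512, 14, alpha=0.5)
small_cost, small_params = separable_layer_stats(3, 512, 512, 14, rho=0.5)
```

Halving the width cuts both computation and parameters to roughly a quarter. Halving the resolution cuts computation to exactly a quarter but leaves the parameter count unchanged.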

Appendix

A TensorFlow implementation (excerpted from GitHub: https://github.com/Zehaos/MobileNet):

import tensorflow as tf

# TensorFlow 1.x: slim lives in tf.contrib.
slim = tf.contrib.slim

def mobilenet(inputs,
          is_training=True,
          width_multiplier=1,
          scope='MobileNet'):
  def _depthwise_separable_conv(inputs,
                                num_pwc_filters,
                                width_multiplier,
                                sc,
                                downsample=False):
    """ Helper function to build the depth-wise separable convolution layer.
    """
    num_pwc_filters = round(num_pwc_filters * width_multiplier)
    _stride = 2 if downsample else 1

    # skip pointwise by setting num_outputs=None
    depthwise_conv = slim.separable_convolution2d(inputs,
                                                  num_outputs=None,
                                                  stride=_stride,
                                                  depth_multiplier=1,
                                                  kernel_size=[3, 3],
                                                  scope=sc+'/depthwise_conv')

    bn = slim.batch_norm(depthwise_conv, scope=sc+'/dw_batch_norm')
    pointwise_conv = slim.convolution2d(bn,
                                        num_pwc_filters,
                                        kernel_size=[1, 1],
                                        scope=sc+'/pointwise_conv')
    bn = slim.batch_norm(pointwise_conv, scope=sc+'/pw_batch_norm')
    return bn

  with tf.variable_scope(scope) as sc:
    end_points_collection = sc.name + '_end_points'
    with slim.arg_scope([slim.convolution2d, slim.separable_convolution2d],
                        activation_fn=None,
                        outputs_collections=[end_points_collection]):
      with slim.arg_scope([slim.batch_norm],
                          is_training=is_training,
                          activation_fn=tf.nn.relu):
        net = slim.convolution2d(inputs, round(32 * width_multiplier), [3, 3], stride=2, padding='SAME', scope='conv_1')
        net = slim.batch_norm(net, scope='conv_1/batch_norm')
        net = _depthwise_separable_conv(net, 64, width_multiplier, sc='conv_ds_2')
        net = _depthwise_separable_conv(net, 128, width_multiplier, downsample=True, sc='conv_ds_3')
        net = _depthwise_separable_conv(net, 128, width_multiplier, sc='conv_ds_4')
        net = _depthwise_separable_conv(net, 256, width_multiplier, downsample=True, sc='conv_ds_5')
        net = _depthwise_separable_conv(net, 256, width_multiplier, sc='conv_ds_6')
        net = _depthwise_separable_conv(net, 512, width_multiplier, downsample=True, sc='conv_ds_7')

        net = _depthwise_separable_conv(net, 512, width_multiplier, sc='conv_ds_8')
        net = _depthwise_separable_conv(net, 512, width_multiplier, sc='conv_ds_9')
        net = _depthwise_separable_conv(net, 512, width_multiplier, sc='conv_ds_10')
        net = _depthwise_separable_conv(net, 512, width_multiplier, sc='conv_ds_11')
        net = _depthwise_separable_conv(net, 512, width_multiplier, sc='conv_ds_12')

        net = _depthwise_separable_conv(net, 1024, width_multiplier, downsample=True, sc='conv_ds_13')
        net = _depthwise_separable_conv(net, 1024, width_multiplier, sc='conv_ds_14')

    end_points = slim.utils.convert_collection_to_dict(end_points_collection)

  return end_points

NIMA: Neural Image Assessment

Jan 10 2019, 8 min read (about 1191 words)

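Image quality assessment has many practical applications. For example, from the images a user uploads, one may want to pick the more attractive and sharper ones as an album cover or thumbnail, or to feed into a recommender system for other users. Image quality is generally divided into pixel-level technical quality and aesthetic quality. The former depends on factors such as blur, noise, and compression artifacts (in photography, this usually comes down to the equipment and the shooting settings). The latter relates to subjective human emotion and aesthetic perception, and depends on each person's taste, experience, and aesthetic sense. In terms of method, image quality assessment can also be divided into no-reference and full-reference approaches, depending on whether a pristine reference image is available. In real-world scenarios there is usually no image to use as a reference, so no-reference methods are more widely used.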

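After deep learning took off, academia quickly applied deep convolutional networks to image quality assessment. The usual approach is to take a model pretrained on a large-scale classification dataset (typically ImageNet), such as AlexNet or VGG, and fine-tune it to predict the mean quality score of an image.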

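Quality scores are subjective, and different people may give the same image very different scores (that is, the variance is large). Many people cannot appreciate Picasso's paintings, for example, and this cannot be reflected in a mean. Therefore, in the paper NIMA: Neural Image Assessment, researchers at Google Research changed the prediction target to the distribution of an image's quality scores. Specifically, scores are divided from low to high into N (N=10) buckets, and normalizing the vote counts gives the probability that a user's score falls into each bucket. Each bucket can be treated as a class, so image quality prediction becomes a multi-class classification problem. The network architecture is as follows: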

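Because these classes are ordered, misclassifying a 5 as a 6 and misclassifying a 5 as a 10 are errors of very different severity for the classifier, and the usual cross-entropy loss cannot reflect this. In the paper, the authors use the Earth Mover's Distance, defined as follows: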

where CDF is the cumulative distribution function, defined as:

$$
\mathrm{CDF}_p(k) = \sum_{i=1}^{k} p_{s_i}
$$

In Keras, it can be implemented as follows:

from keras import backend as K
def earth_movers_distance(y_true, y_pred):
    cdf_true = K.cumsum(y_true, axis=-1)
    cdf_pred = K.cumsum(y_pred, axis=-1)
    emd = K.sqrt(K.mean(K.square(cdf_true - cdf_pred), axis=-1))
    return K.mean(emd)

In the paper, the authors use the L2 distance, because it is convenient to differentiate.
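To see why this loss respects the ordering of the buckets, here is a small NumPy check. It assumes the r-norm form of the loss, EMD(p, q) = (1/N · Σ_k |CDF_p(k) − CDF_q(k)|^r)^(1/r), with r = 2 to match the Keras snippet above; the helper names and the one-hot test distributions are illustrative.

```python
import numpy as np

def emd(p_true, p_pred, r=2):
    """Earth Mover's Distance between two score distributions (r-norm form)."""
    cdf_true = np.cumsum(p_true, axis=-1)
    cdf_pred = np.cumsum(p_pred, axis=-1)
    return np.mean(np.abs(cdf_true - cdf_pred) ** r, axis=-1) ** (1.0 / r)

def cross_entropy(p_true, p_pred, eps=1e-7):
    """Plain cross-entropy, clipped to avoid log(0)."""
    return -np.sum(p_true * np.log(np.clip(p_pred, eps, 1.0)))

def one_hot(score, n=10):
    """Distribution with all mass on one score bucket (1-based)."""
    p = np.zeros(n)
    p[score - 1] = 1.0
    return p

truth, near, far = one_hot(5), one_hot(6), one_hot(10)
emd_near, emd_far = emd(truth, near), emd(truth, far)
ce_near, ce_far = cross_entropy(truth, near), cross_entropy(truth, far)
```

Cross-entropy gives the same penalty for predicting 6 and for predicting 10, while EMD penalizes the distant bucket more heavily.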

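The authors trained and tested on three image-quality datasets, AVA, TID2013, and LIVE, and compared three popular network architectures: VGG, InceptionV2, and MobileNet. The results on the AVA dataset are shown in the figure: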

Here, LCC stands for linear correlation coefficient, SRCC for Spearman's rank correlation coefficient, and EMD for Earth Mover's Distance.

As can be seen, the accuracy of the Inception-v2 network is on par with the then state of the art, and because NIMA needs only a single forward pass, it is far ahead in speed.

The results on TID2013 are as follows:

There is still a small gap to the best results, but the method's strength is its simplicity and efficiency.

The authors also show predictions (mean score and variance) of the AVA-trained model on images in the landscape category:

As can be seen, the predictions broadly agree with human perception.

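The paper also notes that adjusting tone and contrast can raise an image's aesthetic score. This can guide automated image editing such as automatic enhancement, i.e., searching for the contrast, tone, and other parameters that maximize the aesthetic score. If this process is itself differentiable (for example, if it is implemented as a neural network), it can be solved end to end with an Actor-Critic-like model.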

Implementations:

Implementation by idealo.de: https://github.com/idealo/image-quality-assessment
Blog post: Using Deep Learning to automatically rank millions of hotel images by idealo

Illustration2Vec: A Semantic Vector Representation of Illustrations

Jan 10 2019, 8 min read (about 1246 words)

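Anime has long been part of young people's cultural life. For internet services built around content, anime culture is a good way to capture the interest of young users. Yet in an age of information overload, quickly finding the anime illustrations you like among a vast amount of material is a hard problem. Masaki Saito, from the University of Tokyo in Japan and an anime fan himself, approached the problem academically and proposed some solutions based on currently popular deep learning methods. The paper Illustration2Vec: A Semantic Vector Representation of Illustrations covers the details.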

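In this paper, Saito explores how to compute semantic embeddings of anime images, which makes nearest-neighbor search based on Hamming distance possible. Building on this, the author also presents a tool called semantic morphing: given two anime images as a query, the system returns a sequence of anime images whose content and style change gradually from one query image to the other, as shown below: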

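When it comes to turning images into vectors, a natural idea is to take a model already trained on ImageNet and use the outputs of its second-to-last or third-to-last layer. This is very easy in Keras and takes only a few lines of code. The Keras documentation gives an example: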

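from keras.applications.vgg19 import VGG19, preprocess_input
from keras.models import Model
from keras.preprocessing import image
import numpy as np

base_model = VGG19(weights='imagenet')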
model = Model(inputs=base_model.input, outputs=base_model.get_layer('block4_pool').output)

img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

block4_pool_features = model.predict(x)

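There is of course a better approach: train a classification network on the dataset at hand (in most cases, fine-tuning an ImageNet-pretrained model is enough), then take the final fully connected layer or a feature map as the image embedding.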

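In this paper, the author collected 1,287,596 images from several anime websites. These images come with rich tag information, which falls into four categories: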

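General tags describing generic content attributes, such as weapon and smile.
Copyright tags, such as vocaloid.
Character tags, such as hatsune miku.
Rating tags, such as safe, questionable, and explicit.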

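From each of the first three categories the author picked the 512 most popular tags; together with the 3 rating tags this gives 1,539 tags, on which a multi-label classification model is trained. The feature extraction part of the model reuses VGG16 (the paper was published in 2015, when VGG was all the rage). Since tag prediction needs to pay more attention to local details in the image, the author discarded VGG's fully connected layers and replaced them with a NIN (Network in Network) structure. At the end, a sigmoid is attached to predict a probability for each tag, and cross-entropy is used as the loss. The network architecture and its comparison with VGG16 are shown below: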

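In the evaluation, the author compares the MAP of this architecture, a pretrained network, and the VGG network on each tag category, as shown below.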

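Here, the pretrained network uses VGG's last FC layer for feature extraction and then logistic regression for binary classification of each tag. As can be seen, the VGG + NIN approach gives a very clear improvement in every category.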

The author also shows tag predictions for some specific images:

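To obtain a binary vector representation of an image, the author inserts another sigmoid layer before the final output layer of the multi-label classification network above. This can be seen as squashing the feature map outputs into the range 0 to 1. To get a 0/1 vector, one then simply binarizes the sigmoid layer's values with a threshold of 0.5.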
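The binarization and the Hamming-distance lookup it enables can be sketched in NumPy as follows. The feature values are made-up stand-ins for sigmoid outputs, and the function names are illustrative rather than taken from the illustration2vec code.

```python
import numpy as np

def binarize(features, threshold=0.5):
    """Turn sigmoid activations in [0, 1] into 0/1 codes."""
    return (features > threshold).astype(np.uint8)

def hamming_nearest(query_code, codes, k=3):
    """Indices and distances of the k codes closest to the query."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

# Made-up sigmoid outputs for three images, four units each.
features = np.array([
    [0.9, 0.1, 0.8, 0.2],
    [0.6, 0.4, 0.7, 0.9],
    [0.1, 0.9, 0.2, 0.8],
])
codes = binarize(features)
nearest, distances = hamming_nearest(codes[0], codes, k=2)
```

In a real system the codes would usually be packed into machine words so that the distance becomes an XOR followed by a bit count, which is what makes Hamming search fast.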

Nearest-neighbor retrieval based on Hamming distance gives the following results:

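One last question remains: how is semantic morphing implemented? It is easy to see that, in the vector space, all images lying between the coordinates of the two query images can be regarded as transition images. To speed up the computation, a similarity graph can be built in advance: each image is a node, and for each node the nearest-neighbor algorithm finds its k most similar images, giving k edges. The transition images between any two images are then the nodes on the shortest path between their corresponding nodes.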
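A minimal sketch of the graph construction and path lookup described above. It makes several assumptions that the post does not specify: similarity is Hamming distance on binary codes, edges are treated as undirected, and the shortest path is measured in hops with a breadth-first search rather than with weighted edges. The names are illustrative.

```python
from collections import deque

import numpy as np

def build_knn_graph(codes, k):
    """Connect every image to its k nearest neighbors (undirected edges)."""
    n = len(codes)
    dists = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)
    np.fill_diagonal(dists, codes.shape[1] + 1)  # never pick yourself
    neighbors = np.argsort(dists, axis=1, kind="stable")[:, :k]
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in neighbors[i]:
            graph[i].add(int(j))
            graph[int(j)].add(i)
    return graph

def morph_path(graph, start, goal):
    """Breadth-first search: node indices on a fewest-hops path, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

# Five toy codes that change one bit at a time.
codes = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
])
graph = build_knn_graph(codes, k=1)
path = morph_path(graph, 0, 4)
```

The returned path walks through the intermediate codes in order, which corresponds to the sequence of transition images between the two query images.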

Project page: https://github.com/rezoo/illustration2vec

Policy Gradient Algorithms in Reinforcement Learning

Dec 14 2018, 2 min read (about 322 words)

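In reinforcement learning, there are two families of methods for finding a policy: value-based methods and policy-based methods. Value-based methods include SARSA, Q-Learning, and Deep Q-Learning. When the number of actions is finite, these methods can iteratively arrive at a better deterministic policy. In practice, however, many scenarios have non-discrete states, and sometimes a stochastic policy matters more. In these cases policy-based methods are the better choice.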

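In recent years, as reinforcement learning has kept conquering new domains, policy-based methods have also developed rapidly. From the original REINFORCE to AC, A2C, A3C, Trust Region methods, DDPG, and PPO, the problems of policy algorithms (unstable training, difficulty converging, and getting stuck in local optima) have been addressed step by step.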

REINFORCE

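REINFORCE is an episodic algorithm based on Monte Carlo sampling, proposed by Williams in 1992: sample a trajectory under the current policy, compute the return at each step, and then update the policy parameters by gradient ascent.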
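The three steps can be sketched for a tabular softmax policy as follows. This is a simplified illustration: the policy is a table of logits indexed by state, the trajectory is passed in rather than sampled from an environment, and the extra discount factor on each step's update that appears in some textbook versions is omitted. The function names are illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=0.1):
    """One REINFORCE update from a single sampled trajectory.

    theta has shape (num_states, num_actions) and holds policy logits.
    """
    theta = theta.copy()
    returns = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        # Gradient of log pi(a | s) w.r.t. the logits: one_hot(a) - probs.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += lr * g * grad_log_pi  # gradient ascent on expected return
    return theta

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
new_theta = reinforce_update(np.zeros((1, 2)), states=[0], actions=[1], rewards=[1.0])
```

After one positive-reward step, the logit of the chosen action rises and the other falls, so the action becomes more likely under the updated policy.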

REINFORCE with Baseline

The REINFORCE algorithm with a baseline

A2C

A3C

DDPG

PPO

Learning Groupwise Scoring Functions Using Deep Neural Networks

Dec 14 2018, 1 min read (about 189 words)

Arxiv: https://arxiv.org/abs/1811.04415v1

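Recently, Google open-sourced TF-Ranking, a Learning To Rank framework built on TensorFlow (GitHub). In the paper Learning Groupwise Scoring Functions Using Deep Neural Networks, the authors describe in detail groupwise scoring functions (GSFs), the core algorithm in TF-Ranking.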

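In traditional LTR algorithms, whether pointwise, pairwise, or listwise, the score of a single document does not depend on the other documents in the same list; the relative order between documents is learned only at the level of the loss function. In reality, users usually choose by comparison, and their willingness to click a document often depends on the documents above and below it. GSF is built on this idea.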
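A rough sketch of what scoring documents in groups can look like. This is my reading of the idea rather than the TF-Ranking implementation: a scoring function sees a small group of documents at once and emits one score per member, and a document's final score aggregates its scores over all groups it appears in. The toy scoring function, the use of all ordered groups, and the choice to average are assumptions made for illustration.

```python
import itertools

import numpy as np

def groupwise_scores(features, group_fn, group_size=2):
    """Average each document's score over every ordered group containing it."""
    n = len(features)
    totals = np.zeros(n)
    counts = np.zeros(n)
    for group in itertools.permutations(range(n), group_size):
        idx = list(group)
        totals[idx] += group_fn(features[idx])  # one score per group member
        counts[idx] += 1
    return totals / counts

def toy_group_fn(group):
    """Hypothetical scorer: each document relative to its group's mean."""
    return group[:, 0] - group[:, 0].mean()

docs = np.array([[1.0], [2.0], [3.0]])
scores = groupwise_scores(docs, toy_group_fn)
```

Unlike a pointwise scorer, the score of each document here changes if the other documents in the list change, because it is always computed in comparison with its group.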

如图所示，

NumPy Notes

Dec 4 2018, 2 min read (about 243 words)

np.meshgrid

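I came across this function in an article introducing the momentum algorithm. I was completely lost at first, and later found that this article explains it fairly clearly.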

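meshgrid generates a grid. Looking at a grid from bottom to top, every row has the same x-coordinates [x0, x1, x2, …, x_a], and there are b rows in total. Within each row the y-coordinate is constant, and the y-coordinates of these b rows correspond to the dimension of the second argument. So meshgrid in effect returns the x- and y-coordinates of all points in the grid, with the x-coordinates in the first matrix and the y-coordinates in the second.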
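A small example makes the two returned matrices concrete. The input values are arbitrary, and the function z = x + y is just a placeholder for whatever is being plotted.

```python
import numpy as np

x = np.array([0, 1, 2])   # three x-coordinates
y = np.array([10, 20])    # two y-coordinates, so the grid has two rows

xx, yy = np.meshgrid(x, y)
# xx repeats the x-coordinates in every row:
# [[0 1 2]
#  [0 1 2]]
# yy holds one constant y-coordinate per row:
# [[10 10 10]
#  [20 20 20]]

# Evaluate a function at every grid point in one vectorized expression.
zz = xx + yy
```

Both outputs have shape (len(y), len(x)), which is the layout that contour-plotting functions expect.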

meshgrid is typically used for plotting contour lines and surfaces, as shown below:

Deep & Cross Network for Ad Click Predictions

Dec 1 2018, 2 min read (about 311 words)

Arxiv: 1708.05123

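This paper, published in 2017 as a collaboration between Google and Stanford, mainly contributes a new network architecture for CTR prediction that efficiently learns multi-level cross representations of features.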

The network architecture is shown below:

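The network can be seen as extracting features through a Cross Network and a Deep Network separately, then combining them linearly at the end. The two networks share the same input: discrete features are embedded, concatenated with the normalized continuous features, and fed into both the Cross Network and the Deep Network. The Deep Network is an ordinary feed-forward network. The Cross Network consists of one or more Cross Layers, and a Cross Layer can be viewed as the following function: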

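The authors prove that an l-layer Cross Network can fit feature interactions of any degree up to l+1. (The proof is heavy on notation, and I did not work through it.)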

In addition, DCN can be viewed as an extension of FM; for performance reasons, FM can only model second-order feature interactions.

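In terms of implementation, the Cross Layer is very efficient to compute. From the formula, its computational complexity is O(d), where d is the feature dimension.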
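The O(d) cost follows from the order of evaluation. Writing the cross layer as I recall it from the DCN paper, x_{l+1} = x_0 · (x_l^T w_l) + b_l + x_l, the product x_l^T w_l is a scalar, so the d × d outer product x_0 x_l^T never has to be materialized. The sketch below handles a single example vector and compares the efficient form against the naive one; the function names and the numbers are illustrative.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """Efficient form: compute the scalar xl . w first, then scale x0. O(d)."""
    return x0 * np.dot(xl, w) + b + xl

def cross_layer_naive(x0, xl, w, b):
    """Naive form: build the d x d outer product explicitly. O(d^2)."""
    return np.outer(x0, xl) @ w + b + xl

x0 = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, 0.2, 0.3])
b = np.zeros(3)

x1 = cross_layer(x0, x0, w, b)          # first layer: x_l is x_0 itself
x1_naive = cross_layer_naive(x0, x0, w, b)
x2 = cross_layer(x0, x1, w, b)          # stacking a second layer
```

Each layer adds only 2d parameters (w and b), and the residual term + x_l means a layer can fall back to the identity when the cross term is not useful.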

References