Continual Learning | FlowerMouse Blog

Overview

Continual Learning is also known as Life-long Learning (LLL) or Incremental Learning.

catastrophic forgetting

Why does it matter?

multi-task training

computation and storage costs

multi-task training can be considered as the upper bound of LLL

问题设定

In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance degradation of the old tasks.

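Continual learning mainly addresses the catastrophic forgetting of old knowledge that occurs while learning new knowledge.

This made me think about how continual learning differs from transfer learning and online learning. Online learning and continual learning both train on new knowledge, but online learning allows forgetting so that it can adapt to the new distribution as quickly as possible, whereas continual learning must avoid forgetting old knowledge. Likewise, transfer learning only cares about performing well on the new task and does not care whether performance on the old task is preserved.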

Hung-Yi Lee's course

Domain Adaptation

Domain shift

The input distributions differ

source domain

target domain

Labeled

fine-tune

feature-extractor

Domain Adversarial Training

Similar to a GAN

The feature extractor has to fool the domain classifier while still achieving a good classification loss
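
This two-sided objective is often written as a pair of coupled minimisations; a sketch in the usual domain-adversarial form follows. The symbols are assumptions introduced here, not taken from the notes: $\theta_f$, $\theta_y$, $\theta_d$ are the parameters of the feature extractor, the label predictor, and the domain classifier; $L_y$ is the classification loss on labeled source data; $L_d$ is the domain classification loss; $\lambda$ is a trade-off weight.

```latex
\begin{aligned}
\theta_d^{*} &= \arg\min_{\theta_d} \; L_d(\theta_f, \theta_d) \\
(\theta_f^{*}, \theta_y^{*}) &= \arg\min_{\theta_f, \theta_y} \; L_y(\theta_f, \theta_y) - \lambda \, L_d(\theta_f, \theta_d)
\end{aligned}
```

The minus sign is what makes the feature extractor adversarial: it is rewarded when the domain classifier does badly, much like a GAN generator is rewarded when the discriminator is fooled.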

Universal domain adaptation

The label sets of the source domain and the target domain differ

Domain generalization

Metrics

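In the metrics below, the accuracy $a_{k,j}$ always refers to a test on task $j$ after the $k$-th task has been learned, i.e. row $k$ of the accuracy matrix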

Overall Performance

average accuracy (AA)

average incremental accuracy (AIA)

Memory Stability

backward transfer (BWT)

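"The effect that learning other tasks afterwards has on this task"

Almost always negative; it measures the difference between the accuracy right after a task is learned and the accuracy after other tasks have been learned

Higher is better (positive = later learning helped, negative = forgetting)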

forgetting measure (FM)

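"The amount of knowledge forgotten"; lower is better

The comparison is not against the accuracy right after the task was learned, but against the best accuracy reached so far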
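
The overall-performance and memory-stability metrics above can be sketched from an accuracy matrix. This is a minimal illustration that assumes the commonly used definitions; the names `acc`, `average_accuracy`, `backward_transfer`, and `forgetting_measure` are hypothetical, and `acc[k-1][j-1]` is assumed to hold the test accuracy on task j after learning task k.

```python
def average_accuracy(acc, k):
    """AA_k: mean accuracy over tasks 1..k, measured after learning task k."""
    return sum(acc[k - 1][j] for j in range(k)) / k


def backward_transfer(acc, k):
    """BWT_k: mean of (accuracy now - accuracy right after learning) over
    the k-1 old tasks. Requires k >= 2."""
    return sum(acc[k - 1][j] - acc[j][j] for j in range(k - 1)) / (k - 1)


def forgetting_measure(acc, k):
    """FM_k: mean of (best accuracy in rows 1..k-1 - accuracy now) over
    the k-1 old tasks. Requires k >= 2."""
    total = 0.0
    for j in range(k - 1):
        best_so_far = max(acc[i][j] for i in range(k - 1))
        total += best_so_far - acc[k - 1][j]
    return total / (k - 1)


# Hypothetical accuracy matrix for 3 tasks (made-up numbers).
# Row k holds the test accuracies after learning task k.
acc = [
    [0.90, 0.10, 0.10],
    [0.80, 0.85, 0.10],
    [0.70, 0.75, 0.80],
]

print(average_accuracy(acc, 3))
print(backward_transfer(acc, 3))
print(forgetting_measure(acc, 3))
```

With these made-up numbers AA is about 0.75, BWT about -0.15, and FM about 0.15. BWT and FM mirror each other here only because each task peaked right after it was learned; they diverge when a later task temporarily improves an old one.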

Learning Plasticity

forward transfer (FWT)

How much learning old tasks "helps" with learning new tasks; higher is better (the ability to reuse knowledge)

intransience measure (IM)

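The reference accuracy here is the test accuracy on task k of a model trained on the joint distribution (joint dataset) of the m tasks

How "stubborn" the model is when learning new knowledge; lower is better (plasticity for learning new knowledge)

IM is typically greater than 0, since joint training usually acts as an upper bound.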
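
For reference, the two plasticity metrics are commonly written as below. The notation is an assumption introduced here: $a_{k,j}$ is the test accuracy on task $j$ after learning task $k$, $\tilde{a}_j$ is the accuracy of a randomly initialised reference model trained on task $j$ alone, and $a_k^{*}$ is the accuracy on task $k$ of a reference model trained jointly on the tasks.

```latex
\mathrm{FWT}_k = \frac{1}{k-1} \sum_{j=2}^{k} \left( a_{j,j} - \tilde{a}_j \right),
\qquad
\mathrm{IM}_k = a_k^{*} - a_{k,k}
```

Read this way, FWT is positive when earlier tasks give the model a head start on task $j$, and IM shrinks as the incrementally trained model gets closer to the jointly trained reference.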

Network Compression

Network Pruning

Weight Pruning

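Pruning at the level of individual parameters is hard to put into practice, because the pruned network becomes irregular, which makes it hard to implement and to accelerate

If the useless parameters are simply set to 0, the network does not actually get any smaller, and no speed-up is achieved.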

Neuron Pruning

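Pruning at the level of neurons is easy to implement and to accelerate

Why not directly train a neural network with fewer parameters?

Reason: a large network is usually easier to train. A network obtained by pruning a large network tends to perform better than a small network with the same number of parameters that was trained directly.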
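
The contrast between the two pruning granularities can be sketched in a few lines. This is an illustration only: the importance criteria (weight magnitude, per-neuron L1 norm) and the names `weight_prune`, `neuron_prune`, and `W` are assumptions, not something the notes specify.

```python
def weight_prune(weights, threshold):
    """Weight pruning: zero out small-magnitude weights.
    The matrix keeps its shape, so a dense layer is no smaller or faster."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def neuron_prune(weights, keep_ratio):
    """Neuron pruning: drop whole output neurons (rows) with the smallest
    L1 norm, leaving a smaller but still regular dense matrix."""
    norms = [sum(abs(w) for w in row) for row in weights]
    n_keep = max(1, int(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve the original neuron order
    return [weights[i] for i in keep]


# Hypothetical layer with 4 output neurons and 3 inputs each.
W = [
    [0.90, -0.80, 0.70],
    [0.01, 0.02, -0.01],
    [0.50, -0.03, 0.60],
    [0.02, 0.01, 0.03],
]

masked = weight_prune(W, threshold=0.1)
smaller = neuron_prune(W, keep_ratio=0.5)
print(len(masked), len(masked[0]))    # same 4 x 3 shape, only sparser
print(len(smaller), len(smaller[0]))  # 2 x 3: a genuinely smaller layer
```

In a real network, removing a neuron also means removing the matching input column of the next layer, which this sketch leaves out.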

Lottery Ticket Hypothesis

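A large network can be viewed as containing many sub-networks, and as long as one sub-network succeeds, the large network succeeds too. This is like buying many lottery tickets instead of one: the chance of winning is higher.

The random initialization of the parameters also matters: a good initialization leads to good results. Even a small network with exactly the same structure as the pruned model may fail to train because of its initialization. The pruned network inherits the corresponding parameters from the large network, so it does not have this problem.

Preserving the signs of the parameters is key

A sub-network obtained by directly removing neurons from the large network can achieve decent results even without fine-tuning.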

Methods

Regularization-Based Approach

Selective Synaptic Plasticity / Weight Regularization

regularization based approach

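The larger a parameter's importance weight, the more important that parameter is, so it should change as little as possible.

How is it computed?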

EWC（Elastic Weight Consolidation）

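EWC estimates this importance with the Fisher information matrix (FIM).

Each time the model has finished learning an old task A and is about to learn a new task B, it computes the FIM from its current parameters and the training set of A to measure how important each parameter is. Then, while learning task B, a regularization term based on the FIM is added to the loss function.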
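
A minimal sketch of that procedure, assuming the common diagonal approximation of the FIM (the mean squared per-sample gradient of the log-likelihood on task A). The function names, the weight `lam`, and all numbers are hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal of the FIM as the mean squared gradient over
    samples from task A. Each entry is one gradient vector (a list)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_a, fisher, lam):
    """Regularization term: (lam / 2) * sum_i F_i * (theta_i - theta_A_i)^2."""
    return 0.5 * lam * sum(
        f * (t - ta) ** 2 for f, t, ta in zip(fisher, theta, theta_a)
    )


def ewc_loss(loss_b, theta, theta_a, fisher, lam):
    """Loss used while learning task B: task-B loss plus the EWC penalty."""
    return loss_b + ewc_penalty(theta, theta_a, fisher, lam)


# Hypothetical per-sample gradients on task A for a 2-parameter model:
# parameter 0 has large gradients (important), parameter 1 small ones.
grads_a = [[2.0, 0.1], [-2.0, -0.1]]
fisher = diagonal_fisher(grads_a)
theta_a = [1.0, 1.0]  # parameters after finishing task A

move_important = ewc_penalty([1.5, 1.0], theta_a, fisher, lam=1.0)
move_unimportant = ewc_penalty([1.0, 1.5], theta_a, fisher, lam=1.0)
print(move_important, move_unimportant)
```

Moving the important parameter by 0.5 costs far more than moving the unimportant one by the same amount, which is how the penalty steers task-B learning toward parameters that task A does not rely on.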

Synaptic Intelligence (SI) • https://arxiv.org/abs/1703.04200

Memory Aware Synapses (MAS) • https://arxiv.org/abs/1711.09601

RWalk • https://arxiv.org/abs/1801.10112

Sliced Cramer Preservation (SCP) • https://openreview.net/forum?id=BJge3TNKwH

Optimization-Based Approach

Gradient Projection

GEM (Gradient Episodic Memory)

Cited from the Hung-Yi Lee ML 2021 Spring slides

need the data from the previous tasks
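
The projection idea can be sketched for the simplest case of a single reference gradient. This is an assumption-laden simplification: full GEM keeps one constraint per previous task and solves a small quadratic program, whereas the closed form below covers only one constraint (the form used by the A-GEM variant). The names `project_gradient`, `g_new`, and `g_old` are hypothetical; `g_old` stands for a gradient computed on stored samples from a previous task, which is why the method needs old data.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_gradient(g, g_ref):
    """Return g unchanged if it does not conflict with g_ref. Otherwise
    remove the conflicting component so the result has a non-negative
    inner product with g_ref."""
    gg = dot(g, g_ref)
    if gg >= 0:
        return list(g)
    scale = gg / dot(g_ref, g_ref)
    return [a - scale * b for a, b in zip(g, g_ref)]


# Hypothetical gradients: the new-task gradient points partly against
# the direction that helps the old task.
g_new = [1.0, -1.0]
g_old = [0.0, 1.0]

g_proj = project_gradient(g_new, g_old)
print(g_proj, dot(g_proj, g_old))
```

After projection the update no longer points against the old-task gradient, so to first order the step does not increase the loss on the stored old-task samples.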

Meta-Learning

Architecture-Based Approach

Parameter Allocation

Parameter allocation features an isolated parameter subspace dedicated to each task throughout the network, where the architecture can be fixed or dynamic in size.

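LoRA (Low-Rank Adaptation)

Train a separate, small LoRA adapter for each new task while keeping the weights of the large pretrained model (the backbone) completely unchanged.

It adapts to a new task by injecting two small, trainable low-rank matrices (A and B) into each layer of the pretrained model, i.e. W' = W₀ + BA. During fine-tuning only the parameters of A and B are updated, while the original, very large weight matrix W₀ stays frozen.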

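For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ and rank $r$, the number of trainable parameters drops from $d \times k$ to $r(d + k)$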
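
A quick arithmetic check of the saving, assuming a single weight matrix of shape d × k with B of shape d × r and A of shape r × k. The sizes below are hypothetical and chosen only for illustration.

```python
def full_params(d, k):
    """Trainable parameters when the whole d x k matrix is fine-tuned."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters with LoRA: B is d x r and A is r x k."""
    return r * (d + k)


d, k, r = 4096, 4096, 8  # hypothetical layer size and rank
print(full_params(d, k))     # 16777216
print(lora_params(d, k, r))  # 65536
print(full_params(d, k) // lora_params(d, k, r))  # 256
```

For this layer the adapter trains 256 times fewer parameters than full fine-tuning, and the gap widens as the rank gets smaller relative to d and k.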

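Inference

When a specific task needs to be executed, for example Task 2, the model will:

Load the shared, frozen pretrained model W₀.

Load and activate the LoRA_2 adapter trained specifically for Task 2.

Use the effective weight W₀ + (B₂A₂) to make predictions.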
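
These inference steps can be sketched with a toy single-layer model. Everything here is hypothetical (the class `LoRAModel`, its methods, and the 2 × 2 weights); it is not the API of any LoRA library, only an illustration of one frozen W₀ shared by several per-task (B, A) pairs.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x n) and b (n x p)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class LoRAModel:
    def __init__(self, w0):
        self.w0 = w0        # shared pretrained weight, never modified
        self.adapters = {}  # task id -> (B, A)

    def add_adapter(self, task_id, b, a):
        self.adapters[task_id] = (b, a)

    def effective_weight(self, task_id):
        b, a = self.adapters[task_id]
        return add(self.w0, matmul(b, a))  # W0 + B A, built as a new matrix

    def predict(self, task_id, x):
        w = self.effective_weight(task_id)
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


model = LoRAModel([[1.0, 0.0], [0.0, 1.0]])
# Rank-1 adapters: B is 2 x 1, A is 1 x 2.
model.add_adapter(1, [[1.0], [0.0]], [[0.0, 1.0]])
model.add_adapter(2, [[0.0], [1.0]], [[2.0, 0.0]])

y1 = model.predict(1, [1.0, 1.0])
y2 = model.predict(2, [1.0, 1.0])
print(y1, y2)
```

Because W₀ is never written to, switching tasks only swaps which small adapter is added, so training a new adapter cannot overwrite what earlier tasks learned.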

Additional Neural Resource Allocation

Progressive Neural Networks

PackNet

CPG (Compacting, Picking, and Growing)

Replay-Based Approach

Generative Replay

Generating Data

generative model

Representation-Based Approach

Curriculum Learning

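The order in which tasks are learned also changes how much forgetting occurs

The research direction that studies the order in which tasks are learned is called curriculum learning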
