Improving multimodal data labeling: fewer assessors, more layers

Hello! We, researchers from the ITMO machine learning lab and the Core ML team at VKontakte, are carrying out joint research. One of VK's important tasks is the automatic classification of posts: it is needed not only to build thematic feeds but also to detect unwanted content. Assessors are brought in to label posts for this purpose. The cost of their work can be cut substantially with a machine learning paradigm called active learning.

This article is about applying active learning to the classification of multimodal data. We will cover its general principles and methods, the specifics of applying them to our task, and the insights we gained during the research.

Introduction

— machine learning, . , , , .

, (, Amazon Mechanical Turk, .) . — reCAPTCHA, , , , — Google Street View. — .

. , Voyage — , . , , . , .

Amazon DALC (Deep Active Learning from targeted Crowds). , . Monte Carlo Dropout ( ). — noisy annotation. , « , », .

Amazon . : / . , , . , : , . .

— ! , . pool-based sampling.

Fig. 1. General scheme of a pool-based active learning scenario

. , , ( ). : , .

, — . (. — query). , . ( , ) .

, , .

, — . ( ). ≈250 . . () 50 — — :

, (. embedding), ;
.

, (. . 2).

Fig. 2.

ML — . , .

. , . , , , . , , early stopping. , .

. residual , highway , (. encoder). , (. fusion): , .

— , . -.

, — , . , .

. , (. 3):

Fig. 3.

. , . , , . , ( + ) — .

, . 3, :

Fig. 4.

, , . , ó , , .

, : ? :

;
;
.

. : maximum likelihood , - . :

L = \frac{1}{σ_{1}^{2}} L_{1} + \frac{1}{σ_{2}^{2}} L_{2} + \frac{1}{σ_{3}^{2}} L_{3} + \log σ_{1} + \log σ_{2} + \log σ_{3}

$L_{1}, L_{2}, L_{3}$ — ( -), $σ_{1}, σ_{2}, σ_{3}$ — , .
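The combined loss above can be evaluated numerically as a quick sanity check. This is a minimal sketch, not the training code used in the study: the function name `combined_loss`, the choice to parameterize each weight by $\log σ_{i}$ (a common trick to keep $σ_{i}$ positive), and the example loss values are all assumptions made for illustration.

```python
import numpy as np

def combined_loss(losses, log_sigmas):
    """Evaluate L = sum_i (L_i / sigma_i^2 + log sigma_i).

    losses     -- the individual loss values L_1, L_2, L_3
    log_sigmas -- log(sigma_i) for each loss (hypothetical parameterization)
    """
    losses = np.asarray(losses, dtype=float)
    log_sigmas = np.asarray(log_sigmas, dtype=float)
    precisions = np.exp(-2.0 * log_sigmas)  # equals 1 / sigma_i^2
    return float(np.sum(precisions * losses + log_sigmas))

# Illustrative loss values only; with sigma_i = 1 the result is the plain sum.
total_unit = combined_loss([0.9, 1.4, 0.7], [0.0, 0.0, 0.0])

# Raising sigma_2 shrinks the contribution of L_2 but pays a log(sigma_2) penalty.
total_scaled = combined_loss([0.9, 1.4, 0.7], [0.0, np.log(2.0), 0.0])
```

In a real model the $\log σ_{i}$ values would be trainable variables updated together with the network weights; here they are fixed numbers so the arithmetic is easy to follow.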

Pool-based sampling

— , . pool-based sampling :

- .
.
, , .
.
( ).
3–5 (, ).

, 3–6 — .

, , :

, . , : . , , , . . , 2 000.
. , . ( ). , , . , . 20 .

. , . — , . 100 200.

, , , .

№1: batch size

baseline , ( ) (. 5).

. 5. baseline- .

random state. .

. «» , , .

, (. batch size). 512 — - (50). , batch size . . :

upsample, ;
, .

batch size: (1).

\mathrm{current\_batch\_size} = b + \left\lfloor \frac{n \bmod b}{\lfloor n / b \rfloor} \right\rfloor [1]

where $b$ is the base batch size and $n$ is the current size of the training set.
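Formula (1) is easy to check with integer arithmetic. The sketch below is an illustration only: the function name `flexible_batch_size` and the fallback for $n < b$ are assumptions, and the sample sizes are chosen just to show the effect.

```python
def flexible_batch_size(n, b):
    """Batch size from formula (1): b + floor((n mod b) / floor(n / b)).

    n -- current number of training samples
    b -- base batch size
    """
    if n < b:
        # Assumption: with fewer samples than one base batch, use them all.
        return n
    full_batches = n // b   # number of batches of the base size
    remainder = n % b       # samples that would form a small final batch
    return b + remainder // full_batches

size_a = flexible_batch_size(2100, 512)  # 4 full batches, remainder 52 -> 512 + 13
size_b = flexible_batch_size(2000, 512)  # 3 full batches, remainder 464 -> 512 + 154
size_c = flexible_batch_size(1024, 512)  # no remainder, batch size unchanged
```

The remainder is spread evenly over the full batches, so the small trailing batch either disappears or shrinks to fewer samples than there are batches.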

“” (. 6).

. 6. batch size (passive ) (passive + flexible )

: c . , , batch size . .

Uncertainty

— uncertainty sampling. , , .

1. Least confident sampling

, :

x_{LC}^{*} = \underset{x}{\arg\max} \left( 1 - P_{θ}(\hat{y} \mid x) \right) [2]

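where $\hat{y} = \underset{y}{\arg\max} P_{θ}(y \mid x)$ is the most probable class under the model, $y$ is a class label, $x$ is an object from the pool, and $x_{LC}^{*}$ is the object selected by the least confident criterion.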

. , $1 - \hat{y}$ . , . .

. , : {0,5; 0,49; 0,01}, — {0,49; 0,255; 0,255}. , (0,49) , (0,5). , ó : . , .
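The two probability vectors quoted above make a convenient test case for criterion [2]. The following is a minimal NumPy sketch; the function name `least_confident_scores` and the array layout (one row per object, one column per class) are assumptions for illustration.

```python
import numpy as np

def least_confident_scores(probs):
    """Score each object by 1 - P(y_hat | x); higher means less confident."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs.max(axis=1)

# The two predicted distributions from the text.
probs = np.array([
    [0.50, 0.49, 0.01],
    [0.49, 0.255, 0.255],
])

scores = least_confident_scores(probs)
query_index = int(np.argmax(scores))  # object sent to the assessors
```

The second object is selected because its top probability (0.49) is lower than the first object's (0.5), even though the first object is almost evenly split between two classes. The criterion looks only at the top probability and ignores how the rest of the mass is distributed.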

2. Margin sampling

, , , :

x_{M}^{*} = \underset{x}{\arg\min} \left( P_{θ}(\hat{y}_{1} \mid x) - P_{θ}(\hat{y}_{2} \mid x) \right) [3]

where $\hat{y}_{1}$ is the most probable class for object $x$ and $\hat{y}_{2}$ is the second most probable class.
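Criterion [3] can be sketched in a few lines of NumPy. The function name `margin_scores` is an assumption, and the probability vectors are the same illustrative pair used in the least confident discussion, reused here to show how the two criteria can disagree.

```python
import numpy as np

def margin_scores(probs):
    """Difference between the two largest class probabilities per object."""
    probs = np.asarray(probs, dtype=float)
    sorted_probs = np.sort(probs, axis=1)            # ascending within each row
    return sorted_probs[:, -1] - sorted_probs[:, -2]

probs = np.array([
    [0.50, 0.49, 0.01],
    [0.49, 0.255, 0.255],
])

margins = margin_scores(probs)
query_index = int(np.argmin(margins))  # smallest margin is queried first
```

Here the first object wins: its top two classes are separated by only 0.01, whereas the second object has a margin of 0.235.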

, . , . , , MNIST ( ) — , . .

3. Entropy sampling

x_{H}^{*} = \underset{x}{\arg\max} \left( - \sum_{i} P_{θ}(y_{i} \mid x) \log P_{θ}(y_{i} \mid x) \right) [4]

where $y_{i}$ is the $i$-th possible class for object $x$.
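Criterion [4] is the Shannon entropy of the predicted distribution. A minimal sketch follows; the function name `entropy_scores`, the clipping constant `eps`, and the example vectors are assumptions for illustration.

```python
import numpy as np

def entropy_scores(probs, eps=1e-12):
    """Shannon entropy (natural log) of each row of class probabilities."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(probs * np.log(probs), axis=1)

probs = np.array([
    [0.50, 0.49, 0.01],    # mass concentrated on two classes
    [0.49, 0.255, 0.255],  # mass spread over all three classes
    [1 / 3, 1 / 3, 1 / 3], # uniform: the maximum-entropy case
])

entropies = entropy_scores(probs)
query_index = int(np.argmax(entropies[:2]))  # compare the first two objects only
```

Unlike margin sampling, entropy takes every class into account, so the second vector scores higher than the first because its probability mass is spread more evenly.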

, , . :

, , ;
, .

, , . , entropy sampling .

(. 7).

Fig. 7. Uncertainty sampling

, least confident entropy sampling , . margin sampling .

, , : MNIST. , , entropy sampling , . , .

. $O (p \log q)$ , $p$ — , $q$ — . , .

BALD

, , — BALD sampling (Bayesian Active Learning by Disagreement). .

, query-by-committee (QBC). — . uncertainty sampling. , . QBC Monte Carlo Dropout, .

, , — . dropout . dropout , ( ). , dropout- (. 8). Monte Carlo Dropout (MC Dropout) . , . ( dropout) Mutual Information (MI). MI , , — , . .

. 8. MC Dropout BALD

, QBC MC Dropout uncertainty sampling. , (. 9).

Fig. 9. Uncertainty sampling strategies with QBC

BALD. , Mutual Information :

a_{BALD} = H(y_{1}, \ldots, y_{n}) - E[H(y_{1}, \ldots, y_{n} \mid ω)] [5]

E[H(y_{1}, \ldots, y_{n} \mid ω)] = \frac{1}{k} \sum_{i=1}^{n} \sum_{j=1}^{k} H(y_{i} \mid ω_{j}) [6]

$n$ — , $k$ — .
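One common reading of [5] and [6] scores each pool object separately: the entropy of the prediction averaged over dropout passes, minus the average entropy of the individual passes. The sketch below follows that per-object reading. The function name `bald_scores`, the array layout (passes, objects, classes), and the toy probabilities are assumptions for illustration, not the tf.keras implementation mentioned in the article.

```python
import numpy as np

def bald_scores(mc_probs, eps=1e-12):
    """Mutual information between predictions and weights, per object.

    mc_probs -- array of shape (k, p, c): class probabilities from k
                stochastic forward passes over p objects and c classes
    """
    mc_probs = np.asarray(mc_probs, dtype=float)
    mean_probs = mc_probs.mean(axis=0)  # prediction averaged over passes
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)
    per_pass_entropy = -np.sum(mc_probs * np.log(mc_probs + eps), axis=2)
    expected_entropy = per_pass_entropy.mean(axis=0)
    return predictive_entropy - expected_entropy

# Two passes, two objects, two classes (made-up numbers).
mc_probs = np.array([
    [[0.5, 0.5], [0.9, 0.1]],  # pass 1
    [[0.5, 0.5], [0.1, 0.9]],  # pass 2
])

scores = bald_scores(mc_probs)
query_index = int(np.argmax(scores))
```

Both objects have the same averaged prediction (0.5, 0.5), so entropy sampling could not tell them apart. BALD prefers the second one because the passes disagree about it, while for the first object every pass is equally unsure and the mutual information is zero.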

(5) , — . , , . BALD . 10.

Fig. 10. BALD

, , .

query-by-committee BALD , . , uncertainty sampling. , — $O (k p \log (q))$ , $p$ — , $q$ — , $k$ — , .

BALD tf.keras, . PyTorch, dropout , batch normalization , .

№2: batch normalization

batch normalization. batch normalization — , . , , , , batch normalization. , . , . BALD. (. 11).

. 11. batch normalization BALD

, , .

batch normalization, . , .

Learning loss

. , . , .

, . — . , . learning loss, . , (. 12).

Fig. 12. Learning loss

learning loss . .

. , . «» learning loss: , , . ideal learning loss (. 13).

Fig. 13. Ideal learning loss

, learning loss.

, . , , - , . :

(2000 ), ;
10000 ( );
;
;
100 ;
, , 1;
.

, , . , ( margin sampling).

1.

		p-value
loss	-0,2518	0,0115
margin	0,2461	0,0136

, margin sampling — , , , . c .

: ?

, , (. 14).

. 14. ideal learning loss ideal learning loss

, MNIST :

2. MNIST

		p-value
loss	0,2140	0,0326
	0,2040	0,0418

ideal learning loss , (. 15).

Fig. 15. Active learning of a character classifier on the MNIST dataset with the ideal learning loss strategy. Blue curve: ideal learning loss; orange curve: passive learning

, , , , . .

learning loss , uncertainty sampling: $O (p \log q)$ , $p$ — , $q$ — . , , . , .

, . . , margin sampling — . 16.

Fig. 16. Comparison of training on randomly selected data (passive learning) and on data selected with the margin sampling strategy

: ( — margin sampling), — , , . ≈25 . . 25% — .

, . , , .

, , . , :

batch size;
, , — , batch normalization.
