Learning on tabular data. TabNet. Part 1

We would like to present a translation of an interesting article on neural network learning for tabular data. The second part is here.

Abstract

We present TabNet, a new high-performance and interpretable canonical deep learning architecture for tabular data. TabNet uses sequential attention to choose which features to reason from at each decision step. This enables interpretability and more efficient learning, because the learning capacity is spent on the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of tabular datasets, and that it yields interpretable feature attributions as well as insight into the global behavior of the model. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, which significantly improves performance when a sufficiently large amount of unlabeled data is available.

1. Introduction

Deep neural networks (DNNs) have shown notable success with images [21, 50], text [9, 34] and audio [1, 56]. For these data types, the main driver of progress is the availability of canonical architectures that efficiently encode raw data into meaningful representations, so that high performance on new datasets and related tasks can be reached with little extra effort. For example, in image understanding, variants of residual convolutional networks (in particular, ResNet [21]) provide reasonably good performance on new image datasets or related visual recognition problems (for example, classification or taxonomy). One data type that has yet to see such success with a canonical DNN architecture is tabular data. Although it is the most common data type in real-world AI [8], deep learning for tabular data remains under-explored, and variants of ensemble decision trees (DTs) still dominate most applications [28].

Why is this so? First, tree-based approaches have certain benefits that make them popular: (i) they are representationally efficient (and therefore often high-performing) for decision manifolds with approximately hyperplane boundaries, which are common in tabular data; (ii) they are highly interpretable in their basic form (for example, by tracking decision nodes), and there are effective post-hoc explainability methods for their ensemble form, such as [36], which is an important concern in many real-world applications (for example, in financial services, where trust in a high-risk action is critical); (iii) they are fast to train. Second, previously proposed DNN architectures are not well suited to tabular data: conventional DNNs based on stacked convolutional layers or multi-layer perceptrons (MLPs) are vastly overparametrized, and the lack of an appropriate inductive bias often causes them to fail to find optimal solutions for tabular decision manifolds [17].

Why study deep learning for tabular data? One obvious reason is that, as in other domains, performance improvements can be expected from DNN-based architectures, particularly for large datasets [22]. In addition, unlike tree learning, which does not use backpropagation of an error signal to guide efficient learning, DNNs enable gradient descent-based end-to-end learning for tabular data. This has many benefits demonstrated in other domains: (i) efficiently encoding multiple data types, such as images, along with tabular data; (ii) alleviating or eliminating the need for feature engineering, which is currently a key aspect of tree-based tabular learning methods; (iii) learning from streaming data: tree learning needs global statistics to select split points, and straightforward modifications such as [4] typically yield lower accuracy than learning from the full data at once, whereas DNNs show great potential for continual learning [44]; (iv) representation learning in end-to-end models, which enables many valuable new application scenarios, including data-efficient domain adaptation [17], generative modeling [46] and semi-supervised learning [11].

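In this paper we propose TabNet, a new canonical DNN architecture for tabular data. The main contributions are summarized as follows: (1) TabNet takes raw tabular data as input without any preprocessing and is trained with gradient descent-based optimization, which allows flexible integration into end-to-end learning; (2) TabNet uses sequential attention to choose which features to reason from at each decision step, which enables interpretability and better learning because the learning capacity is used for the most salient features (see Fig. 1); this feature selection is instance-wise, that is, it can differ for each input, and, unlike other instance-wise feature selection methods such as [6] or [61], TabNet uses a single deep learning architecture for both feature selection and reasoning.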

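Fig. 1. TabNet's sparse feature selection, illustrated for Adult Census Income prediction [14]. Sparse feature selection enables interpretability and better learning, because the capacity is used for the most salient features. TabNet employs multiple decision blocks, each of which focuses on processing a subset of the input features. The two decision blocks shown as examples process features related to professional occupation and to investments, respectively, in order to predict the income level.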

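(3) These design choices lead to two valuable properties: (a) TabNet outperforms or is on par with other tabular learning models on various datasets for classification and regression problems from different domains; (b) TabNet enables two kinds of interpretability: local interpretability, which visualizes the importance of features and how they are combined, and global interpretability, which quantifies the contribution of each feature to the trained model.

(4) Finally, for the first time for tabular data, we show significant performance improvements from unsupervised pre-training to predict masked features (see Fig. 2).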

2. Related work

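Feature selection: Feature selection broadly refers to judiciously picking a subset of features based on their usefulness for prediction. Commonly used techniques, such as forward selection and LASSO regularization [20], attribute feature importance based on the entire training data and are referred to as global methods. Instance-wise feature selection refers to picking features individually for each input; it is studied in [6] with an explainer model that maximizes the mutual information between the selected features and the response variable, and in [61] with an "actor-critic" framework that mimics a baseline while optimizing the selection. Unlike these, TabNet employs soft feature selection with controllable sparsity in end-to-end learning: a single model jointly performs feature selection and output mapping, which results in better performance with compact representations.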

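Tree-based learning: DTs are commonly used for tabular data learning. Their prominent strength is the efficient picking of global features with the most statistical information gain [18]. To improve the performance of standard DTs, a common approach is ensembling, which reduces variance. Among ensembling methods, random forests [23] use random subsets of the data with randomly selected features to grow many trees. XGBoost [7] and LightGBM [30] are two recent ensemble DT approaches that dominate most recent data science competitions. Our experimental results on various datasets show that tree-based models can be outperformed when the representation capacity is improved with deep learning while their feature selection property is retained.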

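Integration of DNNs into DTs: Representing DTs with DNN building blocks, as in [26], yields redundancy in the representation and inefficient learning. Soft (neural) DTs [33, 58] use differentiable decision functions instead of non-differentiable axis-aligned splits. However, losing automatic feature selection often degrades performance. In [60], a soft binning function is proposed to simulate DTs in DNNs, at the cost of inefficiently enumerating all possible decisions. [31] proposes a DNN architecture that explicitly leverages expressive feature combinations; however, its learning is based on transferring knowledge from gradient-boosted DTs. [53] proposes a DNN architecture that grows adaptively from "primitive blocks" (edges, routing functions and leaf nodes), with representation learning built into them. TabNet differs from these approaches in that it embeds soft feature selection with controllable sparsity via sequential attention.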

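Attentive table-to-text models: Table-to-text models extract textual information from tabular data, and recent works [3, 35] propose a sequential mechanism for field-level attention. Despite the high-level similarities, our goal is the final classification or regression task on tabular data rather than mapping to a different data type.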

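Self-supervised learning: Unsupervised representation learning improves supervised learning, especially in the small-data regime [47]. Recent work for text [13] and image [55] data has shown significant advances, driven by a judicious choice of the unsupervised learning objective (masked input prediction) and attention-based deep learning.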

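Fig. 3. Illustration of DT-like classification using conventional DNN blocks (left) and the corresponding decision manifold (right). Relevant features are selected by applying multiplicative sparse masks to the inputs. The selected features are linearly transformed and, after a bias addition (to represent boundaries), ReLU performs region selection by zeroing the regions. Aggregation of multiple regions is based on addition. As C1 and C2 get larger, the decision boundary produced by Softmax gets sharper.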

3. TabNet

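DTs are successful at learning from real-world tabular datasets. With a specific design, however, conventional DNN building blocks can be used to implement a DT-like output manifold (see Fig. 3 for an example). In such a design, individual feature selection is key to obtaining decision boundaries in hyperplane form, and it can be generalized to a linear combination of features in which the coefficients determine the proportion of each feature. TabNet is based on this functionality, and it outperforms DTs while keeping their benefits through a careful design that:

(i) uses sparse instance-wise feature selection learned from data; (ii) constructs a sequential multi-step architecture, where each step contributes to a portion of the decision based on the selected features; (iii) improves the learning capacity via nonlinear processing of the selected features; (iv) mimics ensembling via higher dimensions and more steps.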
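To make the decision-tree analogy concrete, the following pure-Python sketch builds a DT-like classifier from the blocks named for Fig. 3: a multiplicative mask that keeps one feature, a linear layer with bias, ReLU, addition, and Softmax. The thresholds `a` and `d`, the function name `dt_like_logits`, and the exact weight layout are assumptions chosen for illustration; they are not taken from the article.

```python
import math


def relu(values):
    """Zero out negative entries."""
    return [max(v, 0.0) for v in values]


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dt_like_logits(x1, x2, a, d, c1, c2):
    """Four region scores: x1 > a, x1 < a, x2 > d, x2 < d.

    Branch 1 corresponds to mask [1, 0] (only x1 passes through),
    branch 2 to mask [0, 1] (only x2). The -1 biases keep the units
    that belong to the other branch switched off after ReLU.
    Thresholds a and d are hypothetical split points.
    """
    branch1 = relu([c1 * x1 - c1 * a, -c1 * x1 + c1 * a, -1.0, -1.0])
    branch2 = relu([-1.0, -1.0, c2 * x2 - c2 * d, -c2 * x2 + c2 * d])
    # Aggregation of the two branches is a plain addition.
    return [p + q for p, q in zip(branch1, branch2)]


# A point with x1 above its threshold and x2 below its threshold.
logits = dt_like_logits(x1=2.0, x2=0.0, a=1.0, d=1.0, c1=10.0, c2=10.0)
probs = softmax(logits)
```

Larger values of `c1` and `c2` scale the active region scores up, so the softmax output concentrates more strongly on the regions the point falls into.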

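Fig. 4. (a) TabNet encoder, composed of a feature transformer, an attentive transformer and feature masking. A split block divides the processed representation so that it can be used by the attentive transformer of the subsequent step as well as for the overall output. For each step, the feature selection mask provides interpretable information about the model's functionality, and the masks can be aggregated to obtain a global feature importance attribution. (b) TabNet decoder, composed of a feature transformer block at each step. (c) A feature transformer block example: a 4-layer network is shown, where 2 layers are shared across all decision steps and 2 are decision step-dependent. Each layer is composed of a fully-connected (FC) layer, batch normalization (BN) and a gated linear unit (GLU) nonlinearity. (d) An attentive transformer block example: a single-layer mapping is modulated with prior scale information, which aggregates how much each feature has been used before the current decision step. sparsemax [37] is used to normalize the coefficients, which results in a sparse selection of the salient features.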

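Fig. 4 shows the TabNet architecture for encoding tabular data. Raw numerical features are used as they are, and categorical features are mapped with trainable embeddings. No global feature normalization is applied, only batch normalization (BN). The same D-dimensional features

$f \in \mathbb{R}^{B \times D}$

are passed to each decision step, where B is the batch size. TabNet's encoding is based on sequential multi-step processing with N_steps decision steps.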

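The i-th step takes the processed information from the (i - 1)-th step as input, decides which features to use, and outputs the processed feature representation to be aggregated into the overall decision. The idea of top-down attention in sequential form is inspired by its applications in processing visual and text data (for example, [25]) and in reinforcement learning [40], where a small subset of relevant information is searched for in a high-dimensional input.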

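Feature selection: A learnable mask

$M[i] \in \mathbb{R}^{B \times D}$

is used for soft selection of the salient features. Through sparse selection of the most salient features, the learning capacity of a decision step is not wasted on irrelevant ones, so the model becomes more parameter-efficient. The masking is multiplicative, M[i] · f. An attentive transformer (see Fig. 4) produces the masks from the processed features of the preceding step, a[i − 1]:

$M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1])) \quad (1)$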

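Sparsemax normalization [37] encourages sparsity by mapping the Euclidean projection onto the probabilistic simplex, which is observed to give better performance and is aligned with the goal of sparse feature selection for explainability.

Note that each row of the mask sums to 1:

$\sum_{j=1}^{D} M[i]_{b,j} = 1$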
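As an illustration of how sparsemax differs from softmax, here is a minimal pure-Python sketch of the sort-and-threshold formulation of sparsemax for a single score vector. The function name and the list-based interface are assumptions made for this example; a real model would apply the same operation row by row to a batch of scores.

```python
def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex.

    Returns non-negative weights that sum to 1; scores that fall below
    the threshold tau are mapped to exactly 0.
    """
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        # The k-th largest score stays in the support while this holds.
        if 1.0 + k * value > running_sum:
            tau = (running_sum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


mask_row = sparsemax([1.0, 0.8, -1.0])  # third feature gets weight 0.0
```

Because low scores are mapped to exactly zero rather than to a small positive value, the resulting mask acts as a feature selector and not only as a reweighting.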

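h_i is a trainable function, shown in Fig. 4 as an FC layer followed by BN. P[i] is the prior scale term, which denotes how much a particular feature has been used previously:

$P[i] = \prod_{j=1}^{i} (\gamma - M[j]) \quad (2)$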

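Here γ is a relaxation parameter: when γ = 1, a feature is forced to be used at only one decision step, and as γ increases, more flexibility is given to use a feature at multiple decision steps. P[0] is initialized as all ones, $\mathbf{1}^{B \times D}$, without any prior on the masked features. If some features are unused (as in self-supervised learning), the corresponding P[0] entries are set to 0 to help the model's learning. To further control the sparsity of the selected features, sparsity regularization in the form of entropy is used [19]:

$L_{sparse} = \sum_{i=1}^{N_{steps}} \sum_{b=1}^{B} \sum_{j=1}^{D} \frac{-M_{b,j}[i]}{N_{steps} \cdot B} \log(M_{b,j}[i] + \epsilon)$

where ϵ is a small number for numerical stability. The sparsity regularization is added to the overall loss with a coefficient λ. Sparsity provides a favorable inductive bias for datasets where most features are redundant.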
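The prior scale update of Eq. (2) and the entropy regularizer can be sketched in a few lines of pure Python. The function names, the nested-list layout `masks[i][b][j]`, and the default `eps` value are assumptions made for illustration only.

```python
import math


def update_prior(prior, mask, gamma):
    """One step of Eq. (2) for a single sample: P[i] = P[i-1] * (gamma - M[i])."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]


def sparsity_loss(masks, eps=1e-15):
    """Entropy of the masks, averaged over decision steps and batch.

    masks[i][b][j] holds M_{b,j}[i]; eps is an assumed small constant
    that keeps log() finite when a mask entry is exactly 0.
    """
    n_steps = len(masks)
    batch_size = len(masks[0])
    total = 0.0
    for step_mask in masks:
        for row in step_mask:
            for m in row:
                total += -m * math.log(m + eps)
    return total / (n_steps * batch_size)


# With gamma = 1.0 a feature that was fully selected gets prior 0.0,
# so it cannot be selected again at a later step.
prior = update_prior([1.0, 1.0, 1.0], [1.0, 0.0, 0.0], gamma=1.0)

# A one-hot mask has (near) zero entropy; a uniform mask over two
# features has entropy log(2).
peaked = sparsity_loss([[[1.0, 0.0]]])
spread = sparsity_loss([[[0.5, 0.5]]])
```

The loss is lowest when each mask row concentrates on few features, which is why adding it to the training objective pushes the model toward sparser selections.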

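Feature processing: The filtered features are processed by a feature transformer (see Fig. 4) and then split into the decision step output and the information for the subsequent step,

$[d[i], a[i]] = f_i(M[i] \cdot f)$, where $d[i] \in \mathbb{R}^{B \times N_d}$ and $a[i] \in \mathbb{R}^{B \times N_a}$.

For parameter-efficient and robust learning with high capacity, a feature transformer should comprise layers that are shared across all decision steps (since the same features are input at different decision steps), as well as decision step-dependent layers.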

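Fig. 4 shows an implementation as a concatenation of two shared layers and two decision step-dependent layers. Each FC layer is followed by BN and a gated linear unit (GLU) nonlinearity [12], and is connected to a normalized residual connection. Normalization with √0.5 helps stabilize learning by ensuring that the variance throughout the network does not change dramatically [15]. For faster training, large batch sizes are used with BN. Thus, except for the BN applied to the input features, ghost BN [24] is used, with a virtual batch size B_V and momentum m_B. For the input features, low-variance averaging is beneficial, so ghost BN is avoided there. Finally, inspired by the decision-tree-like aggregation in Fig. 3, the overall decision embedding is constructed as

$d_{out} = \sum_{i=1}^{N_{steps}} \mathrm{ReLU}(d[i])$

A linear mapping

$W_{final} d_{out}$

is applied to obtain the output. For discrete outputs, softmax is additionally used during training (and argmax during inference).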
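The aggregation of the per-step outputs and the final mapping can be sketched as follows for a single sample. All function names and the list-of-rows layout of `w_final` are assumptions for this example, and the numbers are made up purely to exercise the code.

```python
import math


def relu(values):
    return [max(v, 0.0) for v in values]


def aggregate(step_outputs):
    """d_out = sum over steps of ReLU(d[i]); step_outputs[i] is d[i]."""
    d_out = [0.0] * len(step_outputs[0])
    for d_i in step_outputs:
        d_out = [acc + r for acc, r in zip(d_out, relu(d_i))]
    return d_out


def linear(w_final, d_out):
    """Matrix-vector product W_final d_out, one output per row of w_final."""
    return [sum(w * x for w, x in zip(row, d_out)) for row in w_final]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(step_outputs, w_final):
    """Return class probabilities (training view) and argmax (inference view)."""
    probs = softmax(linear(w_final, aggregate(step_outputs)))
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label


# Two decision steps with N_d = 2 and a hypothetical identity W_final.
probs, label = predict([[1.0, -2.0], [0.5, 3.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because ReLU is applied before the sum, a step can only add evidence to the decision embedding, which mirrors the additive region aggregation of the DT-like construction.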


[1] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, et al. 2015. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. arXiv:1512.02595 (2015).

[2] AutoML. 2019. AutoML Tables – Google Cloud. https://cloud.google.com/automl-tables/

[3] J. Bao, D. Tang, N. Duan, Z. Yan, M. Zhou, and T. Zhao. 2019. Text Generation From Tables. IEEE Trans Audio, Speech, and Language Processing 27, 2 (Feb 2019), 311–320.

[4] Yael Ben-Haim and Elad Tom-Tov. 2010. A Streaming Parallel Decision Tree Algorithm. JMLR 11 (March 2010), 849–872.

[5] Catboost. 2019. Benchmarks. https://github.com/catboost/benchmarks. Accessed: 2019-11-10.

[6] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. 2018. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. arXiv:1802.07814 (2018).

[7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In KDD.

[8] Michael Chui, James Manyika, Mehdi Miremadi, Nicolaus Henke, Rita Chung, et al. 2018. Notes from the AI Frontier. McKinsey Global Institute (4 2018).

[9] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann LeCun. 2016. Very Deep Convolutional Networks for Natural Language Processing. arXiv:1606.01781 (2016).

[10] Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. 2016. AdaNet: Adaptive Structural Learning of Artificial Neural Networks. arXiv:1607.01097 (2016).

[11] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan Salakhutdinov. 2017. Good Semi-supervised Learning that Requires a Bad GAN. arXiv:1705.09783 (2017).

[12] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language Modeling with Gated Convolutional Networks. arXiv:1612.08083 (2016).

[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2018).

[14] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

[15] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional Sequence to Sequence Learning. arXiv:1705.03122 (2017).

[16] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine Learning 63, 1 (01 Apr 2006), 3–42.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

[18] K. Grabczewski and N. Jankowski. 2005. Feature selection with decision tree criterion. In HIS.

[19] Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised Learning by Entropy Minimization. In NIPS.

[20] Isabelle Guyon and André Elisseeff. 2003. An Introduction to Variable and Feature Selection. JMLR 3 (March 2003), 1157–1182.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 (2015).

[22] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory F. Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409 (2017).

[23] Tin Kam Ho. 1998. The random subspace method for constructing decision forests. PAMI 20, 8 (Aug 1998), 832–844.

[24] Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv:1705.08741 (2017).

[25] Drew A. Hudson and Christopher D. Manning. 2018. Compositional Attention Networks for Machine Reasoning. arXiv:1803.03067 (2018).

[26] K. D. Humbird, J. L. Peterson, and R. G. McClarren. 2018. Deep Neural Network Initialization With Decision Trees. IEEE Trans Neural Networks and Learning Systems (2018).

[27] Mark Ibrahim, Melissa Louie, Ceena Modarres, and John W. Paisley. 2019. Global Explanations of Neural Networks: Mapping the Landscape of Predictions. arXiv:1902.02384 (2019).

[28] Kaggle. 2019. Historical Data Science Trends on Kaggle. https://www.kaggle.com/shivamb/data-science-trends-on-kaggle. Accessed: 2019-04-20.

[29] Kaggle. 2019. Rossmann Store Sales. https://www.kaggle.com/c/rossmann-store-sales. Accessed: 2019-11-10.

[30] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, et al. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In NIPS.

[31] Guolin Ke, Jia Zhang, Zhenhui Xu, Jiang Bian, and Tie-Yan Liu. 2019. TabNN: A Universal Neural Network Solution for Tabular Data. https://openreview.net/forum?id=r1eJssCqY7

[32] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In ICLR.

[33] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. 2015. Deep Neural Decision Forests. In ICCV.

[34] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In AAAI.

[35] Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2017. Table-to-text Generation by Structure-aware Seq2seq Learning. arXiv:1711.09724 (2017).

[36] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. 2018. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv:1802.03888 (2018).

[37] André F. T. Martins and Ramón Fernández Astudillo. 2016. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. arXiv:1602.02068 (2016).

[38] Rory Mitchell, Andrey Adinets, Thejaswi Rao, and Eibe Frank. 2018. XGBoost: Scalable GPU Accelerated Learning. arXiv:1806.11248 (2018).

[39] Decebal Mocanu, Elena Mocanu, Peter Stone, Phuong Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications 9 (12 2018).

[40] Alex Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, and Danilo J. Rezende. 2019. S3TA: A Soft, Spatial, Sequential, Top-Down Attention Model. https://openreview.net/forum?id=B1gJOoRcYQ

[41] Sharan Narang, Gregory F. Diamos, Shubho Sengupta, and Erich Elsen. 2017. Exploring Sparsity in Recurrent Neural Networks. arXiv:1704.05119 (2017).

[42] Nbviewer. 2019. Notebook on Nbviewer. https://nbviewer.jupyter.org/github/dipanjanS/data_science_for_all/blob/master/tds_model_interpretation_xai/Human-interpretableMachineLearning-DS.ipynb#

[43] N. C. Oza. 2005. Online bagging and boosting. In IEEE Trans Conference on Systems, Man and Cybernetics.

[44] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2018. Continual Lifelong Learning with Neural Networks: A Review. arXiv:1802.07569 (2018).

[45] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In NIPS.

[46] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 (2015).

[47] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-Taught Learning: Transfer Learning from Unlabeled Data. In ICML.

[48] Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In KDD.

[49] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Important Features Through Propagating Activation Differences. arXiv:1704.02685 (2017).

[50] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (2014).

[51] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2018. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. arXiv:1810.11921 (2018).

[52] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. arXiv:1703.01365 (2017).

[53] Ryutaro Tanno, Kai Arulkumaran, Daniel C. Alexander, Antonio Criminisi, and Aditya V. Nori. 2018. Adaptive Neural Trees. arXiv:1807.06699 (2018).

[54] Tensorflow. 2019. Classifying Higgs boson processes in the HIGGS Data Set. https://github.com/tensorflow/models/tree/master/official/boosted_trees

[55] Trieu H. Trinh, Minh-Thang Luong, and Quoc V. Le. 2019. Selfie: Self-supervised Pretraining for Image Embedding. arXiv:1906.02940 (2019).

[56] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, et al. 2016. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499 (2016).

[57] Sethu Vijayakumar and Stefan Schaal. 2000. Locally Weighted Projection Regression: An O(n) Algorithm for Incremental Real Time Learning in High Dimensional Space. In ICML.

[58] Suhang Wang, Charu Aggarwal, and Huan Liu. 2017. Using a random forest to inspire a neural network and improving on it. In SDM.

[59] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning Structured Sparsity in Deep Neural Networks. arXiv:1608.03665 (2016).

[60] Yongxin Yang, Irene Garcia Morillo, and Timothy M. Hospedales. 2018. Deep Neural Decision Trees. arXiv:1806.06988 (2018).

[61] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2019. INVASE: Instancewise Variable Selection using Neural Networks. In ICLR.
