Beyond Softmax: Sparsemax, Constrained Softmax, Differentiable Easy-First
In the first part of the talk, Andre will propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, he will show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, he will propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss, revealing an unexpected connection with the Huber classification loss. Andre will show promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference.
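Sparsemax can be computed in closed form as the Euclidean projection of the score vector onto the probability simplex, which is what gives it exactly sparse outputs. The sketch below is a minimal NumPy implementation of that projection via the standard sort-and-threshold procedure; it is an illustration of the idea, not the authors' code.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of scores z onto the probability simplex.

    Unlike softmax, coordinates whose score falls below the threshold
    tau receive exactly zero probability, giving a sparse output.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    # Support size: largest k with 1 + k * z_(k) > sum of top-k scores
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)
```

For example, `sparsemax([2.0, 0.0, 0.0])` puts all mass on the first coordinate and exact zeros elsewhere, whereas softmax would assign every coordinate nonzero probability.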
In the second part, Andre will introduce constrained softmax, another activation function that allows imposing upper-bound constraints on attention probabilities. Based on this activation, he will introduce a novel end-to-end differentiable easy-first decoder that learns to solve sequence tagging tasks in a flexible order. The decoder iteratively updates a "sketch" of the predictions over the sequence. The proposed models compare favourably to BiLSTM taggers on three sequence tagging tasks.
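One way to realise a softmax with per-coordinate upper bounds is to clip any coordinate that would exceed its bound and renormalise the remaining mass over the free coordinates, repeating until all constraints hold. The sketch below implements that iterative scheme in NumPy; it is a plausible illustration under the assumption that the bounds sum to at least one (so a feasible distribution exists), not the authors' implementation.

```python
import numpy as np

def constrained_softmax(z, u):
    """Softmax over scores z subject to upper bounds p_i <= u_i.

    Repeatedly clips violating coordinates to their bounds and
    renormalises the leftover probability mass over the free ones.
    Assumes sum(u) >= 1 so a feasible distribution exists.
    """
    z = np.asarray(z, dtype=float)
    u = np.asarray(u, dtype=float)
    e = np.exp(z - z.max())                  # stabilised exponentials
    p = np.zeros_like(e)
    free = np.ones(len(z), dtype=bool)
    while True:
        mass = 1.0 - u[~free].sum()          # mass left for free coords
        p[free] = mass * e[free] / e[free].sum()
        over = free & (p > u)
        if not over.any():
            return p
        p[over] = u[over]                    # clip violators to bounds
        free &= ~over
```

For instance, with uniform scores and bounds `u = [0.3, 1.0]`, the unconstrained answer `[0.5, 0.5]` violates the first bound, so the first coordinate is clipped to 0.3 and the remaining 0.7 goes to the second. In the easy-first decoder, such bounds let attention mass accumulate over iterations without any position being attended to more than its budget allows.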