Leveraging Human Knowledge for Better Statistical Generalization
Although we live in the era of big data, when billions or trillions of words are easy to obtain from the web, there are many scenarios in which we cannot be too data hungry. First, even in a billion-word corpus, there is a long tail of rare and out-of-vocabulary words. Second, language is not always paired with correlated events: corpora contain what people said, but not what they meant, how they understood things, or what they did in response to the language. Finally, the vast majority of the world's languages barely exist on the web at all.