Many current application domains of machine learning and artificial intelligence involve knowledge discovery from text, such as sentiment analysis, document ontology and spam detection. Humans have years of experience and training with language, enabling them to understand complicated, nuanced text passages with relative ease. Text classifiers attempt to emulate or replicate this knowledge so that computers can discriminate between concepts encountered in text. Learning high-level concepts from text, such as those found in many applications of text classification, is a difficult task due to the many challenges associated with text mining and classification. A common approach to this machine learning task is to find a way to represent text, such as through human-engineered features extracted from the text, and then to use this representation with training data to train a classifier. In the current text classification paradigm, human researchers create an approach for extracting features from text by examining words, phrases, parts of speech, and other morphological features, by employing a lexicon, and by mapping relations between words to determine concept families and semantic similarities. These different types of features may be used separately or combined into a single feature space in an attempt to build a representation capturing the complexity of language. Unfortunately, these approaches are often language or even task dependent [1] and may contain less information than the original text. Thus, this approach may yield feature engineering that is only applicable to a narrow range of text classification tasks.
One of the most commonly used methods of feature engineering for text is the bag-of-words approach. A word vector is constructed by parsing the data (text documents), identifying the words used throughout all documents, and creating a vector with one entry per word; values are then assigned to this vector for each instance or document based on word presence or frequency. While this approach has been demonstrated to be effective, it is far from ideal. Word placement within a sentence is not preserved, and contextual indicators that may change the meaning of a word are absent from the final feature space. Additionally, a very large number of words can be identified, many of which are infrequent or unique to a single document, leading to a high-dimensional, sparse feature space. This increases computational costs and can degrade performance due to overfitting. While high dimensionality may be addressed with feature selection techniques [2], this approach to text understanding eliminates the majority of the information that humans use when reading and writing text.
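As a concrete illustration, the following minimal Python sketch (with toy documents invented for illustration, not data from our experiments) builds frequency-valued bag-of-words vectors; note how word order and context are discarded and the vector length grows with the vocabulary.

from collections import Counter

documents = [
    "the movie was great",
    "the plot was thin and the acting was poor",
]

# Identify every word used across all documents (the shared vocabulary).
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_frequency_vector(doc):
    # Map one document to a vector with one entry per vocabulary word.
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_frequency_vector(doc) for doc in documents]
# Word order and context are lost; only per-word counts remain, and the
# vector length grows with the size of the vocabulary.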
More sophisticated approaches have been devised and combined with (or used in place of) the bag-of-words approach, such as grouping words by semantic concepts [3], the addition of part-of-speech tags [4], detecting grammatical patterns [5], and word-ontology-driven natural language processing (NLP) [6]; however, these feature spaces still contain less information than the original text, as the full structure and context of the text is not preserved. Additionally, because features may be designed with key domain knowledge, more complicated feature engineering methodologies result in a text understanding solution that is less versatile and may be restricted to a single text domain. While additional information from the text can be turned into features, the burden of engineering features from text still falls upon the human researcher. A researcher implements a feature engineering methodology to extract features they believe will be useful; however, this is ultimately an educated guess. The resulting feature space is an incomplete representation of the text and may be missing valuable information that is not easily identifiable to the researcher.
Deep neural networks provide an alternative approach for text mining tasks and feature extraction. High-level features can be learned automatically, allowing for the removal of human bias in feature engineering and the preservation of more information, as the original data can be used for training. Because features are learned as part of the training process of a deep neural network, researchers are not required to provide any specialized domain knowledge to the network, making this family of approaches language- and task-independent. Instead, large volumes (potentially petabytes) of data are leveraged to train the network through repeated examples.
Convolutional neural networks (CNNs) are a family of neural networks that have been shown to be among the best solutions for computer vision tasks. Krizhevsky et al. [7] developed a deep CNN that outperformed all previously existing approaches for image object recognition on the ImageNet dataset, due to its ability to detect high-level, abstract features. Additionally, CNNs have recently been demonstrated to learn high-level text concepts from character-level representations of text [1], in a manner similar to how they learn features from and classify images. CNNs do not need any prior knowledge of the data to train a classifier, as complex features can be learned automatically when training the network. This makes them well suited to text classification, since they do not need any prior knowledge of language. Starting from raw, character-level text data, abstract language concepts, including words, syntax, grammar and semantic similarities, are learned automatically. Zhang et al. [1] demonstrated that deep CNNs using character-level representations of text outperform approaches using higher-level, human-generated text representations such as bag-of-words. As the network is trained from raw text, no information is lost in constructing features. Also, since the network can be trained from character-level text representations, data dimensionality is less of an issue, as there are far fewer commonly used characters than words.
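As a rough illustration of such a character-level representation (the alphabet and instance length below are placeholders chosen for this sketch, not the settings of [1]), each character position can be mapped to a row containing a single 1 at that character's alphabet index, producing one matrix per text instance:

import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "  # illustrative alphabet
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot_encode(text, max_length=140):
    # One row per character position, one column per alphabet character.
    matrix = np.zeros((max_length, len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_length]):
        idx = char_to_index.get(ch)
        if idx is not None:  # characters outside the alphabet remain all-zero
            matrix[pos, idx] = 1.0
    return matrix

x = one_hot_encode("this movie was great")  # shape (140, 37)

The width of each row grows linearly with the alphabet size, which is the property our more compact embedding is designed to avoid.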
In this paper, we investigate the use of our new embedding for character-level representation in text classification tasks with CNNs, along with the network design considerations that arise from adopting this embedding. First, we explain our new character embedding approach and demonstrate that it greatly reduces memory use and network training time, because it greatly reduces the size of the initial input received by the network. We show that it outperforms the previous character embedding for the task of binary tweet sentiment classification, i.e. determining whether tweets convey positive or negative sentiment. We also show that our character embedding can employ a larger alphabet at little to no additional cost, further enhancing performance, as training time scales logarithmically with alphabet size instead of linearly. Furthermore, we explore the network design implications of using our new embedding. Namely, our embedding results in matrix representations of text instances where one dimension is much greater than the other. In this scenario, convolutional layers with padding are required to allow deeper networks to be trained.
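One way to picture an encoding whose cost grows logarithmically rather than linearly with alphabet size is to replace each character's one-hot row with a fixed-width binary code, as in the Python sketch below; this is only an illustration of the general idea (the alphabet and instance length are again placeholders), and our actual embedding is detailed in the "Character-level text representations" section.

import math
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "    # illustrative alphabet
code_width = math.ceil(math.log2(len(alphabet) + 1))   # +1 reserves code 0 for unknown characters
char_to_index = {ch: i + 1 for i, ch in enumerate(alphabet)}

def binary_encode(text, max_length=140):
    # One row per character position; each row is the character's index written in binary.
    matrix = np.zeros((max_length, code_width), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_length]):
        idx = char_to_index.get(ch, 0)                  # 0 = character not in alphabet
        matrix[pos, :] = [(idx >> b) & 1 for b in range(code_width)]
    return matrix

x = binary_encode("this movie was great")  # shape (140, 6) instead of (140, 37)

The resulting instance matrix is tall and narrow (many character rows but only a handful of columns), which is the shape property that motivates the use of padded convolutional layers in deeper networks.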
This paper demonstrates the effectiveness of a new character embedding designed for training CNNs and explores how neural network design may be impacted by its adoption. To the best of our knowledge, text classification using character-level representations and deep neural networks has previously been investigated by only one research group [1], and we are the first to propose a more compact character embedding and to investigate its implications for neural network design. We show that our character embedding greatly reduces computational costs and training time, and improves classification performance. Additionally, we show that using padded convolutional layers allows our embedding to be used with networks of arbitrary depth, and that the use of padding does not negate the benefits of our character embedding. Thus, our proposed character embedding can be adapted to any big data domain where high-level understanding of text is required, such as sentiment analysis, webpage ontology and topic classification.
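To see why padding matters for depth, consider the following back-of-the-envelope calculation (kernel sizes and depths here are hypothetical, not our final architecture): without padding, each convolutional layer shrinks the narrow dimension of the input matrix by kernel_size - 1, so only a few layers fit before that dimension is exhausted, whereas "same" padding keeps it fixed and places no limit on depth.

def width_after(width, kernel, layers, padded):
    # Track the size of the narrow dimension through successive convolutional layers.
    for _ in range(layers):
        width = width if padded else width - kernel + 1
        if width < 1:
            return None  # the dimension has collapsed; no further layers are possible
    return width

narrow = 6  # e.g. the column count of the compact encoding sketched above
print(width_after(narrow, kernel=3, layers=2, padded=False))  # 2
print(width_after(narrow, kernel=3, layers=3, padded=False))  # None: too deep without padding
print(width_after(narrow, kernel=3, layers=9, padded=True))   # 6: depth is unconstrained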
The remainder of this paper is organized as follows. “Related works” section provides an overview of related work in text mining and deep learning. “Character-level text representations” section describes our newly created embedding for character-level representation. “Convolutional neural network design” section describes how convolutional neural networks work and design considerations that must be made on account of our new embedding. “General methodology” section provides details on the experimental methodology used to train and evaluate networks. “Experimental results” section presents results for our embedding and the use of padded layers. Finally, conclusions and future work are contained in “Conclusion” section.