M35209F/M36209P - Text Analytics

Ion Androutsopoulos

Description

This course is part of the MSc in Data Science of the Department of Informatics, Athens University of Economics and Business. The course covers algorithms, models, and systems that can be used to process and extract information from natural language text. 

Course Objectives/Goals

The course is concerned with algorithms, models, and systems that can be used to process and extract information from natural language texts. Text analytics methods are used, for example, in sentiment analysis and opinion mining, information extraction from documents, search engines, and question answering systems. They are particularly important in corporate information systems, where knowledge is often expressed in natural language (e.g., minutes, reports, regulations, contracts, product descriptions, manuals, patents). Companies also interact with their customers mostly in natural language (e.g., via e-mail, call centers, web pages describing products, blogs, and social media).

Prerequisites/Prior Knowledge

Basic knowledge of calculus, linear algebra, and probability theory is assumed. Programming experience is required for the programming assignments; students may implement them in any language, but Python is strongly recommended. An introduction to natural language processing and machine learning libraries (e.g., NLTK, scikit-learn, Keras, PyTorch) will be provided, and students will have the opportunity to use these libraries in the course’s assignments. For assignments that require training neural networks, cloud virtual machines with GPUs (e.g., in Google’s Colab) can be used.

Course Syllabus

The course comprises ten units of three hours each.

Unit 1: Introduction, n-gram language models, spelling correction, text normalization

Introduction, course organization, examples of text analytics applications. n-gram language models. Estimating probabilities from corpora. Entropy, cross-entropy, perplexity. Edit distance. Applications in context-sensitive spelling correction and text normalization.
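To make the unit’s core idea concrete, the sketch below trains a bigram language model with add-one (Laplace) smoothing and computes perplexity on a toy corpus; the corpus, function names, and boundary tokens are illustrative, not course-provided code:

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams, adding <s>/</s> boundary pseudo-tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Laplace (add-one) smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sentences, unigrams, bigrams):
    vocab_size = len(unigrams)
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w_prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log2(
                bigram_prob(w_prev, w, unigrams, bigrams, vocab_size))
            n_tokens += 1
    # Perplexity = 2 ** cross-entropy, where cross-entropy is the
    # average negative log2-probability the model assigns per token.
    return 2 ** (-log_prob / n_tokens)

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_lm(train)
print(perplexity([["the", "cat", "sat"]], unigrams, bigrams))
```

Lower perplexity means the model assigns higher probability to the test text, which is the sense in which one n-gram model is “better” than another.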

Units 2 & 3: Text classification with (mostly) linear classifiers

Representing texts as bags of words. Boolean and TF-IDF features. Feature selection and extraction using information gain and SVD. Text classification with k nearest neighbors and Naive Bayes. Obtaining word embeddings from PMI scores. Word and text clustering with k-means. Linear and logistic regression, stochastic gradient descent. Lexicon-based features. Constructing and using sentiment lexica. Practical advice and diagnostics for text classification with supervised machine learning.
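As an illustration of the bag-of-words pipeline of these units, the sketch below trains a linear (logistic regression) classifier on TF-IDF features with scikit-learn; the tiny sentiment dataset is invented purely for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment dataset, invented for illustration only.
texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words representation with TF-IDF weights, fed to a
# linear classifier (logistic regression).
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["boring movie"]))  # likely [0] on this toy data
```

Replacing LogisticRegression with scikit-learn’s SGDClassifier in the same pipeline trains the linear model with stochastic gradient descent, as discussed in these units.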

Units 4 & 5: Text classification with Multi-Layer Perceptrons

Natural and artificial neural networks. Perceptrons, training them with SGD, and their limitations. Multi-Layer Perceptrons (MLPs) and backpropagation. Dropout. MLPs for text classification, regression, and window-based sequence labeling (e.g., for POS tagging, named entity recognition). Pre-training word embeddings with Word2Vec or FastText.
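A minimal PyTorch sketch of an MLP text classifier over averaged word embeddings, of the kind covered in these units; the layer sizes, class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class TextMLP(nn.Module):
    """MLP classifier over averaged word embeddings."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=64, n_classes=2):
        super().__init__()
        # Could be initialized from pre-trained Word2Vec/FastText vectors.
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),  # dropout, as discussed in these units
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.emb(token_ids).mean(dim=1)  # average the word embeddings
        return self.mlp(x)                   # logits: (batch, n_classes)

model = TextMLP(vocab_size=5000)
batch = torch.randint(0, 5000, (8, 20))  # 8 dummy texts of 20 token ids each
print(model(batch).shape)                 # torch.Size([8, 2])
```

In practice the embedding layer would usually be initialized with pre-trained Word2Vec or FastText vectors rather than trained from scratch.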

Units 6 & 7: Natural language processing with Recurrent Neural Networks

Recurrent neural networks (RNNs), GRUs/LSTMs. Applications in POS tagging and named entity recognition. RNN language models. RNNs with self-attention and applications in text classification. Bidirectional and stacked RNNs. Obtaining word embeddings from character-based RNNs. Variational dropout. Hierarchical RNNs for text classification and sequence labeling. Sequence-to-sequence RNN models with attention, and applications in machine translation. Universal sentence encoders, LASER. Pretraining language models, context-aware embeddings, ELMo.
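A minimal PyTorch sketch of a bidirectional LSTM sequence labeler of the kind used in these units for POS tagging or named entity recognition; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional LSTM for sequence labeling (e.g., POS tagging, NER)."""
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # One tag-score vector per token; 2*hidden_dim because the
        # forward and backward states are concatenated.
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids):                   # (batch, seq_len)
        states, _ = self.lstm(self.emb(token_ids))  # (batch, seq_len, 2*hidden)
        return self.out(states)                     # (batch, seq_len, n_tags)

tagger = BiLSTMTagger(vocab_size=5000, n_tags=17)
print(tagger(torch.randint(0, 5000, (4, 12))).shape)  # torch.Size([4, 12, 17])
```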

Units 8 & 9: Natural language processing with Convolutional Neural Networks and Transformers

Convolutional neural networks (CNNs) and applications in NLP. Image-to-text generation with CNN encoders and RNN decoders. Key-value attention, multi-head attention, Transformers, BERT.
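The core operation behind Transformers is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal PyTorch sketch follows; in real Transformers, Q, K, V come from learned linear projections of the token representations, and multiple attention heads run in parallel:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., len_q, len_k)
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ V

# Self-attention: queries, keys, and values all come from the same sequence.
x = torch.randn(1, 5, 64)  # one sequence of 5 token vectors of dimension 64
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 5, 64])
```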

Unit 10: Parsing and relation extraction

Grammars, phrase structure trees, dependency trees. Transition-based and graph-based dependency parsing with deep learning. Relation extraction with deep learning, including graph convolutions on parse trees. Joint parsing and relation extraction.
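To illustrate transition-based dependency parsing, the sketch below replays a given arc-standard transition sequence; in an actual parser a trained classifier (e.g., an MLP over stack and buffer features) would select each transition. The transition names, example sentence, and gold sequence are illustrative:

```python
# Arc-standard transition system: SHIFT moves the next buffer word onto
# the stack; LEFT-ARC / RIGHT-ARC attach the top two stack items.

def parse(words, transitions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":        # top of stack heads the item below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":       # item below the top heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs  # list of (head_index, dependent_index) pairs

words = ["she", "eats", "fish"]
print(parse(words, ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]))
# [(1, 0), (1, 2)]: "eats" heads both "she" and "fish"
```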


Bibliography

There is no required textbook. Extensive notes in the form of slides are provided.

The course is mainly based on the books:
- Speech and Language Processing, by D. Jurafsky and J.H. Martin, 2nd edition, Pearson, 2009. The 3rd edition (in preparation) is freely available (http://web.stanford.edu/~jurafsky/slp3/).
- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan & Claypool, 2017.

Both books can be found at AUEB’s library.

Assessment Methods

In each unit, study exercises are provided (some solved, some unsolved, some requiring programming); one or two exercises per unit are handed in as assignments. Students are graded on class participation (10%), the assignments (45%), and their performance in the final exam (45%).

Instructors

Instructor: Ion Androutsopoulos (http://www.aueb.gr/users/ion/contact.html)

Labs and assignments assistant for full-time students: Vasiliki Kougia (vasilikikou at gmail com).

Labs and assignments assistant for part-time students: Manolis Kyriakakis (kiriakakis16 at aueb gr).

Calendar