M35209F/M36209P - Text Analytics (MSc Data Science)

Ion Androutsopoulos

Description

This course is part of the MSc in Data Science at the Athens University of Economics and Business. It covers algorithms, models, and systems that can be used to process and extract information from natural language text.

Course Objectives/Goals

Text analytics methods are used, for example, in sentiment analysis and opinion mining, information extraction from documents, search engines, and question answering systems. They are particularly important in corporate information systems, where knowledge is often expressed in natural language (e.g., minutes, reports, regulations, contracts, product descriptions, manuals, patents). Companies also interact with their customers mostly in natural language (e.g., via e-mail, call centers, web pages describing products, blogs, and social media).

Upon completion of the course, students will be able to:
1. Describe a wide range of possible applications of Text Analytics in Data Science.
2. Describe Text Analytics algorithms that can be used in Data Science applications.
3. Select and implement appropriate Text Analytics algorithms for particular Data Science applications.
4. Evaluate the effectiveness and efficiency of Text Analytics methods and systems.

Prerequisites/Prior Knowledge

Basic knowledge of calculus, linear algebra, and probability theory is assumed. For the programming assignments, programming experience in Python is required. An introduction to natural language processing and machine learning libraries (e.g., NLTK, spaCy, scikit-learn, TensorFlow/Keras, or PyTorch) will be provided, and students will have the opportunity to use these libraries in the course’s assignments. For assignments that require training neural networks, cloud virtual machines with GPUs (e.g., on Google Colab) can be used.

Course Syllabus

The course comprises ten units of three hours each.

Unit 1: Introduction, n-gram language models

Introduction, course organization, examples of text analytics applications. n-gram language models. Estimating probabilities from corpora. Entropy, cross-entropy, perplexity. Applications in context-aware spelling correction and text generation with beam search decoding.
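
As a taste of this unit's material, here is a minimal Python sketch (not the course's reference implementation) of a Laplace-smoothed bigram language model and the perplexity it assigns to a sentence; the toy corpus and sentence-boundary markers are invented for the example.

    import math
    from collections import Counter

    # Toy corpus with sentence-boundary pseudo-tokens; assignments would use a real corpus.
    corpus = [["<s>", "the", "cat", "sat", "</s>"],
              ["<s>", "the", "dog", "sat", "</s>"]]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    V = len(unigrams)  # vocabulary size, used for add-one (Laplace) smoothing

    def bigram_prob(a, b):
        # P(b | a) with add-one smoothing, estimated from corpus counts.
        return (bigrams[(a, b)] + 1) / (unigrams[a] + V)

    def perplexity(sentence):
        # Perplexity = 2 ** cross-entropy, with cross-entropy in bits per bigram.
        log_prob = sum(math.log2(bigram_prob(a, b))
                       for a, b in zip(sentence, sentence[1:]))
        return 2 ** (-log_prob / (len(sentence) - 1))

    print(perplexity(["<s>", "the", "cat", "sat", "</s>"]))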

Units 2 & 3: Text classification with (mostly) linear classifiers

Representing texts as bags of words. Boolean and TF-IDF features. Feature selection and extraction using information gain and singular value decomposition (SVD). Text classification with k-nearest neighbors and Naive Bayes. Obtaining word embeddings from pointwise mutual information (PMI) scores. Word and text clustering with k-means. Linear and logistic regression, stochastic gradient descent. Evaluating classifiers with precision, recall, F1, and ROC AUC. Practical advice and diagnostics for text classification with supervised machine learning.
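
The following minimal scikit-learn sketch combines several of these ideas (TF-IDF features, a logistic regression classifier, and precision/recall/F1 evaluation); the tiny sentiment dataset is invented for illustration, and a real corpus would be used in the assignments.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.pipeline import make_pipeline

    # Invented toy sentiment data (1 = positive, 0 = negative).
    train_texts = ["great movie, loved it", "terrible plot, awful acting",
                   "wonderful and moving", "boring and bad"]
    train_labels = [1, 0, 1, 0]

    # Bag-of-words TF-IDF features feeding a logistic regression classifier.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # Precision, recall, and F1 on (toy) test data.
    test_texts = ["loved it, wonderful", "awful and boring"]
    print(classification_report([1, 0], model.predict(test_texts)))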

Units 4 & 5: Text classification with Multi-Layer Perceptrons

Perceptrons, training them with SGD, and their limitations. Multi-Layer Perceptrons (MLPs) and backpropagation. Dropout, batch and layer normalization. MLPs for text classification, regression, and window-based sequence labelling (e.g., for POS tagging, named entity recognition). Pre-training word embeddings, Word2Vec. Advice for training large neural networks.
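
As a sketch of what these units cover, here is a minimal TensorFlow/Keras MLP text classifier with dropout; the random feature matrix simply stands in for real bag-of-words or TF-IDF vectors, and all sizes are invented.

    import numpy as np
    import tensorflow as tf

    # Stand-in data: 100 "documents" as random 5000-dimensional feature vectors.
    vocab_size = 5000
    X_train = np.random.rand(100, vocab_size).astype("float32")
    y_train = np.random.randint(0, 2, size=100)  # binary labels

    # A small MLP for binary text classification, with dropout for regularization.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(vocab_size,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=3, batch_size=32)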

Units 6 & 7: Natural language processing with Recurrent Neural Networks

Recurrent neural networks (RNNs), GRUs/LSTMs. Applications in token classification (e.g., POS tagging, named entity recognition). RNN language models. RNNs with self-attention and applications in text classification. Bidirectional and stacked RNNs. Obtaining word embeddings from character-based RNNs. Hierarchical RNNs for text classification and token classification. Sequence-to-sequence RNN models with attention, and applications in machine translation.
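
A minimal PyTorch sketch of a bidirectional LSTM token classifier of the kind discussed in these units (e.g., for POS tagging); the vocabulary size, tag-set size, and dimensions are invented.

    import torch
    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        # Embeds tokens, runs a bidirectional LSTM, and scores a tag per token.
        def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, num_tags)  # 2x: both directions

        def forward(self, token_ids):                    # (batch, seq_len)
            states, _ = self.lstm(self.embed(token_ids))
            return self.out(states)                      # (batch, seq_len, num_tags)

    # Toy usage: 2 "sentences" of 6 random token ids each.
    model = BiLSTMTagger(vocab_size=10000, num_tags=17)
    print(model(torch.randint(0, 10000, (2, 6))).shape)  # torch.Size([2, 6, 17])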

Units 8 & 9: Natural language processing with Convolutional Neural Networks and Transformers

Quick background on convolutional neural networks (CNNs) in computer vision. Text processing with CNNs. Key-query-value attention, multi-head attention, Transformer encoders and decoders. Pre-trained Transformers and Large Language Models (LLMs), such as BERT, SMITH, BART, T5, GPT-3, InstructGPT, and ChatGPT; fine-tuning and prompting them. Retrieval-augmented generation (RAG), LLMs with tools.
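
Since key-query-value attention is central to these units, here is a minimal PyTorch sketch of scaled dot-product attention, the building block of Transformer encoders and decoders; tensor sizes are invented.

    import torch

    def attention(Q, K, V):
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
        d_k = Q.size(-1)
        weights = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
        return weights @ V

    # Self-attention over 1 "sentence" of 4 token vectors of dimension 8.
    x = torch.randn(1, 4, 8)
    print(attention(x, x, x).shape)  # torch.Size([1, 4, 8])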

Unit 10: Introduction to speech recognition and dialog systems

Introduction to automatic speech recognition (ASR) and systems for spoken and written dialogs. Deep learning encoders of speech segments, wav2vec, HuBERT, encoder-decoder and encoder-only ASR models. Dialog system architectures, intent recognition and dialog state tracking using neural models, dialog systems based on pre-trained LLMs.
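
As a small illustration of pre-trained speech encoders, the following sketch transcribes an audio file with a wav2vec 2.0 model via the Hugging Face transformers library (an assumption here, since the course description does not name the library); the audio file name is hypothetical.

    from transformers import pipeline  # Hugging Face transformers

    # A pre-trained wav2vec 2.0 encoder fine-tuned for English ASR.
    asr = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-base-960h")
    print(asr("recording.wav"))  # hypothetical audio file; returns {'text': "..."}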

Bibliography

There is no required textbook. Extensive notes in the form of slides are provided.

Recommended books:
- Speech and Language Processing, Daniel Jurafsky and James H. Martin. Pearson Education, 2nd edition, 2009, ISBN-13: 978-0135041963. See also the 3rd edition (in preparation): https://web.stanford.edu/~jurafsky/slp3/. 
- Neural Network Methods for Natural Language Processing, Yoav Goldberg. Morgan & Claypool Publishers, 2017, ISBN-13: 978-1627052986. Available at AUEB's library.
- Introduction to Natural Language Processing, Jacob Eisenstein. MIT Press, 2019, ISBN-13: 978-0262042840. Draft online: https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf
- Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze. MIT Press, 1999, ISBN-13: 978-0262133609. Available at AUEB's library.
- Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press, 2008, ISBN-13: 978-0521865715. Free online: http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Assessment Methods

In each unit, study exercises (solved and unsolved, some requiring programming) are provided, of which one or two per unit are handed in as assignments. The final grade is the average of the final examination grade (50%) and the grade of the submitted assignments (50%), provided that the final examination grade is at least 5/10; otherwise, the final grade equals the final examination grade.
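
For concreteness, here is a minimal sketch of the grading rule stated above (grades on a 0-10 scale):

    def final_grade(exam, assignments):
        # 50/50 average, but only if the exam grade is at least 5/10;
        # otherwise the exam grade alone counts.
        return (exam + assignments) / 2 if exam >= 5 else exam

    print(final_grade(6, 9))  # 7.5
    print(final_grade(4, 9))  # 4 (assignments do not count)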

Instructors

Instructor: Ion Androutsopoulos. For contact info and office hours, see http://www.aueb.gr/users/ion/contact.html. 

Labs and assignments assistant for full-time students (2023-24): Foivos Charalampakos (phoebuschar at aueb gr).

Labs and assignments assistant for part-time students (2023-24): Manolis Kyriakakis (makyr90 at gmail com).