M35209F/M36209P - Text Analytics (MSc Data Science)

Ion Androutsopoulos

Description

This course is part of the MSc in Data Science at the Athens University of Economics and Business. The course covers algorithms, models, and systems that can be used to process and extract information from natural language text. 

Course Objectives/Goals

The course is concerned with algorithms, models, and systems that can be used to process and extract information from natural language text. Text analytics methods are used, for example, in sentiment analysis and opinion mining, information extraction from documents, search engines and question answering systems. They are particularly important in corporate information systems, where knowledge is often expressed in natural language (e.g., minutes, reports, regulations, contracts, product descriptions, manuals, patents). Companies also interact with their customers mostly in natural language (e.g., via e-mail, call centers, web pages describing products, blogs and social media).

Upon completion of the course, students will be able to:
1. Describe a wide range of possible applications of Text Analytics in Data Science.
2. Describe Text Analytics algorithms that can be used in Data Science applications.
3. Select and implement appropriate Text Analytics algorithms for particular Data Science applications.
4. Evaluate the effectiveness and efficiency of Text Analytics methods and systems.

Prerequisites/Prior Knowledge

Basic knowledge of calculus, linear algebra, probability theory. For the programming assignments, programming experience in Python is required. An introduction to natural language processing and machine learning libraries (e.g., NLTK, spaCy, scikit-learn, PyTorch) will be provided, and students will have the opportunity to use these libraries in the course’s assignments. For assignments that require training neural networks, cloud virtual machines with GPUs (e.g., in Google’s Colab) can be used.

Course Syllabus

The course comprises ten units of three hours each.

Unit 1: Introduction, n-gram language models

Introduction, course organization, examples of text analytics applications. n-gram language models. Estimating probabilities from corpora. Entropy, cross-entropy, perplexity. Applications in context-aware spelling correction and text generation with beam search decoding.
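As a taste of this unit's material, the following is a minimal plain-Python sketch (the toy corpus and function names are illustrative, not part of the course material) of maximum-likelihood bigram probabilities and per-token perplexity:

```python
import math
from collections import Counter

def bigram_probs(sentences):
    """MLE bigram probabilities P(w2 | w1) from a toy tokenized corpus."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

def perplexity(probs, tokens):
    """Perplexity of one sentence: 2 ** (per-token cross-entropy).
    Unsmoothed, so it assumes every bigram was seen in training."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_sum = sum(math.log2(probs[(w1, w2)])
                  for w1, w2 in zip(padded[:-1], padded[1:]))
    return 2 ** (-log_sum / (len(padded) - 1))

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
probs = bigram_probs(corpus)
```

In practice the course also covers smoothing, which this sketch omits: unseen bigrams here would simply raise a KeyError.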

Unit 2: Text classification with (mostly) linear classifiers

Representing texts as bags of words. Boolean and TF-IDF features. Feature selection and extraction using information gain and SVD. Obtaining word embeddings from PMI scores. Word and text clustering with k-means. Quick recap of text classification with k nearest neighbors, linear and logistic regression, stochastic gradient descent. Evaluating classifiers with precision, recall, F1, ROC AUC. Practical advice and diagnostics for text classification with supervised machine learning.
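To illustrate the bag-of-words representations discussed here, a minimal sketch of one common TF-IDF weighting variant (tf multiplied by log inverse document frequency; the corpus is a toy example):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF vectors for a toy tokenized corpus: tf(t, d) * log(N / df(t)).
    One common variant; the course discusses several weighting schemes."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

docs = [["cheap", "flights", "cheap"],
        ["cheap", "hotels"],
        ["great", "hotels"]]
vectors = tf_idf(docs)
```

Terms that occur in fewer documents (e.g., "flights") receive higher weights than terms spread across the corpus (e.g., "cheap"); libraries like scikit-learn provide production versions of this weighting.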

Units 3 & 4: Text classification with Multi-Layer Perceptrons

Perceptrons, training them with SGD, limitations. Multi-Layer Perceptrons (MLPs) and backpropagation. Dropout, batch and layer normalization. MLPs for text classification, regression, token classification (e.g., for POS tagging, named entity recognition). Pre-training word embeddings, Word2Vec. Advice for training large neural networks.
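The perceptron update rule covered here can be sketched in a few lines of plain Python (the toy sentiment data and names are illustrative only):

```python
def train_perceptron(examples, epochs=10, lr=1.0):
    """Train a perceptron on (feature_vector, label) pairs, labels in {-1, +1}.
    Plain SGD update: w += lr * y * x whenever the example is misclassified."""
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)  # last weight acts as the bias
    for _ in range(epochs):
        for x, y in examples:
            xb = list(x) + [1.0]  # append constant bias feature
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, xb)]
    return w

# Toy bag-of-words sentiment data: features = counts of ["good", "bad"]
data = [([2, 0], 1), ([0, 2], -1), ([1, 0], 1), ([0, 1], -1)]
w = train_perceptron(data)
```

For this linearly separable toy data the learned weights separate documents containing "good" from those containing "bad"; the unit then motivates MLPs by cases where no such separating hyperplane exists.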

Units 5 & 6: Natural language processing with Recurrent Neural Networks

Recurrent neural networks (RNNs), GRUs/LSTMs. Applications in token classification (e.g., named entity recognition). RNN language models. RNNs with self-attention or global max-pooling, and applications in text classification. Bidirectional and stacked RNNs. Obtaining word embeddings from character-based RNNs. Hierarchical RNNs. Sequence-to-sequence RNN models with attention, and applications in machine translation.
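The recurrence at the heart of this unit can be sketched with a single scalar hidden unit (real models use learned weight matrices over vectors; scalars merely keep the recurrence visible):

```python
import math

def rnn_forward(xs, w_xh, w_hh, b_h):
    """Forward pass of a vanilla (Elman) RNN over a sequence of scalar
    inputs, with one hidden unit: h_t = tanh(w_xh*x_t + w_hh*h_{t-1} + b_h)."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h + b_h)
        states.append(h)
    return states

# A single input followed by zeros: the hidden state decays over time,
# illustrating why gated units (GRUs/LSTMs) are needed for long-range memory.
states = rnn_forward([1, 0, 0], w_xh=1.0, w_hh=0.5, b_h=0.0)
```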

Unit 7: Natural language processing with Convolutional Neural Networks

Quick background on convolutional neural networks (CNNs) in computer vision. Text processing with CNNs. Image-to-text generation with CNN encoders and RNN decoders.
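The basic text-CNN feature extractor of this unit, a single convolution filter slid over token embeddings followed by global max-pooling, can be sketched as follows (the toy embeddings and names are illustrative):

```python
def conv1d_max_pool(embeddings, kernel):
    """One 1-D convolution filter slid over a sequence of token embeddings,
    followed by global max-pooling. `kernel` is a list of weight vectors
    covering len(kernel) consecutive tokens."""
    width = len(kernel)
    activations = []
    for i in range(len(embeddings) - width + 1):
        window = embeddings[i:i + width]
        # dot product of the filter with the flattened window of embeddings
        activations.append(sum(k * e
                               for kw, emb in zip(kernel, window)
                               for k, e in zip(kw, emb)))
    return max(activations)  # global max-pooling over positions
```

A real text CNN applies many such filters (of several widths) and feeds the pooled activations to a classifier; here a single filter suffices to show the mechanism.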

Units 8 & 9: Natural language processing with Transformers and Large Language Models

Key-query-value attention, multi-head attention, Transformer encoder and decoder blocks. Pre-trained Transformers and Large Language Models (LLMs): BERT, SMITH, BART, T5, GPT-3, InstructGPT, ChatGPT, and open-source alternatives; fine-tuning and prompting them. Parameter-efficient training, LoRA. Retrieval-augmented generation (RAG), LLMs with tools. Data augmentation for NLP. Adding vision to LLMs: LLaVA, InstructBLIP.
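The key-query-value attention that these units build on is compact enough to sketch with plain lists (softmax(QKᵀ/√d)·V; the toy inputs are illustrative):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention, the core operation of a Transformer
    block: softmax(Q K^T / sqrt(d)) V, written with plain lists for clarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]       # softmax over key positions
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

A query closely aligned with one key yields an output close to that key's value; multi-head attention runs several such maps in parallel over learned projections.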

Unit 10: Introduction to speech recognition and dialog systems

Introduction to automatic speech recognition (ASR) and systems for spoken and written dialogs. Deep learning encoders of speech segments, wav2vec, HuBERT, encoder-decoder and encoder-only ASR models. Dialog system architectures, intent recognition and dialog tracking using neural models, dialog systems based on pretrained LLMs.

Bibliography

There is no required textbook. Extensive notes in the form of slides are provided.

Recommended books:

  • Speech and Language Processing, Daniel Jurafsky and James H. Martin, Pearson Education, 2nd edition, 2009, ISBN-13: 978-0135041963. A draft of the 3rd edition is freely available (https://web.stanford.edu/~jurafsky/slp3/).
  • Deep Learning for Natural Language Processing: A Gentle Introduction, Mihai Surdeanu and Marco A. Valenzuela-Escarcega, Cambridge University Press, 2024, ISBN-13: 978-1316515662. Free draft available (https://clulab.org/gentlenlp/text.html). 
  • Neural Network Methods for Natural Language Processing, Yoav Goldberg, Morgan & Claypool Publishers, 2017, ISBN-13: 978-1627052986.
  • Introduction to Natural Language Processing, Jacob Eisenstein, MIT Press, 2019, ISBN-13: 978-0262042840. Free draft available (https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf). 
  • Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, MIT Press, 1999, ISBN-13: 978-0262133609.
  • Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008, ISBN-13: 978-0521865715. Freely available (http://nlp.stanford.edu/IR-book/information-retrieval-book.html). 

Assessment Methods

In each unit, study exercises are provided (solved and unsolved, some requiring programming); one or two per unit are handed in as assignments. The final grade is the average of the final examination grade (50%) and the grade of the submitted assignments (50%), provided that the final examination grade is at least 5/10. Otherwise, the final grade equals the final examination grade.
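The grading rule above can be written out as a small function (illustrative only, grades on the 0-10 scale):

```python
def final_grade(exam, assignments):
    """Final course grade: 50/50 average of exam and assignments grades,
    but only if the exam grade is at least 5/10; otherwise the exam
    grade alone counts."""
    return (exam + assignments) / 2 if exam >= 5 else exam
```

For example, an exam grade of 6 with an assignments grade of 10 gives a final grade of 8, while an exam grade of 4 gives a final grade of 4 regardless of the assignments.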

Instructors

Instructor: Ion Androutsopoulos. For contact info and office hours, see http://www.aueb.gr/users/ion/contact.html. 

Labs and assignments assistant for full-time students (2024-25): Foivos Charalampakos (phoebuschar at aueb gr).

Labs and assignments assistant for part-time students (2024-25): Manolis Kyriakakis (makyr90 at gmail com).