Course: M35209F/M36209P - Text Analytics (MSc Data Science)

Course code: INF312

Course Description

This course is part of the MSc in AI and Data Science at the Athens University of Economics and Business. The course covers algorithms, models, and systems that allow computers to "understand" and generate natural language text, including Large Language Models (LLMs).

    Course Objectives/Goals

    Upon completion of the course, students will be able to:
    1. Describe a wide range of possible applications of Text Analytics in Data Science.
    2. Describe Text Analytics algorithms that can be used in Data Science applications.
    3. Select and implement appropriate Text Analytics algorithms for particular Data Science applications.
    4. Evaluate the effectiveness and efficiency of Text Analytics methods and systems.

    Prerequisites/Prior Knowledge

    Basic knowledge of calculus, linear algebra, and probability theory is assumed. For the programming assignments, programming experience in Python is required. An introduction to natural language processing and machine learning libraries (e.g., NLTK, spaCy, scikit-learn, PyTorch) will be provided, and students will have the opportunity to use these libraries in the course’s assignments. For assignments that require training neural networks, cloud virtual machines with GPUs (e.g., in Google’s Colab) can be used.

    Course Syllabus

    The course comprises ten units of three hours each.

    Unit 1: Introduction, n-gram language models

    Introduction, course organization, examples of text analytics applications. n-gram language models. Estimating probabilities from corpora. Entropy, cross-entropy, perplexity. Applications in context-aware spelling correction and text generation with beam search decoding.
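
    As a minimal sketch of these ideas in plain Python: bigram probabilities estimated from a hypothetical two-sentence corpus with add-1 (Laplace) smoothing, and the perplexity of a test sentence. The corpus and the smoothing choice are illustrative only.

        import math
        from collections import Counter

        # Toy corpus; in practice probabilities are estimated from large corpora.
        corpus = [["<s>", "the", "cat", "sat", "</s>"],
                  ["<s>", "the", "dog", "sat", "</s>"]]

        unigrams = Counter(w for sent in corpus for w in sent)
        bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
        V = len(unigrams)  # vocabulary size, for add-1 (Laplace) smoothing

        def bigram_prob(w1, w2):
            # P(w2 | w1) with add-1 smoothing, so unseen bigrams get non-zero probability.
            return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

        def perplexity(sentence):
            # Perplexity = 2 ** (cross-entropy of the model on the sentence).
            log_prob = sum(math.log2(bigram_prob(sentence[i], sentence[i + 1]))
                           for i in range(len(sentence) - 1))
            return 2 ** (-log_prob / (len(sentence) - 1))

        print(perplexity(["<s>", "the", "dog", "sat", "</s>"]))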

    Unit 2: Text classification with (mostly) linear classifiers

    Representing texts as bags of words. Boolean and TF-IDF features. Feature selection and extraction using information gain and SVD. Obtaining word embeddings from PMI scores. Word and text clustering with k-means. Quick recap of text classification with k nearest neighbors, linear and logistic regression, stochastic gradient descent. Evaluating classifiers with precision, recall, F1, ROC AUC. Practical advice and diagnostics for text classification with supervised machine learning.
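
    A minimal sketch with scikit-learn, tying several of these topics together: TF-IDF bag-of-words features, a logistic regression classifier, and precision/recall/F1 evaluation. The tiny dataset is hypothetical; the assignments use real corpora.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import classification_report

        # Hypothetical toy sentiment dataset (1 = positive, 0 = negative).
        train_texts = ["great movie", "terrible plot", "loved it", "boring film"]
        train_labels = [1, 0, 1, 0]
        test_texts, test_labels = ["great plot", "boring movie"], [1, 0]

        vectorizer = TfidfVectorizer()                    # bag-of-words with TF-IDF weights
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(test_texts)

        clf = LogisticRegression().fit(X_train, train_labels)
        print(classification_report(test_labels, clf.predict(X_test)))  # precision, recall, F1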

    Units 3 & 4: Text classification with Multi-Layer Perceptrons

    Multi-Layer Perceptrons (MLPs), computation graphs, backpropagation. Dropout, batch and layer normalization. MLPs for text classification, regression, token classification (e.g., for named entity recognition). Pre-training word embeddings, Word2Vec. Advice for training large neural networks.
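
    A minimal PyTorch sketch of an MLP text classifier. It assumes each text has already been mapped to a fixed-size vector (e.g., the average of pre-trained word embeddings); all dimensions are illustrative.

        import torch
        import torch.nn as nn

        class MLPClassifier(nn.Module):
            def __init__(self, input_dim=300, hidden_dim=128, num_classes=2):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(input_dim, hidden_dim),
                    nn.ReLU(),
                    nn.Dropout(0.3),                      # dropout regularization
                    nn.Linear(hidden_dim, num_classes),
                )

            def forward(self, x):
                return self.net(x)                        # class logits

        model = MLPClassifier()
        x = torch.randn(8, 300)                           # a batch of 8 "text" vectors
        loss = nn.CrossEntropyLoss()(model(x), torch.randint(0, 2, (8,)))
        loss.backward()                                   # backpropagation through the computation graph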

    Units 5 & 6: Natural language processing with Recurrent Neural Networks

    Recurrent neural networks (RNNs), GRUs/LSTMs. RNN language models. RNNs with self-attention or global max-pooling, and applications in text classification. Bidirectional and stacked RNNs. Obtaining word embeddings from character-based RNNs. Hierarchical RNNs. Encoder-decoder RNN models with attention, and applications in machine translation.
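
    A minimal PyTorch sketch of a bidirectional LSTM text classifier with global max-pooling over the hidden states; the vocabulary size and dimensions are illustrative placeholders.

        import torch
        import torch.nn as nn

        class LSTMClassifier(nn.Module):
            def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_classes=2):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden_dim, num_classes)

            def forward(self, token_ids):
                h, _ = self.lstm(self.emb(token_ids))     # (batch, seq_len, 2 * hidden_dim)
                pooled = h.max(dim=1).values              # global max-pooling over time steps
                return self.out(pooled)                   # class logits

        model = LSTMClassifier()
        logits = model(torch.randint(0, 10000, (4, 20)))  # a batch of 4 token-id sequences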

    Unit 7: Natural language processing with Convolutional Neural Networks

    Quick background on convolutional neural networks (CNNs) in Computer Vision. Text processing with CNNs. Image-to-text generation with CNN encoders and RNN decoders.
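
    A minimal PyTorch sketch of a CNN text classifier: 1D convolutions slide over word embeddings, capturing n-gram-like features, followed by max-pooling over time; sizes are illustrative.

        import torch
        import torch.nn as nn

        class CNNClassifier(nn.Module):
            def __init__(self, vocab_size=10000, emb_dim=100, num_filters=64,
                         kernel_size=3, num_classes=2):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size)
                self.out = nn.Linear(num_filters, num_classes)

            def forward(self, token_ids):
                x = self.emb(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
                x = torch.relu(self.conv(x))              # n-gram-like feature maps
                pooled = x.max(dim=2).values              # max-pooling over time
                return self.out(pooled)                   # class logits

        model = CNNClassifier()
        logits = model(torch.randint(0, 10000, (4, 20)))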

    Units 8 & 9: Natural language processing with Transformers and Large Language Models

    Transformer encoders, BERT. Encoder-decoder Transformers, BART, T5. Decoder-only Transformers, GPT-x. Prompting, supervised fine-tuning, RLHF, DPO. Parameter-efficient training, LoRA. Retrieval-augmented generation (RAG), LLMs with tools, agents, ReAct. Adding vision to LLMs, LLaVA, InstructBLIP. Data augmentation for NLP.
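
    A minimal sketch of using pre-trained Transformers through the Hugging Face transformers library: one encoder-based classifier and one decoder-only generator. The models downloaded by pipeline() below are just convenient public examples.

        from transformers import pipeline

        # Encoder-based (BERT-like) classifier fine-tuned for sentiment analysis.
        classifier = pipeline("sentiment-analysis")
        print(classifier("Text analytics with Transformers is fun."))

        # Decoder-only Transformer (GPT-2) used for text generation.
        generator = pipeline("text-generation", model="gpt2")
        print(generator("Large language models can", max_new_tokens=20))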

    Unit 10: Introduction to speech recognition and dialog systems

    Introduction to automatic speech recognition (ASR). Deep learning encoders of speech segments, wav2vec, HuBERT, encoder-decoder and encoder-only ASR models. Dialog system architectures, intent recognition and dialog tracking using neural models, dialog systems based on LLMs.
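
    A minimal ASR sketch using a pre-trained wav2vec 2.0 model through the Hugging Face transformers pipeline; "recording.wav" is a hypothetical audio file.

        from transformers import pipeline

        # wav2vec 2.0 fine-tuned for English speech recognition (CTC decoding).
        asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
        print(asr("recording.wav")["text"])   # transcription of the (assumed) recording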

    Bibliography

    There is no required textbook. Extensive notes in the form of slides are provided.

    Recommended books:

    • Speech and Language Processing, Daniel Jurafsky and James H. Martin, Pearson Education, 2nd edition, 2009, ISBN-13: 978-0135041963. A draft of the 3rd edition is freely available (https://web.stanford.edu/~jurafsky/slp3/).
    • Deep Learning for Natural Language Processing: A Gentle Introduction, Mihai Surdeanu and Marco A. Valenzuela-Escarcega, Cambridge University Press, 2024, ISBN-13: 978-1316515662. Free draft available (https://clulab.org/gentlenlp/text.html).
    • Neural Network Methods for Natural Language Processing, Yoav Goldberg, Morgan & Claypool Publishers, 2017, ISBN-13: 978-1627052986.

    Assessment Methods

    In each unit, study exercises are provided (solved and unsolved, some requiring programming), one or two of which per unit are handed in as assignments. The final grade is the average of the final examination grade (50%) and the grade of the submitted assignments (50%), provided that the final examination grade is at least 5/10; otherwise, the final grade equals the final examination grade.
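
    Expressed as a small (hypothetical) Python function, the grading rule is:

        def final_grade(exam, assignments):
            # 50/50 average of exam and assignments grades, but only if the
            # exam grade is at least 5/10; otherwise the exam grade alone counts.
            return (exam + assignments) / 2 if exam >= 5 else exam

        print(final_grade(6, 8))    # 7.0
        print(final_grade(4, 10))   # 4 (exam grade below 5/10)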

    Instructors

    Instructor: Ion Androutsopoulos. For contact info and office hours, see http://www.aueb.gr/users/ion/contact.html. 

    Labs and assignments assistant for full-time students (2025-26): Foivos Charalampakos (phoebuschar at aueb gr).

    Labs and assignments assistant for part-time students (2025-26): Manolis Kyriakakis (makyr90 at gmail com). 
