
Natural Language Processing [ΨΜΑΕ]
(INF394) - Ioannis Pavlopoulos
Course Description
Introduction to natural language processing and applications to digital humanities
Natural language processing with open-source language technology libraries and applications to real world problems, from modelling language in classic literature, to classification and clustering, to information extraction.
Creation Date
Tuesday, 2 October 2018
Instructors
John Pavlopoulos
Assistant Professor (elected) at the Athens University of Economics and Business, Greece
annis@aueb.gr
Bibliography
- S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, 2009. [link]
- D. Jurafsky and J. H. Martin. Speech and Language Processing (3rd ed.). London: Pearson, 2014. [link]
- M. Piotrowski. Natural Language Processing for Historical Texts. Morgan & Claypool, 2012. [Hardcopies exist in library.]
Assessment Methods
Each unit includes study exercises (solved and unsolved, some requiring programming), one or two of which are handed in as assignments. Students are graded on homework assignments (10%) and on written reports of two assignments (30%), which are also examined orally (60%).
Learning Objectives
After successfully completing the course, students will be able to:
- Comprehend basic aspects of natural language processing and explain abstractly how core language technology components work.
- Discern problems in digital humanities that can be solved with natural language processing.
- Apply natural language processing to digital humanities problems.
- Participate in, and even lead, data collection, annotation, and evaluation efforts for language technology applications in digital humanities.
Teaching Methods
One 3-hour lecture per week. Both theoretical and practical aspects are covered, and students write code in class using notebooks (e.g., Jupyter, Colab).
Course Content
The course comprises ten units, each covered in a three-hour lecture. One hour is used to explain basic theoretical aspects and two hours for students to familiarize themselves with open-source libraries and their usage.
Unit 1: Introduction
Theory: Course organization, syllabus, examples of language technology applications, discussion of ethical implications.
Suggested literature: Read Chapter 1 from Bird et al. Browse the proceedings of related workshops (e.g., LaTeCH & SIGHUM) for examples of language technology applications to digital humanities.
Practice:
- Hands-on basic text processing with Python.
- Familiarization with core libraries commonly used in language technology.
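As a first taste of what this hands-on hour looks like, the basic steps of lowercasing, tokenizing, and counting words can be sketched with nothing but the Python standard library (the sample sentence and function names below are illustrative, not part of the course material):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens (letters only)."""
    return re.findall(r"[a-z]+", text.lower())

sample = "It was the best of times, it was the worst of times."
tokens = tokenize(sample)
freqs = Counter(tokens)

print(len(tokens))           # total number of tokens
print(freqs.most_common(2))  # most frequent word types
```

Libraries covered in class (e.g., NLTK) provide richer tokenizers, but the idea is the same: turn raw strings into countable units.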
Unit 2: Corpora, Texts, and Tokens
Theory: Introduction to digital corpora and discussion about their usage. Regular Expressions. Text normalization.
Suggested literature: Chapters 2 and 3 from Bird et al. and Chapter 2 (2.1-2.4) from Jurafsky & Martin.
Practice:
- Download collections and apply regular expressions for text normalization.
- Count word frequencies and detect hapax legomena and stopwords.
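A minimal sketch of these two exercises, using only the standard library (the toy corpus and the crude letters-only normalization are illustrative assumptions):

```python
import re
from collections import Counter

def normalize(text):
    # Crude normalization: lowercase, strip everything but letters and spaces.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

corpus = "The sea! The open sea! The blue, the fresh, the ever free!"
tokens = normalize(corpus)
counts = Counter(tokens)

# Hapax legomena: word types that occur exactly once in the corpus.
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(counts.most_common(1))  # a high-frequency word, a stopword candidate
print(hapaxes)
```

In real collections, the high end of the frequency list is dominated by stopwords, while the long tail of hapaxes often holds names and rare vocabulary.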
Unit 3: Statistical Language Modelling
Theory: N-grams, language modelling, edit-distance and error detection/correction.
Suggested literature: Chapter 2 (2.5) and Chapter 3 from Jurafsky & Martin.
Practice:
- Train a statistical language model and apply it to a task (e.g., authorship).
- Edit distance and language modelling for error detection/correction (e.g., on HTR output or L2 learner texts).
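The edit-distance part of this exercise can be sketched in plain Python with the standard dynamic-programming formulation of Levenshtein distance; the tiny vocabulary and the "typo" below are made up for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# A toy corrector: propose the vocabulary word closest to a misspelling.
vocab = ["language", "luggage", "garage"]
typo = "langage"
best = min(vocab, key=lambda w: edit_distance(typo, w))
print(best)
```

A language model then ranks the candidate corrections in context, which is where the n-gram models from the theory hour come in.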
Unit 4: Data Annotation
Theory: Compiling guidelines, setting up an annotation environment, inter-annotator agreement, best practices.
Suggested literature: Chapter 11 (11.2.2, 11.3.5, 11.4.1, 11.4.2) from Bird et al.
Practice:
- Hands-on common annotation tools.
- Perform a live pilot experiment, with students being both administrators and annotators.
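Inter-annotator agreement from such a pilot can be measured with Cohen's kappa, which corrects raw agreement for chance; a minimal sketch (the two label sequences are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.667
```

In practice, annotation tools and libraries report kappa (and multi-annotator variants such as Fleiss' kappa) directly, but the computation is this simple.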
Unit 5: Visualisation
Theory: Introduction to visualisation libraries, design principles, and practical experience with data visualisation.
Practice:
- Use the libraries to visualise the recently annotated data.
- Try to reproduce an existing visualisation, e.g., one taken from a research paper.
Unit 6: Language Representations
Theory: Text representations based on bags of words; TF-IDF; word embeddings.
Suggested literature: Chapter 6 from Jurafsky & Martin.
Practice:
- Represent texts as bags of words and TF-IDF vectors. Compute similarities.
- Apply word2vec to a corpus and explore the distributional semantics space (figurative language).
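The bag-of-words and TF-IDF part of this practice can be sketched without any external library; the three toy "documents" below are invented for illustration, and in class a vectorizer from an open-source library would do the same work:

```python
import math
from collections import Counter

docs = [
    "the wine dark sea".split(),
    "the rosy fingered dawn".split(),
    "the sea and the dawn".split(),
]

# Inverse document frequency: rare terms get high weight,
# terms in every document (like "the") get weight zero.
n_docs = len(docs)
df = Counter(t for doc in docs for t in set(doc))
idf = {t: math.log(n_docs / df[t]) for t in df}

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

vecs = [tfidf(d) for d in docs]
print(cosine(vecs[0], vecs[2]))  # share the content word "sea"
print(cosine(vecs[0], vecs[1]))  # share only "the", whose idf is 0
```

Note how TF-IDF makes the similarity of the first two documents exactly zero: their only shared word appears in every document, so it carries no weight.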
Unit 7: Classification
Theory: Linear and non-linear classifiers. Evaluation.
Suggested literature: Chapter 6 (6.1.1-6.1.3, 6.2, 6.3) from Bird et al. and Chapter 4 (excluding 4.1-4.6 and 4.10) from Jurafsky & Martin.
Practice:
- Toxicity detection and/or authorship attribution.
- More applications, or introduction to optical character recognition.
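The evaluation side of these classification exercises boils down to precision, recall, and F1 for the class of interest; a minimal sketch in plain Python (the gold/predicted labels below are invented, and in class a library such as scikit-learn would report the same metrics):

```python
def precision_recall_f1(gold, pred, positive="toxic"):
    """Precision, recall, and F1 for one class of interest."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["toxic", "ok", "toxic", "ok", "toxic"]
pred = ["toxic", "toxic", "ok", "ok", "toxic"]
print(precision_recall_f1(gold, pred))
```

Accuracy alone is misleading on imbalanced data such as toxicity corpora, which is why per-class precision and recall are the standard report.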
Unit 8: Contextual Representations
Theory: Introduction to neural networks and language models. Moving from static to contextual word representations.
Suggested literature: Chapters 7 and 9 from Jurafsky & Martin.
Practice:
- Fine-tuning a neural language model for text classification in a few lines of code.
- Extracting word and text representations from large language models.
Unit 9: Clustering
Theory: K-means and introduction to topic modelling.
Practice:
- Use the k-means algorithm to cluster a collection of texts or text images.
- Use Latent Dirichlet Allocation and visualise the topics.
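The k-means part of this practice fits in a short plain-Python sketch: assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. The 2-D points below stand in for (dimensionality-reduced) text vectors and are invented for illustration; initialization with the first k points is a simplifying assumption:

```python
def kmeans(points, k, iters=10):
    """Plain k-means on points given as coordinate tuples."""
    centroids = points[:k]  # deterministic init: first k points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two well-separated blobs, interleaved so the first two points seed one each.
points = [(0.1, 0.2), (5.0, 5.1), (0.0, 0.1), (5.2, 4.9), (0.2, 0.0), (4.9, 5.0)]
clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

Library implementations (e.g., in scikit-learn) add smarter initialization and convergence checks, but the two alternating steps are exactly these.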
Unit 10: Assignment Presentation and/or Invited Speaker