Big Data Systems and Techniques (Μ36211P / Μ35211F)

Dinos Arkoumanis


Big Data Systems and Techniques (6 units)

Dr. Dinos Arkoumanis:


Techniques and best practices for developing production Big Data systems using the Parquet and ORC columnar storage formats in Hadoop and the Apache Spark data processing framework with SQL query engines (Spark SQL). Integration with the latest parallel machine learning frameworks. Cloud service technologies such as Amazon EMR. Streaming and real-time processing with Apache Storm and Kafka.

Key Outcomes

After completing the course, students will be able to:

- Set up a Hadoop cluster from scratch
- Set up an Apache Spark cluster from scratch
- Import JSON/CSV data into an Apache Spark cluster and save it to HDFS files in the columnar formats Parquet and ORC, using Java
- Map HDFS files as Hive tables and query them with Spark SQL
- Set up an Apache Spark cluster on Amazon AWS with EMR
- Process data with Apache Spark, feed it to Spark ML machine learning algorithms, and save trained




