Big Data Systems and Techniques (Μ36211P / Μ35211F), 6 units
Dr. Dinos Arkoumanis: arkoumanis.dinos@gmail.com
Overview
Techniques and best practices for developing production Big Data systems: Parquet and ORC columnar storage files in Hadoop, the Apache Spark data processing framework with SQL query engines (Spark SQL), integration with modern parallel machine learning frameworks, cloud services such as Amazon EMR, and streaming/real-time processing with Apache Storm and Kafka.
Key Outcomes
After completing the course, students will be able to:
- Set up a Hadoop cluster from scratch
- Set up an Apache Spark cluster from scratch
- Import JSON/CSV data into an Apache Spark cluster and save it to HDFS files in the columnar formats Parquet and ORC, using Java
- Map HDFS files as Hive tables and query them with Spark SQL
- Set up an Apache Spark cluster on Amazon AWS with EMR
- Process data with Apache Spark, feed it to Spark ML machine learning algorithms, and save the trained models
- Use Apache Kafka to save data-stream feeds to multiple storage systems
- Use Apache Storm for real-time processing of data-stream feeds
Requirements and Prerequisites
In this course you will use the $300 of free credits provided by Google Cloud. To attend the class, create a new Google account (or use your current Google account if you have never used the free cloud services) and go to https://cloud.google.com/free/ to activate your free Google Cloud account. The free credits last for a year. Google will require a credit/debit card; it is safe to provide one, and you will not be charged even if you exhaust your free credits.
This course is hands-on, and students will be evaluated by a final hands-on project. The course does not assume any prior experience with Apache Spark, Hadoop, or any other software that will be presented. However, basic knowledge of programming and computer science concepts is required, and good knowledge of Python, Java, and SQL is necessary. Students will need to bring their laptops to class in order to try out the presented material interactively.
Required Course Materials: There is no required textbook. All course materials will be provided in class and will be available for download. The course is coordinated through a wiki in Bitbucket.
Books: There are many books on the subject; the following selection provides a good foundation for those students who wish to delve deeper into the topics discussed in class:
- Fast Data Processing with Spark, 2nd Edition. Krishna Sankar, Holden Karau. Packt Publishing.
- Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. O'Reilly.
Grading
Students will be graded on a final project. Each student will work alone on the final project. The project will have a set of requirements, and each requirement, if successfully delivered, will contribute 10% to 50% of the final score. Work copied from other students will be rejected automatically. The course has no exams, but students will present their projects to the class.
Participation
All lectures will require the use of your laptop.
Attendance Requirements
This is a hands-on course; there is no point in taking it if you do not plan to attend. Students are responsible for keeping up with the course material, including lectures, from the first day of class onward. It is the student's obligation to catch up on any missed coursework.