Learning Outcomes

Recent advances in technology have led to rapid growth of big data. This has created a need for cost-efficient and scalable analysis algorithms. In this course, concepts for the scalable analysis of big data sets will be presented and applied using open-source technologies. Participants of this module will gain an in-depth understanding of concepts and methods as well as practical experience in the area of scalable data science. The course is principally designed to impart: technical skills (50%), method skills (30%), system skills (10%), and social skills (10%).


Content

The module focuses on mainstream distributed processing platforms and paradigms and on how to employ them to solve challenging big data problems using popular data mining methods. Students will learn how to implement and apply various data mining algorithms, such as Naïve Bayes, K-Means clustering, and PageRank, on various open-source systems (e.g., Apache Hadoop, Apache Flink).

Description of Teaching and Learning Methods

This Integrated Course (Integrierte Veranstaltung, IV) consists of: (i) lectures on key concepts, (ii) practical exercises, both theoretical and programming, and (iii) student-led presentations (including literature search). Active participation in and contributions to all parts of this course are essential.


Requirements for participation and examination

Desirable prerequisites for participation in the course:

Knowledge of the computer science topics addressed in TU Berlin modules of the Bachelor’s curriculum, in particular the database course (“Information Systems and Data Analysis”) or an equivalent, as well as excellent Java and SQL programming skills, is strictly required. Basic knowledge of linear algebra, numerical analysis, probability, and statistics is strongly recommended. Furthermore, it is highly advisable that students have already completed (or are currently enrolled in) a machine-learning course. Since the course is offered in English, fluency in English is also required.