COURSE SPECIFICATION
The course information below may change, either during the session because of unforeseen circumstances or
following review of the course at the end of the session. Queries about the course should be directed to the
course instructor.
1.
Course Title
Distributed Storage and Parallel Computing
2.
Originating Department
Department of Statistics and Data Science
3.
Course Code
STA321
4.
Credit Value
3
5.
Course Type
Major Core Course
6.
Semester
Fall
7.
Teaching Language
English & Chinese
8.
Instructor(s), Affiliation & Contact
(For team teaching, please list all instructors)
Yanqing Hu (胡延庆), Department of Statistics and Data Science. huyq@sustech.edu.cn
9.
Tutor/TA(s), Contact
To be announced
10.
Maximum Enrolment (Optional)
11.
Delivery Method
Lectures
Lab/Practical
Other (Please specify)
Total
Credit Hours
48
48
12.
Pre-requisites or Other Academic Requirements
Introduction to Computer Programming (CS102)
Data Structures and Algorithm Analysis (CS203)
13.
Courses for which this course is a pre-requisite
14.
Cross-listing Dept.
SYLLABUS
15.
Course Objectives
Building on courses in computer programming and data structures, this course focuses on big data processing.
Students will learn the current models and frameworks for distributed storage and parallel computing, and will
gain initial hands-on ability in distributed programming.
16.
Learning Outcomes
On successful completion of the course, students should be able to:
- Describe the hardware, software, and system architecture of big data management, the new programming
paradigms, and the latest research progress in parallel and distributed computing.
- Explain the overall framework of cloud computing, its key implementation technologies, and its business
models, and build high-performance clusters and write distributed programs in practice.
17.
Course Contents (in Parts/Chapters/Sections/Weeks. If the teaching language is mainly English, the course
contents may be described in English. If this is a team-taught or module course, please give the name of the
instructor for each course section.)
First Part: Introduction to distributed storage and parallel computing [4 hours]
> Basic concepts, motivations, current situation and development, application prospects [2 hours]
> Examples of parallel computing using Linux and Python [2 hours]
Second Part: Methods for distributed storage and parallel computing [18 hours]
> Infrastructures of distributed systems [2 hours]
> Map and Reduce for parallel computing [4 hours]
> Workload balance and scheduling [2 hours]
> Communication and synchronisation in parallel computing [4 hours]
> Transactions and locks [2 hours]
> Fault-tolerance, Byzantine fault, and Paxos/RAFT protocols [2 hours]
> Distributed file system for distributed storage (e.g. HDFS) [2 hours]
Third Part: Parallel computing in practice [14 hours]
> Data processing in multi-threads/multi-processes [4 hours]
    Data crawling, cleaning, preprocessing [2 hours]
    Experiments [2 hours]
> Hadoop and PySpark [4 hours]
    Basic concepts and usage of Hadoop and PySpark [2 hours]
    Experiments [2 hours]
> Machine learning with multiple GPUs [6 hours]
    Clustering, regression, classification, collaborative filtering [4 hours]
    Experiments [2 hours]
Fourth Part: Distributed storage and parallel computing in the future [6 hours]
> Training large neural networks: data parallelism, model parallelism and beyond [2 hours]
> Blockchain -- decentralized distributed system [4 hours]
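As an illustration of the "Map and Reduce for parallel computing" topic listed above, the following is a minimal sketch in plain Python, not course material. The function names (`map_chunk`, `merge_counts`, `word_count`) and the sample data are hypothetical. A thread pool is assumed here only to keep the sketch portable; for CPU-bound work a process pool or a framework such as PySpark would normally take its place.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(text):
    """Map step: count the words in one chunk of text."""
    return Counter(text.lower().split())


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks, workers=2):
    """Apply the map step to each chunk concurrently, then reduce the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    # Start from an empty Counter so an empty input still returns a Counter.
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    # Hypothetical sample data: each string stands in for one data partition.
    chunks = ["spark hadoop spark", "hadoop hdfs", "spark"]
    print(word_count(chunks).most_common())
```

The map step touches each chunk independently, which is what allows it to be distributed; only the reduce step needs to see results from more than one chunk.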
18.
Textbook and Supplementary Readings
Textbook:
- Tomasz Drabas, Denny Lee. Learning PySpark: Build data-intensive applications locally and deploy at scale using the
combined powers of Python and Spark 2.0. Packt Publishing. 2017. Available at
https://k-state.instructure.com/files/7013786/download?download_frd=1
- Giancarlo Zaccone. Python Parallel Programming Cookbook. Packt Publishing. 2015. Available at
https://docs.google.com/viewer?a=v&pid=sites&srcid=b2JqZWN0bWFnZS5jb218cHJpdmF0ZS10cmFpbmluZ3xneDoyZjU2M2U4NGJiN2M0NWU2
Supplementary Readings:
- PySpark tutorial: https://sparkbyexamples.com/pyspark-tutorial/
- Parallel computing. Stanford CS149, Fall 2021. https://gfxcourses.stanford.edu/cs149/fall21
- Wenqiang Feng. Learning Apache Spark with Python.
https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf or
https://runawayhorse001.github.io/LearningApacheSpark/
ASSESSMENT
19.
Type of Assessment | Time | % of final score | Penalty | Notes
Attendance | | | |
Class Performance | | | |
Quiz | | | |
Projects | | | |
Assignments | | 25 | |
Mid-Term Test | | | |
Final Exam | | 50 | |
Final Presentation | | 25 | |
Others (the above may be modified as necessary) | | | |
20.
GRADING SYSTEM
A. Letter Grading (thirteen-level scale)
REVIEW AND APPROVAL
21.
This course has been reviewed and approved by the following responsible person or committee