University of Chinese Academy of Social Sciences
Course Syllabus: Introduction to Python Data Science (《Python 数据科学导论》)

Course Information

Course ID:
*Credit Hours: 48
*Credits: 2
*Course Name: Python 数据科学导论 (Introduction to Python Data Science)
Prerequisite Courses:

*Description

Data is a collection of discrete objects, events, and facts in the form of numbers, text, pictures, videos, objects, audio, and other entities. Processing data provides a great deal of information. But the million-dollar question is: how do we get meaningful information from data? The answer is Exploratory Data Analysis (EDA), the process of investigating datasets, elucidating subjects, and visualizing outcomes. EDA is an approach to data analysis that applies a variety of techniques to maximize specific insights into a dataset, reveal its underlying structure, extract significant variables, detect outliers and anomalies, test assumptions, develop models, and determine the best parameters for future estimations.

This course aims to provide practical knowledge of the main pillars of EDA: data cleansing, data preparation, data exploration, and data visualization. Why visualization? Several research studies have shown that portraying data in graphical form makes complex statistical data analyses and business intelligence more marketable.

In this course, students will have the opportunity to explore open source datasets, including healthcare datasets, demographics datasets, a Titanic dataset, a wine quality dataset, automobile datasets, a Boston housing price dataset, and many others. Using these real-life datasets, students will get hands-on practice in understanding data, summarizing its characteristics, and visualizing it for business intelligence purposes. The course uses pandas, a powerful library for working with data, together with other core Python libraries: NumPy, scikit-learn, SciPy, StatsModels for regression, and Matplotlib for visualization.
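*Textbooks
Hands-On Exploratory Data Analysis with Python, by Suresh Kumar Mukhiya and Usman Ahmed. Publisher: Packt Publishing Ltd. ISBN: 978-1-78953-725-3.

Other References
Practical Data Science with Python, by Nathan George. Publisher: Packt Publishing Ltd. ISBN: 978-1-80107-197-0.

*Course Category
□ Public foundation course / university-wide required course
√ General education course
□ Major foundation course
□ Major core course / major required course
□ Major extension course / major elective course
□ Other

*Target Students: All undergraduate students of the university

*Mode of Instruction
□ Online (teaching platform: )
√ In person
□ Blended
□ Other
□ Practice-based (70% or more of class hours spent in the field)

*School: Computer Teaching and Research Department (计算机教研部)

*Language of Instruction
√ Chinese
□ Entirely in a foreign language
□ Bilingual: Chinese + (at least 50% taught in the foreign language)

*Teacher Information
Course leader (name and profile): Xu Weike (徐卫克), male, born in 1980, is a faculty member of the Computer Teaching and Research Department at the University of Chinese Academy of Social Sciences and a member of the Computational Social Science Research Center. His main research areas are data analysis and artificial intelligence. Courses taught include College Computing, Python Data Analysis, Python Deep Learning, Introduction to Python Programming, and Data Acquisition and Web Crawlers.
Team members (name and profile): None

Learning Outcomes
1. Master a range of data visualization methods;
2. Master a range of data processing methods, such as loading, cleaning, and transforming data;
3. Master basic statistical methods;
4. Master the different data grouping mechanisms;
5. Learn how to analyze correlations between data;
6. Understand time series data;
7. Master hypothesis testing and regression;
8. Master the development and evaluation of machine learning models.

*Grading: Regular coursework 30%, final assessment 70%

*Teaching Plan
Each week lists the total weekly class hours and their breakdown (lecture, lab, exercise session, course discussion, other), followed by a content summary (chapter titles, outline of the material covered, lab names, teaching methods, discussion topics, readings, and assignments).

Week 1 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
  1 Exploratory Data Analysis Fundamentals
    1.1 Understanding data science
    1.2 The significance of EDA
      1.2.1 Steps in EDA
    1.3 Making sense of data
      1.3.1 Numerical data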
      1.3.2 Categorical data
      1.3.3 Measurement scales
    1.4 Comparing EDA with classical and Bayesian analysis
    1.5 Software tools available for EDA

Week 2 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
    1.6 Getting started with EDA
      1.6.1 Python basic operations
      1.6.2 NumPy
      1.6.3 pandas

Week 3 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
      1.6.2 NumPy quickstart
        The Basics
        Shape Manipulation
        Copies and Views
        Less Basic
        Advanced indexing and index tricks

Week 4 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
  2 Visual Aids for EDA
    2.1 Technical requirements
    2.2 Line chart
    2.3 Bar charts
    2.4 Scatter plot
      2.4.1 Bubble chart
      2.4.2 Scatter plot using seaborn
    2.5 Area plot and stacked plot
    2.6 Pie chart
    2.7 Table chart
    2.8 Polar chart
    2.9 Histogram
    2.10 Lollipop chart
    2.11 Choosing the best chart
    2.12 Other libraries to explore

Week 5 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
  3 EDA with Personal Email
    3.1 Technical requirements
    3.2 Loading the dataset
    3.3 Data transformation
      3.3.1 Data cleansing
      3.3.2 Loading the CSV file
      3.3.3 Converting the date
      3.3.4 Removing NaN values
      3.3.5 Applying descriptive statistics
      3.3.6 Parsing UTF-8 encoding
      3.3.7 Data refactoring
      3.3.8 Dropping columns
      3.3.9 Refactoring timezones
    3.4 Data analysis
      3.4.1 Number of emails
      3.4.2 Time of day
      3.4.3 Average emails per day and hour
      3.4.4 Number of emails per day
      3.4.5 Most frequently used words

Week 6 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
  4 Data Transformation
    4.1 Technical requirements
    4.2 Background
    4.3 Merging database-style dataframes
      4.3.1 Concatenating along an axis
      4.3.2 Using df.merge with an inner join
      4.3.3 Using the pd.merge() method with a left join
      4.3.4 Using the pd.merge() method with a right join
      4.3.5 Using the pd.merge() method with an outer join
      4.3.6 Merging on index
      4.3.7 Reshaping and pivoting
    4.4 Transformation techniques
      4.4.1 Performing data deduplication
      4.4.2 Replacing values
      4.4.3 Handling missing data
      4.4.4 Renaming axis indexes
      4.4.5 Discretization and binning
      4.4.6 Outlier detection and filtering
      4.4.7 Permutation and random sampling
      4.4.8 Computing indicators/dummy variables
      4.4.9 String manipulation

Week 7 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
    4.5 10 minutes to pandas
      Basic data structures in pandas
      Object creation
      Viewing data
      Selection
      Missing data
      Operations
      Merge
      Grouping
      Reshaping
      Time series
      Categoricals
      Plotting
      Importing and exporting data
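
As a rough illustration of the join types listed under sections 4.3.2 to 4.3.5, the sketch below merges two small dataframes with pd.merge() and counts the rows each join type returns. The dataframes, column names, and values are hypothetical and are not taken from the textbook.

```python
import pandas as pd

# Hypothetical toy data: scores for overlapping sets of student IDs.
left = pd.DataFrame({"student_id": [1, 2, 3], "math": [90, 75, 60]})
right = pd.DataFrame({"student_id": [2, 3, 4], "physics": [88, 70, 95]})

# Number of rows produced by each join type on the shared key.
join_sizes = {
    how: len(pd.merge(left, right, on="student_id", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(join_sizes)

# An outer join keeps unmatched keys from both sides and fills gaps with NaN.
outer = pd.merge(left, right, on="student_id", how="outer")
print(outer)
```

Only IDs 2 and 3 appear in both frames, so the inner join keeps two rows, while the outer join keeps all four IDs and leaves the missing scores as NaN.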
Week 8 (weekly hours: 3; lecture: 3)
  Section 1: The Fundamentals of EDA
  5 Grouping Datasets
    5.1 Technical requirements
    5.2 Understanding groupby()
    5.3 Groupby mechanics
      5.3.1 Selecting a subset of columns
      5.3.2 Max and min
      5.3.3 Mean
    5.4 Data aggregation
      5.4.1 Group-wise operations
      5.4.2 Group-wise transformations
    5.5 Pivot tables and cross-tabulations
      5.5.1 Pivot tables
      5.5.2 Cross-tabulations

Week 9 (weekly hours: 3; lecture: 3)
  Section 2: Descriptive Statistics
  6 Descriptive Statistics
    6.1 Technical requirements
    6.2 Understanding statistics
      6.2.1 Distribution function
      6.2.2 Cumulative distribution function
      6.2.3 Descriptive statistics
    6.3 Measures of central tendency
      6.3.1 Mean/average
      6.3.2 Median
      6.3.3 Mode
    6.4 Measures of dispersion
      6.4.1 Standard deviation
      6.4.2 Variance
      6.4.3 Skewness
      6.4.4 Kurtosis
      6.4.5 Calculating percentiles

Week 10 (weekly hours: 3; lecture: 3)
  Section 2: Descriptive Statistics
    6.5 Making predictions using the central limit theorem and SciPy
      6.5.1 Manipulating the normal distribution using SciPy
      6.5.2 Determining the mean and variance of a population through random sampling
      6.5.3 Making predictions using the mean and variance

Week 11 (weekly hours: 3; lecture: 3)
  Section 2: Descriptive Statistics
  7 Correlation
    7.1 Technical requirements
    7.2 Introducing correlation
    7.3 Types of analysis
      7.3.1 Understanding univariate analysis
      7.3.2 Understanding bivariate analysis
      7.3.3 Understanding multivariate analysis
    7.4 Discussing multivariate analysis using the Titanic dataset
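
The grouping, descriptive statistics, and correlation topics above can be previewed with a minimal sketch in pandas. The small passenger-style table below is invented for illustration (it is not the Titanic dataset), and the column names are assumptions.

```python
import pandas as pd

# Hypothetical passenger-style records; all values are invented.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "age": [40, 50, 30, 34, 20, 24],
    "fare": [80.0, 100.0, 30.0, 34.0, 8.0, 12.0],
})

# Grouping: split rows by class, then aggregate each group.
mean_fare = df.groupby("pclass")["fare"].mean()
summary = df.groupby("pclass")["fare"].agg(["min", "max", "mean"])

# Descriptive statistics: central tendency and dispersion.
fare_median = df["fare"].median()
fare_std = df["fare"].std()  # sample standard deviation (ddof=1)

# Correlation: pairwise Pearson coefficients between numeric columns.
corr = df[["age", "fare"]].corr()

print(summary)
print(fare_median, fare_std)
print(corr)
```

DataFrame.corr() uses the Pearson coefficient by default; passing method="spearman" or method="kendall" gives the rank-based alternatives.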