Using Statistics and Supervised Machine Learning to Inform Airline Decision Making
Audience
The course module will be deployed in Spring ’23, near the end of the semester.
The module is being deployed in a Random Signal Analysis course, in which students are introduced to probabilistic and statistical concepts applied to electrical systems and signals.
The expected class size is around 70 students.
Students are assumed to have some working knowledge of MATLAB coding. The module also assumes students have covered some basic probability concepts (e.g., normal distributions) and statistical concepts (e.g., hypothesis testing) in their class.
Other applicable courses: the module is useful for any course in which students apply probabilistic and statistical concepts to a real dataset and would benefit from an introduction to machine learning concepts and their application.
Project Summary
Primary Objective
This module aims to reinforce the statistical concepts introduced in class through the process of building a data analysis pipeline. Statistical concepts are first applied to gain an understanding of the data, and this understanding is then used to implement machine learning models. The module is split into two assignments. In the first, students work with the dataset to gain insight into its variables and underlying relationships (exploratory data analysis). In the second, students implement supervised machine learning models to generate predictions that may inform the decision making of airlines. The machine learning assignment also introduces the underlying concepts and supplies function wrappers, so that students can focus on interpreting the results rather than on coding the models.
Goals
Understand the steps of exploratory data analysis and what conclusions can be drawn about the relationships in the data
Understand the difference between regression and classification and the conclusions that can be drawn from each
Gain comfort applying the aforementioned concepts to a real-world dataset and interpreting and presenting the results in an informative manner
Content Outline
Part 1
Key Concepts: univariate analyses [histogram, boxplot, Q-Q plot, mean, standard deviation, five-number summary, normal distribution], bivariate/multivariate analyses [correlation matrix, scatter plot, hypothesis test]
Students will complete a MATLAB live script file and accompanying Canvas quiz
Part 2
Key Concepts: supervised learning [regression (linear regression, parsimonious model), classification (support vector machine), evaluation metrics (R-squared, accuracy)]
Students use a MATLAB live script file to analyze the relationship between airline carrier delays and passengers per flight
Part 2 is delivered in the form of a “Case Study”, wherein teams of students analyze the data using functions from Part 1 and machine-learning models from Part 2
The final deliverable of Part 2 is a 10-page slide deck recommending a specific course of action, based on the data, for the client (Phoenix Sky Harbor International Airport)
For more information, email: difuse-pi-group@dartmouth.edu