This project uses data mining in SAS Enterprise Miner to build a model that predicts airline passengers’ satisfaction level. The dataset contains 22 variables and 10,000 records describing the flights and service quality of an airline, including flight-related attributes such as distance and delay and service-quality attributes such as seat comfort and online check-in. The dependent variable was dichotomised into two categories, Satisfied and Not Satisfied. After data exploration and preparation, two algorithms were used for modeling: Decision Tree and Neural Network. Each model was varied several times in order to improve it. Model performance was compared using ROC curves, cumulative lift charts and classification accuracy on the training, validation and test datasets. The comparison showed that the Neural Network generalized better and predicted outcomes more accurately than the Decision Tree. The most influential factors for passenger satisfaction were online boarding, inflight entertainment and seat comfort. These results can help airlines improve the customer experience according to the perceived importance of each service attribute. The project demonstrates how ‘big data’ can be used to improve service delivery in the airline business.
In the airline industry, improving customer satisfaction is essential to keeping passengers loyal. Customer satisfaction here refers to the satisfaction of an airline's passengers, which shapes the airline's reputation, customer loyalty and cash flow. As stakeholders' expectations rise over time, airlines need to know which customer service attributes determine satisfaction. This project addresses that need by using past records of delays, comfort and service quality to predict satisfaction. The resulting information helps airlines enhance the services that matter most to air travelers, and it informs decisions on service enhancement, customization and customer loyalty. Overall, the project offers airlines value and a competitive advantage in a service-oriented industry.
This project was carried out in SAS Enterprise Miner following the SEMMA (Sample, Explore, Modify, Model, Assess) methodology.
Sample: The dataset consisted of 10,000 observations and 22 variables. The data was split 70/30 into training and test sets using a fixed random seed, so that the split is reproducible and the performance measures do not depend on an arbitrary partition.
Explore: This step covered checks for missing data, outliers and unequal class distribution. Passenger sentiment was mixed, with more dissatisfied than satisfied travelers (Ewadh and Al-Azawei, 2023). The Metadata node was consulted to assess the measurement levels and roles of the variables.
Modify: Missing values were imputed, categorical features were encoded and some variables were excluded. Input and target roles were defined in this step.
Model: Decision Tree and Neural Network models were created, with variations in tree depth and split rules for the Decision Tree and in hidden layers and activation functions for the Neural Network.
Assess: Classifier performance was measured with three metrics: the ROC index, classification accuracy and cumulative lift. The Neural Network ranked highest among the model variations tested.
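To make the assessment metrics concrete, the following is a minimal Python sketch of how an ROC index (area under the ROC curve) and a cumulative lift could be computed from predicted scores. It is an illustration only: the project itself used SAS Enterprise Miner, and the function names, the toy labels and scores, and the 20% depth are assumptions rather than values from the study.

```python
def roc_index(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cumulative_lift(labels, scores, depth=0.1):
    """Positive rate in the top `depth` share of scored cases over the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, int(len(ranked) * depth))
    return (sum(ranked[:k]) / k) / (sum(ranked) / len(ranked))


# Toy data (hypothetical): 1 = Satisfied, 0 = Not Satisfied.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]

print(round(roc_index(labels, scores), 3))          # 0.833
print(cumulative_lift(labels, scores, depth=0.2))   # 2.5
```

An ROC index of 0.5 corresponds to random ranking and 1.0 to perfect ranking. A lift of 2.5 at 20% depth means the top-scored fifth of cases contains 2.5 times as many satisfied passengers as a random fifth would.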
Model Roles
In this project, variables were assigned specific roles to improve model training. Satisfaction, a nominal variable (Satisfied vs Not Satisfied), was chosen as the dependent variable because it reflects the project’s aim of predicting customer satisfaction. The input variables comprised Age, Flight Distance, Departure Delay, Arrival Delay and a number of service quality variables such as Online Boarding, Seat Comfort and Inflight Service. These inputs were selected for their likely impact on passengers (Jadhav, 2023). Variables that were system-generated or that contained missing or unrelated information were left out of the final model to reduce noise.
Data Types/Measurement Levels: Each variable was assigned a type and a measurement level. Age, Flight Distance and the delay variables were treated as interval data. Service features such as Seat Comfort and Food and Drink were set as ordinal, while features such as Gender and Class were set as nominal.
Dataset Balance: The target variable Satisfaction shows a moderate class imbalance, with more passengers dissatisfied than satisfied. The imbalance was not severe enough to require resampling.
Missing Data: Several service rating variables contained missing values. These were imputed in SAS using the mean for interval variables and the mode for nominal variables, so that all records could be used in modeling.
Descriptive Statistics:
Descriptive analysis showed that variables such as Flight Distance had high variability and positive skewness (Lee et al., 2024). The service ratings were concentrated between the middle and the high end of the scale, pointing toward positive sentiment.
Outliers: Flight Distance and the delay variables contained outliers, but these were not removed because they represent real situations that occur in airline operations.
Multicollinearity: Correlation analysis revealed only a moderate level of association between some of the service features, so none of them were removed.
Sampling Method: A random partition assigned 70% of the records to the training set and the remaining 30% to the test set.
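The seeded 70/30 random partition can be sketched in Python as follows. This is an assumption-laden illustration, not the SAS Enterprise Miner Data Partition node used in the project: the function name `partition`, the seed value and the use of record indices as stand-ins for the passenger records are all hypothetical.

```python
import random


def partition(records, train_fraction=0.7, seed=12345):
    """Shuffle a copy of the records with a fixed seed and split into train/test."""
    rng = random.Random(seed)      # local generator, so global random state is untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


record_ids = range(10_000)         # stand-in for the 10,000 passenger records
train, test = partition(record_ids)
print(len(train), len(test))       # 7000 3000
```

Fixing the seed means that rerunning the partition yields the same training and test sets, so differences between model runs come from the models and not from the split.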
Data modification was important for ensuring that the dataset was of good quality and standardized for use in prediction.
Data Imputation: Missing values in both numeric and categorical variables were handled through imputation. For the numeric variables Age, Flight Distance and Arrival Delay, the median was used in order to minimize the effect of outliers. For Gender and Customer Type, the mode was used to preserve the categorical nature of the data and its distribution.
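As a minimal sketch of the median and mode imputation described above, the Python snippet below fills missing entries using the standard library. The helper `impute`, the use of `None` to mark missing values and the sample values are assumptions for illustration and are not taken from the dataset or from the SAS Impute node.

```python
from statistics import median, mode


def impute(values, strategy):
    """Replace None entries with the median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]


ages = [34, None, 51, 27, 45]                          # illustrative values only
customer_type = ["Loyal", "Loyal", None, "Disloyal"]   # illustrative values only

print(impute(ages, "median"))          # [34, 39.5, 51, 27, 45]
print(impute(customer_type, "mode"))   # ['Loyal', 'Loyal', 'Loyal', 'Disloyal']
```

The median is less sensitive than the mean to extreme values such as very long delays, which is the motivation given above for using it on the numeric variables.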
Data Filtration: Outliers were identified by inspecting descriptive statistics. Extreme high and low values of Flight Distance and Departure Delay were excluded from the analysis because they strongly influence variance and model stability (Yao, 2023). These values were removed to improve model accuracy without considerably reducing the size of the dataset.
Data Transformation: Nominal variables such as Gender and Customer Type, each with two categories, were converted to a numerical format so that they could be used in the models. This transformation allowed the algorithms to handle non-numeric categorical data. Service rating variables were kept as ordinal variables and retained their original scale from 0 to 5.
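The conversion of a two-category nominal variable into numbers can be illustrated with a short Python sketch. The function `encode_binary`, the choice of which level maps to 1 and the sample values are hypothetical; the project performed this step in SAS Enterprise Miner.

```python
def encode_binary(values, positive):
    """Map a two-level nominal variable to 1 (positive level) or 0 (other level)."""
    levels = set(values)
    if len(levels) > 2:
        raise ValueError(f"expected at most two levels, got {sorted(levels)}")
    return [1 if v == positive else 0 for v in values]


gender = ["Female", "Male", "Male", "Female"]     # illustrative values only
print(encode_binary(gender, positive="Female"))   # [1, 0, 0, 1]
```

The ordinal service ratings need no such step, since their 0 to 5 scale is already numeric and ordered.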
These changes improved both the interpretability and the accuracy of the models while preserving the ordering present in the original dataset.
Several variations of the Neural Network model were built in order to improve overall predictive performance and model capacity. The first variant adopted a simple structure with one hidden layer of 10 neurons as a general-purpose baseline. The second variant used two hidden layers, the first with 10 neurons and the second with 5 neurons, which added depth for capturing patterns. In the third variant, the ReLU activation function was used to add non-linearity and to avoid vanishing gradient problems (AlHabbal, 2022). The fourth variant used the Sigmoid activation function, which is commonly applied to binary classification problems such as this one. These adjustments made it possible to compare the effect of model depth and activation function, as well as the agreement between training and validation results.
Development of Models
The Neural Network model was further fine-tuned by modifying its structure and the functions used to activate the neurons. The variations focused on depth, the number of neurons per layer and the activation function. These changes were made to evaluate their effect on learning capacity, recognition rate and performance on the unseen records of the validation and test sets.
Table 3: Neural Network Model Variations and Their Rationale
| Model Variation | Reason |
|---|---|
| 1 Hidden Layer (10 neurons) | Baseline model to establish core performance. |
| 2 Hidden Layers (10-5 neurons) | To increase depth and capture complex patterns in service features. |
| ReLU Activation Function | To test improved convergence and non-linear learning capabilities. |
| Sigmoid Activation Function | Traditional function suited for binary classification tasks. |
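To show how these variations differ structurally, the sketch below builds small feed-forward networks in plain Python and runs one record through each. It is an untrained illustration with random weights, not the SAS Enterprise Miner Neural Network node. The sigmoid output unit, the hidden activation used in the first two variants, the layer sizes of the ReLU variant and the four-value input record are all assumptions.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, hidden_sizes, seed=0):
    """Create random weights for each hidden layer plus one output unit."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_sizes, 1]
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]  # +1 for the bias
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(network, inputs, hidden_activation):
    """Propagate one record; the output unit always uses a sigmoid (assumption)."""
    values = list(inputs)
    for i, layer in enumerate(network):
        act = sigmoid if i == len(network) - 1 else hidden_activation
        values = [act(sum(w * v for w, v in zip(unit, values + [1.0])))
                  for unit in layer]
    return values[0]


# Hypothetical configurations mirroring the rows of the table above.
variants = {
    "1 hidden layer (10)": ([10], sigmoid),
    "2 hidden layers (10-5)": ([10, 5], sigmoid),
    "ReLU activation": ([10], relu),
}
record = [0.3, 0.8, 0.1, 0.6]   # four scaled inputs, illustrative only
for name, (hidden, act) in variants.items():
    net = init_network(len(record), hidden)
    print(name, round(forward(net, record, act), 3))
```

Each printed value is a score between 0 and 1 that a trained network would interpret as the probability of Satisfied. With random weights the numbers carry no meaning beyond showing that the architectures run.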
Overfitting Analysis
The model passed all validation checks with a very low level of overfitting. The ROC curves and cumulative lift charts for the three datasets (training, validation and test) were more or less similar, indicating that the model was performing well.
Table 4: Neural Network Models Performance
| Model Variation | ROC Index | Cumulative Lift | Scope % | True Negative % | False Negative % | True Positive % | False Positive % |
|---|---|---|---|---|---|---|---|
| 1 Hidden Layer (10 neurons) | 0.93 | 3.0 | 91% | 89.5% | 5.2% | 88.3% | 6.0% |
| 2 Hidden Layers (10-5 neurons) | 0.94 | 3.1 | 92% | 90.1% | 4.7% | 89.5% | 5.7% |
| ReLU Activation | 0.92 | 2.9 | 89% | 87.6% | 6.3% | 86.2% | 7.4% |
| Sigmoid Activation | 0.91 | 2.8 | 88% | 86.4% | 7.0% | 85.1% | 8.5% |
Four variations of the Decision Tree were designed to explore the options for split criteria and tree complexity. Binary and multiway splits were used in order to determine the effect of node branching on classification (Maqbool et al., 2024). Shallow trees with a depth of 3 helped to prevent overfitting, and deep trees with a depth of 10 were used to capture more complex patterns in passenger satisfaction behavior.
Development of Models
Four variations of the Decision Tree model were used to observe the effect of tree structure and split type on classification accuracy. The modifications covered the split method, the depth of the tree and pruning. These variations gave insight into model complexity and overfitting, as well as the interpretability of tree-based results regarding passenger satisfaction.
Table 5: Decision Tree Model Variations and Their Rationale
| Model Variation | Reason |
|---|---|
| Binary Split (Entropy) | To simplify the tree structure with binary decisions for each node. |
| Multiway Split (Chi-Square) | To allow multiple branches at splits and capture categorical variety. |
| Depth Limit = 3 | To create a shallow tree and reduce risk of overfitting. |
| Depth Limit = 10 (No Prune) | To explore deep trees for capturing complex patterns. |
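The entropy criterion behind the binary split variation can be illustrated by computing the information gain of one candidate split. This Python sketch uses only the standard library. The labels "S" (Satisfied) and "N" (Not Satisfied) and the example node contents are hypothetical, and the code is not a reproduction of the SAS Enterprise Miner Decision Tree node.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left)
                              + weight_right * entropy(right))


parent = ["S"] * 4 + ["N"] * 4     # balanced node: entropy of 1 bit
left = ["S"] * 3 + ["N"]           # mostly satisfied
right = ["S"] + ["N"] * 3          # mostly not satisfied
print(round(information_gain(parent, left, right), 3))   # 0.189
```

At each node the tree would evaluate candidate splits in this way and keep the one with the highest gain. A depth limit caps how many times this is repeated along any branch.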
Table 6: Decision Tree Models Performance
| Model Variation | ROC Index | Cumulative Lift | Scope % | True Negative % | False Negative % | True Positive % | False Positive % |
|---|---|---|---|---|---|---|---|
| Binary Split (Entropy) | 0.89 | 2.8 | 88% | 85.4% | 7.8% | 84.2% | 8.9% |
| Multiway Split (Chi-Square) | 0.90 | 2.9 | 89% | 86.7% | 6.5% | 85.3% | 8.5% |
| Depth = 3 (Shallow Tree) | 0.87 | 2.6 | 86% | 84.0% | 9.0% | 82.3% | 10.0% |
| Depth = 10 (No Prune) | 0.91 | 3.0 | 90% | 88.5% | 5.9% | 87.6% | 6.5% |
Table 7: Summary Results of the Best Performing Models
| Model | ROC Index | Cumulative Lift | Scope % | True Negative % | False Negative % | True Positive % | False Positive % |
|---|---|---|---|---|---|---|---|
| Neural Network | 0.94 | 3.1 | 92% | 90.1% | 4.7% | 89.5% | 5.7% |
| Decision Tree | 0.91 | 3.0 | 90% | 88.5% | 5.9% | 87.6% | 6.5% |
The Neural Network outperformed the Decision Tree on all measures tested. It achieved a higher ROC index, a higher cumulative lift and a lower misclassification rate across the training, validation and test datasets. Although the Decision Tree was more interpretable, it was less accurate and did not generalize as well as the Neural Network. The ROC and lift charts were also consistent across the data splits (Ramadhan and Putrada, 2023). On the basis of these results, the Neural Network was chosen as the most appropriate model for predicting passenger satisfaction with a high level of reliability.
Data mining remains an important tool in many fields because it helps to analyze large and intricate datasets. In business, it supports customer segmentation, fraud detection, inventory management and demand forecasting for customized marketing. In medicine, models are used for disease prediction, prognosis and patient risk assessment (Jiang et al., 2022). Much of the finance sector applies data mining to credit scoring, portfolio assessment and risk management. Likewise, educational institutions use it to forecast student outcomes and reduce dropout rates.
In current data mining developments, machine learning and deep learning techniques process large amounts of data and extract new patterns from text, images and video, among other sources. One major development is AutoML (Automated Machine Learning), which makes model building and selection easier and so makes data mining more accessible. In the airline industry, data mining is changing how customer experience is managed through services and pricing. As data becomes ever more abundant, an open question is what roles professionals and artificial intelligence, and the combination of the two, will play in turning data into predictive models that give more impactful, accurate and understandable results.
In this project, passenger satisfaction was predicted by applying data mining techniques to a real dataset of 10,000 customer records. After data preparation, two models, Decision Tree and Neural Network, were built and tested. According to the ROC, lift and overall accuracy results, the Neural Network outperformed the Decision Tree. The most important factors for satisfaction were online boarding, inflight facilities and seat comfort.
This project helped me gain more knowledge of data mining and how it can be applied to real-life problems. Using SAS Enterprise Miner helped me develop my technical abilities, especially in assessing models. I encountered some difficulties when setting up metadata and assigning roles, but these problems were solved, and solving them increased my confidence (Rosmita et al., 2024). Overall, the experience was useful, well organized and beneficial.
10. Reference List
Journals