- Introduction
- Project Overview
- Data Selection and Preparation
- Feature Engineering and Selection
- Model Building and Evaluation
- Key Findings and Insights
- Real-World Application and Impact
- Challenges and Learnings
- Future Work and Improvements
In the highly competitive retail industry, accurate pricing strategies are crucial for maintaining profitability and market relevance. H&M, a global leader in fashion retail, has a vast inventory of articles, each with varying characteristics such as color and type. These attributes can significantly influence customer perceptions and purchasing decisions, ultimately affecting the product's price.
To enhance pricing strategies, this project aims to develop a predictive model that estimates the prices of H&M store articles based on their color and type. By leveraging machine learning techniques, we intend to uncover patterns and relationships within the data that can provide valuable insights into price determination.
This project aims to predict the prices of H&M articles based on their characteristics, specifically color and type. By developing an accurate predictive model, H&M can optimize its pricing strategy, improve inventory management, and enhance customer satisfaction.
- Objective: To build a robust machine learning model that accurately predicts article prices based on color and type.
- Potential Impact: Improved pricing strategies, optimized inventory management, and enhanced customer satisfaction.
- Source: H&M article dataset containing characteristics and prices.
- Key Characteristics: Article color, type, and price.
- Handled missing values and ensured data consistency.
- Applied one-hot encoding for categorical variables.
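The preparation steps above can be sketched as follows; the column names (`color`, `type`, `price`) are placeholders for the actual dataset fields, not confirmed names from the H&M data:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, then one-hot encode the categoricals.

    Column names ("color", "type", "price") are placeholders for the
    actual dataset fields.
    """
    df = df.dropna(subset=["color", "type", "price"])
    # pd.get_dummies creates one binary column per distinct color/type value
    return pd.get_dummies(df, columns=["color", "type"])
```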
- Created new features from existing data to enhance model performance.
- Utilized PCA for dimensionality reduction to retain essential information while reducing feature set complexity.
- Selected features based on their relevance to the target variable (price).
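A minimal sketch of the PCA dimensionality-reduction step; the feature matrix here is synthetic stand-in data (in the project it would be the one-hot-encoded article features), and the 95% variance threshold is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the one-hot-encoded article feature matrix
rng = np.random.default_rng(0)
X = rng.random((100, 20))

# Keep the smallest number of components that retains ~95% of the
# variance, reducing feature-set complexity while preserving most
# of the information.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"{X_reduced.shape[1]} of 20 components retained")
```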
- K-Nearest Neighbors (KNN)
- Random Forest
- Gradient Boosting
- Bagging
- Linear Regression
- Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R²)
- Validation: Train-test split, cross-validation
- Enhanced model performance through systematic hyperparameter tuning.
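The model-building and evaluation workflow above can be sketched with scikit-learn. This uses synthetic stand-in data rather than the H&M features, default model settings, and an illustrative tuning grid, so the numbers it prints will not match the results reported below:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the encoded article features and prices
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Bagging": BaggingRegressor(random_state=42),
    "Linear Regression": LinearRegression(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
        # 5-fold cross-validation gives a more stable estimate
        "CV R2": cross_val_score(model, X_train, y_train, cv=5).mean(),
    }
    print(name, results[name])

# Hyperparameter tuning for the best model (grid values are illustrative)
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3, scoring="neg_mean_absolute_error",
)
grid.fit(X_train, y_train)
print("Best Random Forest params:", grid.best_params_)
```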
- Random Forest: Best performance with the lowest MAE and MSE and highest R² score.
- Linear Regression, Gradient Boosting, Bagging: Also performed well but slightly less accurate than Random Forest.
- KNN: Has the lowest R² score, indicating it explains less variance than the other models.
- Ensemble methods like Random Forest and Bagging are highly effective for this type of regression problem.
- Can be applied to optimize pricing strategies in H&M stores.
- Potential to improve inventory management and marketing strategies.
- Enhanced profitability through optimized pricing.
- Better customer satisfaction due to more accurate and fair pricing.
- Ensure pricing strategies do not lead to unfair pricing or discrimination.
- Handling large and complex datasets.
- Ensuring data consistency and quality.
- Importance of data preprocessing and feature engineering.
- Value of ensemble methods in improving model accuracy.
- Explore additional features that may impact pricing.
- Experiment with more advanced machine learning techniques.
- Incorporate more sophisticated feature engineering techniques.
- Utilize larger datasets for training to improve model robustness.
- The Random Forest model has the lowest MAE (0.012079), followed closely by the Bagging and Gradient Boosting models. This indicates that these models have the smallest average errors in their predictions.
- The KNN model has a slightly higher MAE (0.013167), but it is still reasonably close to the other top models.
- The Linear Regression model performs comparably to the ensemble models with an MAE of 0.012232.
- Similar to the MAE results, the Random Forest and Bagging models have the lowest MSE (0.000475), suggesting they have the smallest average squared errors.
- The Gradient Boosting model follows closely with an MSE of 0.000490.
- The KNN model has a slightly higher MSE (0.000539), but it is still close to the top-performing models.
- The Linear Regression model has an MSE of 0.000482, performing better than KNN and close to the top models.
- The Random Forest model has the highest R² score (0.292370), meaning it explains the highest proportion of the variance in the target variable.
- The Bagging model follows closely with an R² score of 0.291828, and the Gradient Boosting model has an R² score of 0.269941.
- The KNN model has a lower R² score (0.197233), indicating it explains less variance compared to the other top models.
- The Linear Regression model performs well with an R² score of 0.282263.
- The Random Forest model is the best-performing model overall, with the lowest MAE and MSE and the highest R² score.
- The Bagging and Gradient Boosting models also perform well, showing competitive MAE, MSE, and R² scores.
- The KNN model performs reasonably well but is slightly less accurate than the ensemble models.
- The Linear Regression model performs comparably to the ensemble models, making it a viable option for prediction.
- This analysis suggests focusing on ensemble methods like Random Forest and Bagging for better prediction accuracy on this dataset.
- Team Members: Dalreen Soares, Daniela Trujillo, Lāsma Oficiere

