Screen Australia Data Analysis
The Australian film market faces many challenges in the post-pandemic era. This project aims to predict movies' lifetime box office revenue with machine learning models and to give Screen Australia data support for formulating effective box office maximization strategies.
Introduction
Screen Australia is an Australian government agency established in 2008 to support the development and production of local Australian films and television content, including films, TV series and documentaries. The agency's main objective is to provide a platform for creative talents in Australia to assist them in the creation of film and television content.
However, the COVID-19 pandemic has hit the Australian film industry hard, and it now faces unprecedented challenges. In response, this project builds and analyzes machine learning models to predict the lifetime box office of movies released in the post-COVID-19 period. The data used in the report was provided by Screen Australia and covers January 2010 to February 2023.
Data Processing
I processed 7,093 entries of data across 29 variables. By addressing missing values, converting data types, and handling outliers, I ensured data integrity and the accuracy of the analysis.
Details of Data Processing
To address data issues, I first calculated the proportion of missing values in each variable and removed the variables with over 50% missing data, treating their missingness as missing completely at random (MCAR).
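The column-level filter described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `drop_sparse_columns`, the toy frame, and the column 'Sparse Column' are hypothetical, and only the 50% cut-off comes from the text.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Keep only the columns whose share of missing values is at most max_missing."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]


# Hypothetical toy data standing in for the Screen Australia dataset.
toy = pd.DataFrame({
    "Lifetime Gross": [100.0, 250.0, 80.0, 40.0],
    "Preview Gross": [10.0, np.nan, 5.0, np.nan],    # 50% missing: kept
    "Sparse Column": [np.nan, np.nan, np.nan, 1.0],  # 75% missing: dropped
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

A column with exactly 50% missing values is kept here, because the text removes only variables with over 50% missing.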
For variables like 'Preview Gross', which have high missing rates but correlate strongly with Lifetime Gross, I filled the missing values with the median after comparing several imputation methods.
Categorical variables like 'Actor1' and 'Actor2' had missing values replaced with 'NA' to avoid bias, while other numerical variables with low missing rates were removed.
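The two imputation rules above (median for 'Preview Gross', an explicit 'NA' category for the actor columns) might look like this in pandas. The toy values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the report.
toy = pd.DataFrame({
    "Preview Gross": [10.0, np.nan, 30.0, np.nan, 50.0],
    "Actor1": ["A", None, "B", "C", None],
})

# Median imputation: Series.median() ignores missing values by default.
median_preview = toy["Preview Gross"].median()
toy["Preview Gross"] = toy["Preview Gross"].fillna(median_preview)

# Missing categories become their own explicit level instead of being dropped.
toy["Actor1"] = toy["Actor1"].fillna("NA")
print(toy)
```

In a modelling pipeline the median would normally be computed on the training split only and then reused on the test split; the sketch skips that step for brevity.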
I also converted data types to match each variable's characteristics and applied a log transform to 'Lifetime Gross' to correct its skewed distribution.
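As a small illustration of the log transform, the sketch below uses `np.log1p` and its inverse `np.expm1`. The choice of `log1p` over a plain logarithm is an assumption made here so that zero values would not break the transform; the report only says a log function was used. The gross figures are invented.

```python
import numpy as np

# Hypothetical lifetime gross values spanning several orders of magnitude.
lifetime_gross = np.array([1_200.0, 45_000.0, 310_000.0, 52_000_000.0])

# log1p(x) = log(1 + x) compresses the long right tail of the distribution.
log_gross = np.log1p(lifetime_gross)

# expm1 inverts log1p, so predictions made on the log scale can be mapped back.
recovered = np.expm1(log_gross)

print(log_gross.round(2))
print(np.allclose(recovered, lifetime_gross))
```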
For outliers, I defined a threshold based on industry data and removed outliers to ensure accuracy in the analysis.
Through these steps, I aimed to minimize bias and improve the reliability of the analysis.
Exploratory Data Analysis
In this section, I delved into the dataset to grasp its structure, spot patterns, and identify any anomalies. By analysing numerical, categorical, and time-series variables, I uncovered key factors that influence lifetime gross, such as the strong correlation between preview gross and lifetime gross, as well as the significance of distributor and genre. I also observed shifts in audience preferences post-pandemic, with M-rated movies and locally produced films becoming more popular. Additionally, the time-series analysis showed that while most genres saw a decline, the Horror genre grew during and after the COVID-19 pandemic, highlighting a potential area for future investment.
Exploratory Data Analysis Highlights
Relationships among the variables
Lifetime Gross has a strong correlation with Opening Week Gross, Opening Weekend Gross, Opening Day Screens, Opening Day Gross, Opening Weekend Screens, and Opening Week Screens.
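A ranking of this kind can be produced from a correlation matrix. The sketch below assumes Pearson correlation and uses invented numbers; only the column names follow the report.

```python
import pandas as pd

# Hypothetical toy rows; the real analysis uses the full dataset.
toy = pd.DataFrame({
    "Lifetime Gross": [120.0, 300.0, 80.0, 560.0, 210.0],
    "Opening Week Gross": [40.0, 110.0, 25.0, 200.0, 70.0],
    "month": [3, 3, 7, 1, 12],
})

# Correlation of every feature with the target, ranked by absolute strength.
corr_with_target = (
    toy.corr(method="pearson")["Lifetime Gross"]
    .drop("Lifetime Gross")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```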
P-Value of Each Feature
All variables except "% of Opening Weekend to Week," "month," and "day" exhibit significant correlations with Lifetime Gross.
The Correlation between Each Variable and Lifetime Gross across Time Period
The impact of Preview Gross on Lifetime Gross is gradually intensifying.
Feature Engineering
To estimate the performance of machine-learning models on unseen data, I applied an 80-20 train-test split to the Screen Australia dataset. For handling categorical variables, I used target encoding while carefully avoiding target leakage by employing smoothing techniques. To improve the linear regression model, I addressed multicollinearity issues identified through Variance Inflation Factor (VIF) analysis. Strategies to mitigate multicollinearity included excluding correlated predictors, transforming variables, or using regularization techniques like Lasso or Ridge regression, which help reduce the impact of multicollinearity while maintaining model accuracy.
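To make the encoding step concrete, here is a minimal sketch of smoothed target encoding. It assumes the encoding is fitted on the training split only and then applied to the test split, and that unseen categories fall back to the global mean; the helper names, the `smoothing` value, and the toy data are all hypothetical rather than the project's actual settings.

```python
import pandas as pd


def fit_target_encoding(train: pd.DataFrame, column: str, target: str, smoothing: float):
    """Learn a smoothed per-category mean of the target from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rare categories shrink more.
    mapping = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return mapping, global_mean


def apply_target_encoding(df: pd.DataFrame, column: str, mapping: pd.Series, global_mean: float) -> pd.Series:
    """Encode a column; categories unseen during fitting get the global mean."""
    return df[column].map(mapping).fillna(global_mean)


# Hypothetical training and test rows (target on the log scale).
train = pd.DataFrame({
    "Distributor": ["A", "A", "A", "B"],
    "Log Lifetime Gross": [10.0, 12.0, 14.0, 20.0],
})
test = pd.DataFrame({"Distributor": ["A", "C"]})

mapping, global_mean = fit_target_encoding(train, "Distributor", "Log Lifetime Gross", smoothing=2.0)
encoded_test = apply_target_encoding(test, "Distributor", mapping, global_mean)
print(encoded_test.tolist())
```

Distributor 'B' has a single training row, so its encoded value is pulled well away from its raw mean of 20 toward the global mean of 14. This is the effect smoothing is meant to have on sparse categories.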
Model Comparison and Selection
I employed various models, including linear regression, decision trees, random forests, XGBoost, and model stacking. On metrics such as R² and RMSLE, random forests and model stacking outperformed the other models in prediction accuracy and stability.
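For reference, the two metrics can be computed as shown below. This is a sketch using the standard definitions, with RMSLE evaluated on original-scale values via `log1p`; the function names and sample numbers are illustrative and not taken from the project.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical actual and predicted lifetime gross values.
y_true = np.array([100.0, 1_000.0, 10_000.0])
y_pred = np.array([120.0, 900.0, 11_000.0])
print(rmsle(y_true, y_pred), r_squared(y_true, y_pred))
```

Lower RMSLE and higher R² indicate better predictions. RMSLE penalises relative rather than absolute errors, which suits box office figures that span several orders of magnitude.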
Conclusions and Recommendations
Based on my model analysis, I recommend the following:
1. Focus on the popularity of directors and actors. The prominence of directors and actors plays a crucial role in influencing box office revenue. Screen Australia should prioritize engaging well-known directors and actors during film production and promotion to boost audience interest and attention.
2. Prioritize premiere week screenings. The number of screenings during the premiere week substantially impacts a movie's box office performance. Screen Australia should collaborate with cinemas to secure a higher number of screenings during the premiere week, drawing in larger audiences.
3. Develop targeted publicity strategies. Utilizing random forest and stacked models for predicting box office outcomes enables the formulation of tailored promotional recommendations. By leveraging these predictions, Screen Australia can identify movies that require additional promotion to stimulate audience interest.
4. Maintain ongoing tracking and optimization of forecasting models. As market dynamics and audience preferences evolve, forecasting models may require adjustments. Screen Australia should regularly review and optimize these models to ensure they accurately capture market trends when projecting box office performance.
By implementing these recommendations, Screen Australia can support the Australian film industry's growth in the post-pandemic period, increase box office revenue, and contribute to the flourishing Australian film market.