Predicting Apple’s Future: Testing 6 Forecasting Models to Uncover Market Trends (2018–2023)
Finding the best prediction model among SVR, KNN, Naive Bayes, Decision Tree, Random Forest, and AdaBoost.
Apple is a well-known company that needs no introduction. As a data science engineer, I decided to use my skills to predict stock prices as a side project. One day, while looking at my iPhone, it hit me: why not use my expertise to predict Apple’s stock prices using various models and find out which one works best?
Dataset
The data comes from Kaggle’s dataset at https://www.kaggle.com/datasets/guillemservera/aapl-stock-data. Kaggle datasets are often clean, but I still ran a few checks for missing values and outliers. As expected, the data was clean. The complete dataset spans from the 1980s to 2024, but for this experiment I focused on a smaller range: 2018 to 2023.
The dataset includes the opening price, closing price, high and low prices, volume of shares traded, and other metrics such as adjusted close, change percent, and average volume over 20 days.
Let’s Predict the Future
Prediction Using SVR Model
Support Vector Regression (SVR) is a machine learning technique for predicting continuous values such as stock prices. Unlike ordinary linear regression, which penalizes every residual, SVR fits a function with a margin of tolerance around it (controlled by the epsilon parameter): errors that fall inside the margin are ignored, and only points outside it contribute to the loss. With a non-linear kernel such as RBF, it can also follow non-linear trends, which makes it a reasonable candidate for financial data.
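To make the margin idea concrete, here is a minimal pure-Python sketch of the epsilon-insensitive loss that SVR is built around. The function name and the sample prices are illustrative assumptions, not values from the experiment.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Average epsilon-insensitive loss: errors within +/- epsilon cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical closing prices and predictions (illustrative values only)
actual = [150.0, 151.0, 149.5]
predicted = [150.05, 152.0, 149.5]

# Only the second prediction misses by more than epsilon, so only it is penalized
loss = epsilon_insensitive_loss(actual, predicted, epsilon=0.1)
print(loss)
```

A larger epsilon widens the tolerated band and makes the model less sensitive to small fluctuations; a smaller one forces it to track the data more closely.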
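import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# `data` is the Kaggle AAPL DataFrame restricted to 2018-2023
# Derive 'timestamp' (seconds since the Unix epoch) from the date column
data['date'] = pd.to_datetime(data['date'])
data['timestamp'] = data['date'].astype(np.int64) // 10**9
# Select features and target variable from the dataset
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d', 'timestamp']
X = data[features]
y = data['close']
# Normalize the data so that all features share a similar scale and none dominates training because of its larger magnitude
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the SVR model
svr = SVR(kernel='rbf', C=1.0, gamma='scale')
svr.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = svr.predict(X_test)
print("Mean Absolute Error (MAE):", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))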
The result we got is:
Mean Absolute Error (MAE): 10.835217345694337
Mean Squared Error (MSE): 1052.5231545431413
R-squared: 0.7202456891060729
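For reference, the three metrics reported throughout this article are defined as follows. This is a minimal pure-Python sketch with made-up numbers; in practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score` for the same purpose.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R-squared) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2


# Illustrative values only, not taken from the AAPL dataset
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [102.0, 108.0, 121.0, 127.0]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, r2)
```

Because MSE squares each error, a handful of large misses inflates it much more than it inflates MAE, which is worth keeping in mind when the two metrics rank models differently.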
As you can see, these results are not good: an R-squared of about 0.72 leaves a lot of the variance unexplained. To improve the model, I turned to hyperparameter tuning with GridSearchCV.
Hyperparameter tuning with GridSearchCV is a method to find the best settings for a machine learning model. Hyperparameters are the parameters you set before training, like the learning rate in a neural network or the type of kernel in a Support Vector Machine.
GridSearchCV helps you find the best combination of these hyperparameters. It builds a grid of candidate values and evaluates every combination with cross-validation: the training data is divided into folds, the model is trained on all but one fold and scored on the held-out fold, and the scores are averaged so that no single split decides the outcome.
In a nutshell, GridSearchCV lets you systematically try different hyperparameter values to find the one that makes your model perform best. This makes your model more reliable and effective without having to guess the best settings.
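To get a feel for the cost of an exhaustive search, the sketch below enumerates the SVR grid used in the next section with `itertools.product` and counts the model fits that 5-fold cross-validation implies. It only counts combinations and does not train anything.

```python
from itertools import product

# The SVR grid from the next section
param_grid = {
    'kernel': ['rbf', 'linear', 'poly'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'epsilon': [0.1, 0.01, 0.001],
}
cv_folds = 5

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is trained and scored once per fold
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)
```

The count grows multiplicatively with every hyperparameter added, so it pays to keep each list of candidate values short. By default, GridSearchCV also refits the best combination once more on the full training set.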
Prediction using SVR Model (Using GridSearchCV)
# Created additional features
data['date'] = pd.to_datetime(data['date'])
data['timestamp'] = data['date'].astype(np.int64) // 10**9
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['day_of_week'] = data['date'].dt.dayofweek
data['moving_avg_20'] = data['close'].rolling(window=20).mean()
data['price_diff'] = data['close'].diff()
# Back-fill NaN values (the first rows of moving_avg_20 and price_diff); fillna(method='bfill') is deprecated in recent pandas
data.bfill(inplace=True)
# Selected features and target variable
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d', 'timestamp', 'year', 'month', 'day', 'day_of_week', 'moving_avg_20', 'price_diff']
X = data[features]
y = data['close']
# Normalized the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
from sklearn.model_selection import GridSearchCV
# Hyperparameter tuning with GridSearchCV
param_grid = {
    'kernel': ['rbf', 'linear', 'poly'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'epsilon': [0.1, 0.01, 0.001]
}
svr = SVR()
grid_search = GridSearchCV(svr, param_grid, scoring='neg_mean_absolute_error', cv=5)
grid_search.fit(X_train, y_train)
# Retrieve the best model (best_estimator_); its hyperparameters are stored in best_params_
best_model = grid_search.best_estimator_
print("Best Hyperparameters:", grid_search.best_params_)
# Make predictions and evaluate the model
y_pred = best_model.predict(X_test)
The result we got is:
Best Hyperparameters: {'C': 100, 'epsilon': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Mean Absolute Error (MAE): 0.9939365677497963
Mean Squared Error (MSE): 31.075102481634186
R-squared: 0.9917404250508093
Prediction using KNN Model
K-Nearest Neighbors (KNN) is a simple yet effective machine learning technique used for both classification and regression tasks. In KNN, the prediction for a data point is made based on the “k” closest points in the dataset, known as “neighbors.”
For stock price prediction, KNN can be useful because it doesn’t make strong assumptions about the underlying data distribution and is flexible with non-linear patterns. The idea is that similar data points tend to have similar outcomes, so by considering a neighborhood of data points, KNN can make predictions based on these similarities.
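The neighborhood idea fits in a few lines of plain Python. The sketch below is a hand-rolled illustration, not scikit-learn's implementation; the function name and the toy data are assumptions. It supports both uniform averaging and inverse-distance weighting, which correspond to the `'uniform'` and `'distance'` options tried in the grid search.

```python
import math


def knn_predict(train_X, train_y, query, k=3, weighted=True):
    """Predict a value for `query` from its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its target
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    neighbors = distances[:k]
    if not weighted:
        return sum(y for _, y in neighbors) / k
    for d, y in neighbors:
        if d == 0:
            return y  # exact match: use its target directly
    # Inverse-distance weighting: closer neighbors count for more
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


# Toy one-feature dataset (illustrative values only)
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

uniform_pred = knn_predict(train_X, train_y, (2.5,), k=3, weighted=False)
weighted_pred = knn_predict(train_X, train_y, (2.5,), k=3, weighted=True)
print(uniform_pred, weighted_pred)
```

With weighting enabled, the two closest points pull the prediction toward their targets, while the more distant third neighbor contributes less.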
# Selected features and target variable
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d', 'timestamp', 'year', 'month', 'day', 'day_of_week', 'moving_avg_20', 'price_diff']
X = data[features]
y = data['close']
# Normalized the data as it is crucial for algorithms like KNN, where distance measurement is important.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
from sklearn.neighbors import KNeighborsRegressor
# Hyperparameter tuning with GridSearchCV for KNN
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11], # Experiment with different 'k' values
    'weights': ['uniform', 'distance'], # Try both uniform and distance weighting
    'metric': ['euclidean', 'manhattan', 'minkowski'] # Distance metrics; 'minkowski' with the default p=2 is equivalent to 'euclidean'
}
knn = KNeighborsRegressor()
grid_search = GridSearchCV(knn, param_grid, scoring='neg_mean_absolute_error', cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print("Best Hyperparameters:", grid_search.best_params_)
# Make predictions and evaluate the model
y_pred = best_model.predict(X_test)
The result we got is:
Best Hyperparameters: {'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'}
Mean Absolute Error (MAE): 2.9743789532221245
Mean Squared Error (MSE): 15.397849018573007
R-squared: 0.9959073445340882
Prediction using Naive Bayes Model
Naive Bayes is a simple yet powerful classification algorithm based on Bayes’ Theorem. It assumes that the features are independent of one another given the class, which is the “naive” part, since real features are rarely independent. Because it is a classifier rather than a regressor, I did not use it to predict the closing price itself. Instead, it classifies whether the stock’s price is likely to go up or down based on the day’s features.
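As a rough illustration of what a Gaussian Naive Bayes classifier does internally, the sketch below estimates a mean and variance per feature and per class, then picks the class with the highest log-posterior. It is a simplified hand-written version, not scikit-learn's `GaussianNB`, and the two features and their values are made up for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small floor on the variance avoids division by zero
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class whose log-prior plus summed log-likelihoods is highest."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add up
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical rows: (daily percent change, scaled volume)
X = [(1.2, 0.5), (0.8, 0.7), (1.5, 0.4), (-1.0, 0.6), (-0.7, 0.5), (-1.3, 0.8)]
y = ['up', 'up', 'up', 'down', 'down', 'down']

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.0, 0.6)), predict_gaussian_nb(model, (-0.9, 0.6)))
```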
# Creating a binary target variable indicating stock price movement
data['date'] = pd.to_datetime(data['date'])
data['price_change'] = data['close'].diff() # Price change from the previous day
data['movement'] = np.where(data['price_change'] > 0, 'up', 'down') # 'up' if the price increased, otherwise 'down' (including unchanged days and the first row)
# Back-fill the NaN that diff() leaves in the first row
data.bfill(inplace=True)
# Selected features and target variable
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d']
X = data[features]
y = data['movement'] # Target variable: 'up' or 'down'
# Normalized the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
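from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
# Train a Gaussian Naive Bayes classifier
nb = GaussianNB()
nb.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = nb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))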
The result we got is:
Accuracy: 0.9006622516556292
Prediction using Decision Tree Model
Decision Trees are used because they’re easy to understand and visualize, making them a great choice when you need a simple way to explain complex decisions. In a Decision Tree, you start with a big question and then split into smaller questions, just like a tree with branches.
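Each "smaller question" in a regression tree is a threshold on one feature, chosen so that the targets on each side are as similar as possible. The sketch below finds one such split by brute force. The function name and the toy prices are illustrative assumptions; a real tree repeats this search over every feature and then recursively within each branch.

```python
def best_split(values, targets):
    """Find the threshold on one feature that minimizes the total squared error."""
    def sse(ys):
        # Sum of squared errors around the mean, i.e. the error of predicting the mean
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    pairs = sorted(zip(values, targets))
    best_threshold, best_error = None, float('inf')
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Hypothetical opening prices (feature) and closing prices (target)
opens = [100, 102, 104, 150, 152, 154]
closes = [101, 103, 105, 151, 153, 155]

threshold, error = best_split(opens, closes)
print(threshold, error)
```

Here the search places the threshold in the gap between the two price clusters, because that split leaves the least variation on either side.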
# Selected features and target variable
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d', 'timestamp', 'year', 'month', 'day_of_week', 'moving_avg_20', 'price_diff']
X = data[features]
y = data['close']
# Normalized the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
from sklearn.tree import DecisionTreeRegressor
# Create a Decision Tree Regressor with hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, 10], # Limits tree depth to control overfitting
    'min_samples_split': [2, 5, 10], # Minimum samples required for a split
    'min_samples_leaf': [1, 2, 4], # Minimum samples required for a leaf node
}
decision_tree = DecisionTreeRegressor()
grid_search = GridSearchCV(decision_tree, param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print("Best Hyperparameters:", grid_search.best_params_)
# Make predictions and evaluate the model
y_pred = best_model.predict(X_test)
The result we got is:
Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5}
Mean Absolute Error (MAE): 1.2140667648368844
Mean Squared Error (MSE): 6.3588971045335985
R-squared: 0.9983098434748484
Prediction using Random Forest Model
Random Forest is a more advanced version of Decision Trees. Instead of just one Decision Tree, you build many trees, each trained on a different random sample of the data, and then combine their results to get a final prediction. The idea is that by using multiple trees, you get a more reliable and accurate result. It’s like asking a group of experts for their opinions instead of just one: it helps avoid mistakes and gives a more balanced answer.
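The "many experts" idea can be sketched with the simplest possible trees: one-split stumps, each trained on a bootstrap sample, with their predictions averaged. This is a deliberately reduced illustration under assumed names and toy data. It shows only the bootstrap-and-average part, whereas a real Random Forest grows full trees and can also subsample features at each split.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree on one feature: (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        error = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # no valid split: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float('inf'), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Toy data with a jump between x=4 and x=5 (illustrative values only)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 11, 12, 13, 50, 51, 52, 53]

forest = fit_forest(xs, ys)
low, high = predict_forest(forest, 2), predict_forest(forest, 7)
print(low, high)
```

Because every stump sees a slightly different sample, their individual mistakes differ, and averaging smooths them out.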
# Selected features and target variable
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d', 'timestamp', 'year', 'month', 'day', 'day_of_week', 'moving_avg_20', 'price_diff']
X = data[features]
y = data['close']
# Normalized the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestRegressor
# Hyperparameter tuning with GridSearchCV for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200], # Number of trees in the forest
    'max_depth': [3, 5, 10], # Maximum depth of the trees
    'min_samples_split': [2, 5], # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4], # Minimum samples required to be at a leaf node
    'bootstrap': [True, False], # Whether bootstrap samples are used
}
random_forest = RandomForestRegressor()
grid_search = GridSearchCV(random_forest, param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
# Make predictions and evaluate the model
y_pred = best_model.predict(X_test)
The result we got is:
Best Hyperparameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Mean Absolute Error (MAE): 0.8446546479035453
Mean Squared Error (MSE): 2.3546977587982285
R-squared: 0.9993741355275343
Prediction using AdaBoost Model
AdaBoost, short for Adaptive Boosting, is a machine learning technique that combines multiple weak learners to create a strong model. A weak learner is a simple model that doesn’t do well on its own but can contribute to a more robust model when combined with others.
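In AdaBoost, we start with a basic model and then focus on the errors it makes. Every training sample carries a weight. After each round, the samples that the current model predicted poorly receive relatively more weight, so the next model concentrates on them. This process is repeated for a fixed number of rounds, and the final prediction combines all the models, giving more say to the more accurate ones. In a regression task like this one, “predicted poorly” means a large prediction error rather than a misclassification.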
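The reweighting step can be sketched as follows. scikit-learn documents its `AdaBoostRegressor` as an implementation of the AdaBoost.R2 algorithm. The function below follows that scheme in simplified form (linear loss, no learning rate, a single round) and is an illustration under those assumptions rather than the library's code. The sample values are made up.

```python
def adaboost_r2_reweight(weights, y_true, y_pred):
    """One sample-reweighting round in the style of AdaBoost.R2 with linear loss."""
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    max_error = max(abs_errors)
    if max_error == 0:
        return list(weights), 0.0  # perfect fit: nothing to reweight
    losses = [e / max_error for e in abs_errors]            # each in [0, 1]
    avg_loss = sum(w * l for w, l in zip(weights, losses))  # weighted average loss
    if avg_loss >= 0.5:
        raise ValueError("learner too inaccurate; boosting would stop here")
    beta = avg_loss / (1.0 - avg_loss)                      # below 1 when avg_loss < 0.5
    # Well-predicted samples (loss near 0) are shrunk by about beta;
    # poorly predicted samples (loss near 1) keep their weight.
    new_weights = [w * beta ** (1.0 - l) for w, l in zip(weights, losses)]
    total = sum(new_weights)
    return [w / total for w in new_weights], beta


# Four samples with equal starting weights; the last prediction is far off
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [100.0, 111.0, 120.0, 140.0]

new_weights, beta = adaboost_r2_reweight(weights, y_true, y_pred)
print(new_weights, beta)
```

After normalization, the badly predicted fourth sample holds the largest share of the weight, which is what steers the next weak learner toward it.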
# Selected features and target variable
features = ['open', 'high', 'low', 'volume', 'adjusted_close', 'change_percent', 'avg_vol_20d', 'timestamp', 'year', 'month', 'day_of_week', 'moving_avg_20', 'price_diff']
X = data[features]
y = data['close']
# Normalized the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
# Hyperparameter tuning with GridSearchCV for AdaBoost
# scikit-learn 1.2 renamed the `base_estimator` argument to `estimator`; older releases, including the run reported here, use `base_estimator`
param_grid = {
    'n_estimators': [50, 100, 200], # Number of boosting stages
    'learning_rate': [0.01, 0.1, 1], # Learning rate
    'estimator__max_depth': [1, 2, 3], # Depth of base decision tree
}
# Use a DecisionTreeRegressor as the base estimator
base_estimator = DecisionTreeRegressor()
# AdaBoost with GridSearchCV
adaboost = AdaBoostRegressor(estimator=base_estimator)
grid_search = GridSearchCV(adaboost, param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
# Get the best model and its parameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
# Make predictions and evaluate the model
y_pred = best_model.predict(X_test)
The result we got is:
Best Hyperparameters: {'base_estimator__max_depth': 3, 'learning_rate': 0.1, 'n_estimators': 200}
Mean Absolute Error (MAE): 4.0444096617399286
Mean Squared Error (MSE): 24.959422476673847
R-squared: 0.9933659359367698
Let’s Find the Best Model
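After tuning, all five regression models reached an R-squared above 0.99 on the test set, and the Naive Bayes classifier predicted the direction of the daily move with about 90% accuracy. Naive Bayes solves a different task (classification rather than regression), so its accuracy is not directly comparable with the regression metrics.
Random Forest is my chosen model because it has the best scores across the board: the lowest MAE (about 0.84), the lowest MSE (about 2.35), and the highest R-squared (about 0.9994). Random Forests also cope well with datasets that have many variables, and their feature importances help explain what drives the predictions, which makes the model a good fit for this use case.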
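The comparison can be reproduced from the numbers reported in the sections above. The sketch below hard-codes those test-set metrics (rounded) and ranks the regression models by MSE. Naive Bayes is left out because it was evaluated as a classifier, on accuracy.

```python
# Test-set metrics as reported in the sections above (rounded)
results = {
    'SVR (tuned)':   {'mae': 0.9939, 'mse': 31.0751, 'r2': 0.99174},
    'KNN':           {'mae': 2.9744, 'mse': 15.3978, 'r2': 0.99591},
    'Decision Tree': {'mae': 1.2141, 'mse': 6.3589,  'r2': 0.99831},
    'Random Forest': {'mae': 0.8447, 'mse': 2.3547,  'r2': 0.99937},
    'AdaBoost':      {'mae': 4.0444, 'mse': 24.9594, 'r2': 0.99337},
}

# Rank models from lowest to highest mean squared error
ranking = sorted(results, key=lambda name: results[name]['mse'])
for name in ranking:
    m = results[name]
    print(f"{name:<14} MAE={m['mae']:.4f}  MSE={m['mse']:.4f}  R2={m['r2']:.5f}")

best_by_mse = ranking[0]
best_by_mae = min(results, key=lambda name: results[name]['mae'])
```

Note that the tuned SVR has the second-lowest MAE but the highest MSE, which suggests that it is accurate on most days but misses badly on a few.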
Thanks for reading! I would love to hear your feedback, and if you need any help, feel free to contact me.