Capital Bikeshare is a bicycle-sharing system that serves Washington, D.C., and certain counties of the larger metropolitan area in both Maryland and Virginia. Users can select from electric or classic bikes at over 700 locations, either as a casual rider or with a monthly membership. The dataset is of October 2024.
The earliest well-known community bicycle started in 1965 in Amsterdam, Netherlands. Since then cities of all sizes and around the world have introduced programs. Bike-sharing is a way to solve the “last mile” problem of public transit networks, providing a bridge between people’s homes and public transit hub (“Bicycle-sharing systems”, 2024). In 2008, Washington, DC became the first city in North America to launch a bike-sharing system, called SmartBike DC. Within a couple years, Arlington, VA, Alexandria, VA, and Montgomery County, MD collaborated with DC to form Capital Bikeshare in 2010 (Capital Bikeshare,2022).
Since it’s inception, extensive research has been done on bike-sharing, and on Capital Bikeshare specifically. In 2024, Sadeghvarziri et. al. analyzed the influential factors and dynamics of bikeshare utilization in DC. This study included more than 3 million ride obvervations, and concluded temporal trends highlighted peak mid-weekday ridership, aligning with daily work commutes, and that docked bikes are favored on weekdays, electric bikes on weekends, and members are less likely to choose classic bikes. (Sadeghvaziri et. al., 2024). This valuable information can be used to make data-driven decisions about the program.
Additionally, bikesharing in DC plays an integral role in transit accessibility in DC. In 2024, Tushar & Buehler (2024), found more bike trails, bike lanes, and metrorail stations within one mile of neighborhood docking stations and shorter distances between neighborhood docking stations are associated with more bikeshare trips. This reinforces the importance of infrastructure with respect to attracting bikeshare users (Tushar & Buehler 2024).
file_path <- file.path(getwd(), "202410-capitalbikeshare-tripdata.csv")
capital_bikeshare <- read.csv(file_path, header = TRUE)
The dataset contains the following attributes.
colnames(capital_bikeshare)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
We began this project by evaluating the dataset for completeness and modified variables for usability.
colSums(is.na(capital_bikeshare))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 165814 0 171647
## start_lat start_lng end_lat end_lng
## 0 0 478 478
## member_casual
## 0
Dropping the 478 rows as they constitute only 0.066 percent of the dataset.
capital_bikeshare <- capital_bikeshare %>%
filter(!is.na(end_lat))
Distance (in miles) is derived using the latitude and
longitude coordinates of the start and end points. The
distVincentySphere() function from the
geosphere package in R is used to compute the geodesic
distance, factoring in the Earth’s curvature.# Calculate the distance for each row and add it as a new column (in miles)
capital_bikeshare$distance_miles <- mapply(function(lat1, lon1, lat2, lon2) {
distVincentySphere(c(lon1, lat1), c(lon2, lat2)) / 1000 * 0.621371
}, capital_bikeshare$start_lat, capital_bikeshare$start_lng, capital_bikeshare$end_lat, capital_bikeshare$end_lng)
started_at from ended_at.capital_bikeshare$started_at <- as.POSIXct(capital_bikeshare$started_at, format="%Y-%m-%d %H:%M:%OS")
capital_bikeshare$ended_at <- as.POSIXct(capital_bikeshare$ended_at, format="%Y-%m-%d %H:%M:%OS")
capital_bikeshare$ride_duration_mins <- as.numeric(difftime(capital_bikeshare$ended_at, capital_bikeshare$started_at, units = "mins"))
get_time_period <- function(time) {
hour <- as.numeric(format(time, "%H"))
if (hour >= 4 & hour < 6) {
return("Early Morning")
} else if (hour >= 6 & hour < 9) {
return("Morning")
} else if (hour >= 9 & hour < 12) {
return("Mid-Morning")
} else if (hour >= 12 & hour < 13) {
return("Midday")
} else if (hour >= 13 & hour < 17) {
return("Afternoon")
} else if (hour >= 17 & hour < 19) {
return("Evening")
} else if (hour >= 19 & hour < 21) {
return("Late Evening")
} else {
return("Night")
}
}
capital_bikeshare <- capital_bikeshare %>%
mutate(
start_time_period = sapply(started_at, get_time_period),
)
capital_bikeshare$day_type <- ifelse(wday(capital_bikeshare$started_at) %in% c(1, 7), "weekend", "weekday")
The following attributes are considered for further Exploratory Data Analysis (EDA):
filtered_capital_bikeshare <- capital_bikeshare %>% select(rideable_type,member_casual,distance_miles,ride_duration_mins, start_time_period,day_type)
Factorizing ‘rideable_type’, ‘member_casual’, ‘start_time_period’, and ‘day_type’ variables:
filtered_capital_bikeshare$rideable_type <- as.factor(filtered_capital_bikeshare$rideable_type)
filtered_capital_bikeshare$member_casual <- as.factor(filtered_capital_bikeshare$member_casual)
filtered_capital_bikeshare$start_time_period <- as.factor(filtered_capital_bikeshare$start_time_period)
filtered_capital_bikeshare$day_type <- as.factor(filtered_capital_bikeshare$day_type)
Removing outliers
#for distance_miles
z_scores_DM <- scale(filtered_capital_bikeshare$distance_miles)
outliers <- which(abs(z_scores_DM) > 3)
capital_bikeshare_clean <- filtered_capital_bikeshare[-outliers, ]
# for ride_duration_mins
z_scores_RDM <- scale(capital_bikeshare_clean$ride_duration_mins)
outliers <- which(abs(z_scores_RDM) > 2)
capital_bikeshare_clean <- capital_bikeshare_clean[-outliers, ]
After cleaning the dataset the number of rows are 703943.
The features considered for analysis are as follows:
rideable_type: This refers to the type of bike or
vehicle used for the ride, such as electric bikes or classic bike.
member_casual: This variable categorizes users based on
their membership status. “Member” refers to users who have a
subscription or membership, while “Casual” refers to users who use the
service on an ad-hoc, non-subscription basis.
distance_miles: This represents the distance traveled
during the ride, measured in miles. It quantifies how far the rider has
traveled using the bike-sharing service.
ride_duration_mins: This variable refers to the total
time, in minutes, that the ride lasts. It measures how long the bike or
vehicle is used during each ride.
start_time_period: This represents the time of day
during which the ride starts, such as morning, afternoon, or evening. It
helps to analyze patterns based on the time of use.
day_type: This variable refers to the type of day, such
as weekdays (Monday to Friday) or weekends (Saturday-Sunday). This is
useful for understanding usage patterns depending on the type of
day.
This project aims to predict user type (member versus casual) based on trip characteristics. Our exploratory data analysis focused first on understanding the fundamental shape of the dataset’s variables, and then on the relationship between the features and the predictor (member_casual). We used this information both to guide how we added features to the models, as well as to bolster the resultant findings from the models.
bike_type_counts <- table(capital_bikeshare_clean$rideable_type)
bike_type_df <- as.data.frame(bike_type_counts)
ggplot(bike_type_df, aes(x = Var1, y = Freq, fill = Var1)) +
geom_bar(stat = "identity") +
labs(title = "Frequency of Electric Bike vs Classic Bike Rentals",
x = "Bike Type",
y = "Frequency of Rentals") +
scale_fill_manual(values = c("electric_bike" = "blue", "classic_bike" = "green")) +
scale_y_continuous(labels = scales::comma)+
theme(legend.position = "none")
There is a higher frequency of electric bike rentals compared to classic bikes in the dataset indicates a preference for electric bikes.
Histogram
ggplot(capital_bikeshare_clean, aes(x = ride_duration_mins, fill = rideable_type)) +
geom_histogram(binwidth = 2, color = "black", alpha = 0.7, position = "dodge") +
labs(
title = "Distribution of Ride Duration by Bike Type",
x = "Ride Duration (Minutes)",
y = "Frequency"
) +
scale_fill_manual(values = c("electric_bike" = "green", "classic_bike" = "blue")) +
scale_y_continuous(labels = scales::comma) +
theme_minimal()
Box plot
ggplot(capital_bikeshare_clean, aes(x = rideable_type, y = ride_duration_mins, fill = rideable_type)) +
geom_boxplot() +
labs(title = "Boxplot of Ride Durations for Electric vs Classic Bikes",
x = "Bike Type",
y = "Ride Duration (minutes)") +
scale_fill_manual(values = c("electric_bike" = "blue", "classic_bike" = "green")) +
theme_minimal()
The distribution of ride duration by bike type is right-skewed, and the box plot highlights a significant number of extreme outliers.
Histogram
ggplot(capital_bikeshare_clean, aes(x = distance_miles, fill = rideable_type)) +
geom_histogram(binwidth = 0.3, color = "black", alpha = 0.7, position = "dodge") +
labs(
title = "Distribution of Ride Distance by Bike Type",
x = "Ride Distance (miles)",
y = "Frequency"
) +
scale_fill_manual(values = c("electric_bike" = "blue", "classic_bike" = "green")) +
scale_y_continuous(labels = scales::comma) +
theme_minimal()
Box-plot
ggplot(capital_bikeshare_clean, aes(x = rideable_type, y = distance_miles, fill = rideable_type)) +
geom_boxplot() +
labs(
title = "Box Plot of Ride Distance by Bike Type",
x = "Rideable Type",
y = "Distance (miles)"
) +
scale_fill_manual(values = c("electric_bike" = "blue", "classic_bike" = "green")) +
theme_minimal()
The distribution of ride distance by bike type is slightly more right-skewed for classic bikes compared to electric bikes, with electric bikes exhibiting a wider distribution than classic bikes.
user_type_counts <- table(capital_bikeshare_clean$member_casual)
user_type_df <- as.data.frame(user_type_counts)
Pie chart:
user_type_df <- user_type_df %>%
mutate(percentage = round(Freq / sum(Freq) * 100, 1))
ggplot(user_type_df, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Proportion of User Types (Member vs Casual)") +
scale_fill_manual(values = c("member" = "blue", "casual" = "green")) +
theme_void() +
geom_text(aes(label = paste(Var1, "\n", percentage, "%")), position = position_stack(vjust = 0.5)) +
theme(legend.position = "none")
There is a higher percentage of member-type riders in October.
user_bike_counts <- capital_bikeshare_clean %>%
group_by(member_casual, rideable_type) %>%
summarise(Freq = n())
ggplot(user_bike_counts, aes(x = rideable_type, y = Freq, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "User Type Distribution by Rideable Type",
x = "Rideable Type (Bike Type)",
y = "Frequency of Rides") +
scale_fill_manual(values = c("member" = "blue", "casual" = "green")) +
theme_minimal() +
scale_y_continuous(labels = scales::comma)
Electric bikes have a higher proportion of member-type riders compared to casual-type riders.
ggplot(capital_bikeshare_clean, aes(x = day_type, fill = interaction(member_casual, rideable_type))) +
geom_bar(position = "dodge") +
labs(title = "Ride characteristics on weekdays vs weekends",
x = "Day Type",
y = "Count",
fill = "Member Type & Rideable Type") +
theme_minimal()
- Overall, there are fewer rides on weekends, with a significant
decrease in rides by member-type users during the weekend.
The exploratory data analysis of the Capital Bikeshare dataset for October 2024 provides clear distinctions between the usage patterns of members and casual users, highlighting their differing motivations and behaviors. Members, who represent the majority of rides, exhibit consistent weekday activity, particularly during peak commuting hours. This behavior aligns with practical and routine travel needs, such as commuting or regular errands. In contrast, casual users are more active on weekends and prefer shorter trips, suggesting their primary use is leisure-oriented.
A key finding is the strong preference for electric bikes, particularly among casual users, who likely value the ease and accessibility they provide for infrequent trips. Ride duration and distance also serve as critical differentiators: members tend to take longer rides over greater distances, whereas casual users favor shorter rides. Temporal patterns further underscore the importance of time and day in shaping user behavior. Members dominate weekday rides, while casual users significantly increase their activity on weekends and evenings.
Our goal was to use trip characteristics to predict whether users are members or casual. This is a binary classification problem, so we utilized models suited to this type of problem: decision tree and logistic regression.
# K-fold cross-validation
set.seed(123)
fold <- floor(runif(nrow(capital_bikeshare_clean), 1, 6))
capital_bikeshare_clean$fold <- fold
#vector to store results
accuracy_results <- c()
# Loop through each fold
for (i in 1:5) {
# Create the test and training sets
test.set <- capital_bikeshare_clean[capital_bikeshare_clean$fold == i,]
train.set <- capital_bikeshare_clean[capital_bikeshare_clean$fold != i,]
# Train the decision tree model
Member_casual_tree <- rpart(member_casual ~ rideable_type + distance_miles + ride_duration_mins + start_time_period + day_type,
data = train.set,
method = "class")
# Visualize the tree
#rpart.plot(Member_casual_tree, type = 3, extra = 1, main = "Classification Tree for Member vs Casual Prediction")
# summary of the tree
printcp(Member_casual_tree)
# predictions on the test set
pred_prob <- predict(Member_casual_tree, test.set, type = "prob")
# probabilities to predicted classes (1 for 'Member', 2 for 'Casual')
pred_class <- ifelse(pred_prob[, 2] > 0.5, "casual", "member") # Assuming 2nd class (casual) is the positive class
# accuracy of the current fold
accuracy <- sum(pred_class == test.set$member_casual) / nrow(test.set)
accuracy_results[i] <- accuracy
}
##
## Classification tree:
## rpart(formula = member_casual ~ rideable_type + distance_miles +
## ride_duration_mins + start_time_period + day_type, data = train.set,
## method = "class")
##
## Variables actually used in tree construction:
## [1] ride_duration_mins rideable_type
##
## Root node error: 2e+05/562631 = 0.3
##
## n= 562631
##
## CP nsplit rel error xerror xstd
## 1 0.02 0 1 1 0.002
## 2 0.01 2 1 1 0.002
##
## Classification tree:
## rpart(formula = member_casual ~ rideable_type + distance_miles +
## ride_duration_mins + start_time_period + day_type, data = train.set,
## method = "class")
##
## Variables actually used in tree construction:
## [1] ride_duration_mins rideable_type
##
## Root node error: 2e+05/563186 = 0.3
##
## n= 563186
##
## CP nsplit rel error xerror xstd
## 1 0.02 0 1 1 0.002
## 2 0.01 2 1 1 0.002
##
## Classification tree:
## rpart(formula = member_casual ~ rideable_type + distance_miles +
## ride_duration_mins + start_time_period + day_type, data = train.set,
## method = "class")
##
## Variables actually used in tree construction:
## [1] ride_duration_mins rideable_type
##
## Root node error: 2e+05/563366 = 0.3
##
## n= 563366
##
## CP nsplit rel error xerror xstd
## 1 0.02 0 1 1 0.002
## 2 0.01 2 1 1 0.002
##
## Classification tree:
## rpart(formula = member_casual ~ rideable_type + distance_miles +
## ride_duration_mins + start_time_period + day_type, data = train.set,
## method = "class")
##
## Variables actually used in tree construction:
## [1] ride_duration_mins rideable_type
##
## Root node error: 2e+05/563361 = 0.3
##
## n= 563361
##
## CP nsplit rel error xerror xstd
## 1 0.02 0 1 1 0.002
## 2 0.01 2 1 1 0.002
##
## Classification tree:
## rpart(formula = member_casual ~ rideable_type + distance_miles +
## ride_duration_mins + start_time_period + day_type, data = train.set,
## method = "class")
##
## Variables actually used in tree construction:
## [1] ride_duration_mins rideable_type
##
## Root node error: 2e+05/563228 = 0.3
##
## n= 563228
##
## CP nsplit rel error xerror xstd
## 1 0.02 0 1 1 0.002
## 2 0.01 2 1 1 0.002
# average accuracy across all folds
mean_accuracy <- mean(accuracy_results)
print(paste("Average accuracy across all folds: ", round(mean_accuracy, 4)))
## [1] "Average accuracy across all folds: 0.3113"
This analysis focuses on building a decision tree model to explore how features like rideable type, distance, ride duration, time of day, and day type influence the classification of bike riders as either “Member” or “Casual.” The methodology below outlines the steps for training, evaluating, and visualizing the model using K-fold cross-validation.
We assign each observation a fold number ranging from 1 to 5. This division allows each fold to act as a test set in turn while the rest of the data is used for training.
Decision Tree Model Training
We use the rpart package in R to build a decision tree
classifier. The target variable, member_casual, is predicted based on
features - rideable_type, distance_miles,
ride_duration_mins, start_time_period,
day_type.
For each fold, the decision tree model is trained on the training dataset using the rpart function, with the method specified as “class” since it is a classification problem. The model is then summarized to obtain key information about the tree’s structure.
Model Evaluation
After training the model on the training set, predictions are made on the test set. The predicted probabilities for each class (“Member” or “Casual”) are generated using the predict function. The predicted class is then determined by comparing the probabilities to a threshold of 0.5 (i.e., if the probability of being “Casual” exceeds 0.5, the prediction is classified as “Casual”; otherwise, it is classified as “Member”).
The accuracy for each fold is calculated by comparing the predicted
class labels to the true labels in the test set. These accuracies are
stored in the accuracy_results vector.
Accuracy Calculation
After completing all five folds, the average accuracy across all folds is calculated. This average provides a more robust evaluation of the model’s performance, as it considers the variation across different subsets of the data.
# final tree visualization
rpart.plot(Member_casual_tree, type = 3, extra = 1, main = "Final Classification Tree")
Final Decision Tree Visualization
After completing the cross-validation process, a final decision tree is trained using the entire dataset. This final model is visualized using the rpart.plot function to present a clear representation of the decision tree, showing how different features contribute to the classification of bike riders.
1] After running the classification tree, we found that the two most important variables in predicting whether a rider is a member or casual were ride duration and rideable type. This suggests that the length of the ride and the type of bike used are key differentiators between the two groups. For example, members might use bikes for longer commutes, while casual riders could tend to make shorter trips. Similarly, the type of bike may indicate a preference for certain users, with members possibly opting for more electric bikes.
2] The root node error (the initial error before any splits) was 0.3% for each fold, which provides an initial baseline for the classification accuracy. The cross-validation error, indicated as xerror, remained consistently low across different subsets of the data, fluctuating by 0.002 in standard deviation.
3] The average accuracy across all the folds came out to be around 31.13%.While this is a modest result, this decision tree model provides valuable insights into how trip characteristics influence whether a rider is a member or a casual user. Another interesting limitation of this tree is that the number of splits is 3. This reflects that there is an imbalance between members and casual riders in the dataset, the tree might favor the majority class and not make many splits, as there’s less incentive to separate the minority class.
Model selection started with duration (ride_duration_mins) only and then added bike type (rideable_type) because the decision tree indicated these were the most important variables. Because this dataset does not have many variables, and nearly all are statistically significant in the full model, we elected a stepwise approach to model building, and used ANOVA to compare model quality.
riderLogit01 <- glm(member_casual ~ ride_duration_mins, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit01)
riderLogit02 <- glm(member_casual ~ rideable_type + ride_duration_mins, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit02)
riderLogit03 <- glm(member_casual ~ rideable_type + ride_duration_mins + day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit03)
riderLogit04 <- glm(member_casual ~ rideable_type + distance_miles + ride_duration_mins + day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit04)
riderLogit05 <- glm(member_casual ~ rideable_type * ride_duration_mins, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit05)
riderLogit06 <- glm(member_casual ~ rideable_type * ride_duration_mins * day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit06)
riderLogit07 <- glm(member_casual ~ rideable_type * distance_miles * ride_duration_mins * day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit07)
riderLogit08 <- glm(member_casual ~ rideable_type * distance_miles * ride_duration_mins * day_type * start_time_period, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit08)
riderLogit09 <- glm(member_casual ~ rideable_type + distance_miles + ride_duration_mins + start_time_period + day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit09)
riderLogit10 <- glm(member_casual ~ rideable_type * ride_duration_mins + distance_miles + start_time_period + day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit10)
riderLogit11 <- glm(member_casual ~ distance_miles + rideable_type * ride_duration_mins + day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit11)
We then ran a series of ANOVA tests to compare models and select the best.
anova_result <- anova(riderLogit01, riderLogit02, riderLogit03, riderLogit04, riderLogit05, riderLogit06, riderLogit07, riderLogit08, riderLogit09, test = "Chisq")
print(anova_result)
anova_result <- anova(riderLogit04, riderLogit09, test = "Chisq")
print(anova_result)
anova_result <- anova(riderLogit09, riderLogit10, test = "Chisq")
print(anova_result)
anova_result <- anova(riderLogit04, riderLogit11, test = "Chisq")
print(anova_result)
anova_result <- anova(riderLogit11, riderLogit10, test = "Chisq")
print(anova_result)
anova_result <- anova(riderLogit04, riderLogit10, test = "Chisq")
print(anova_result)
## Analysis of Deviance Table
##
## Model 1: member_casual ~ rideable_type + distance_miles + ride_duration_mins +
## day_type
## Model 2: member_casual ~ rideable_type * ride_duration_mins + distance_miles +
## start_time_period + day_type
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 703938 855692
## 2 703930 847441 8 8251 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We also checked the Variance Inflation Factor to ensure there was no multicollinearity.
vif(riderLogit10)
## GVIF Df GVIF^(1/(2*Df))
## rideable_type 2.69 1 1.64
## ride_duration_mins 2.98 1 1.73
## distance_miles 1.42 1 1.19
## start_time_period 1.05 7 1.00
## day_type 1.04 1 1.02
## rideable_type:ride_duration_mins 3.90 1 1.97
Model Selection
Initially Models 4 and 6 had the best reduction in residual deviance for minimal change in degrees of freedom that was statistically significant. However the VIF for model 6 indicated a significant amount of multicollinearity. So then we compared model 4 to the full model (9), which showed a smaller but still statistically significant reduction to residual deviance. However this model increased the complexity by 7 degrees of freedom. We tried 2 other versions of the model, incorporating the interaction terms between duration and bike type, but with and without the time-of-day (start_time_period). Model10(riderLogit10) including all variables and the interaction between duration and bike type is a better fit when compared via ANOVA to the model without time-of-day, with a statistically significant reduction in residual deviance.
riderLogit10 <- glm(member_casual ~ rideable_type * ride_duration_mins + distance_miles + start_time_period + day_type, data = capital_bikeshare_clean, family = "binomial")
summary(riderLogit10)
##
## Call:
## glm(formula = member_casual ~ rideable_type * ride_duration_mins +
## distance_miles + start_time_period + day_type, family = "binomial",
## data = capital_bikeshare_clean)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.517033 0.008977 168.98
## rideable_typeelectric_bike -0.613695 0.008967 -68.44
## ride_duration_mins -0.062254 0.000432 -144.11
## distance_miles 0.119757 0.003637 32.93
## start_time_periodEarly Morning 0.002128 0.030383 0.07
## start_time_periodEvening 0.093471 0.007748 12.06
## start_time_periodLate Evening 0.095659 0.009251 10.34
## start_time_periodMid-Morning 0.077260 0.008767 8.81
## start_time_periodMidday 0.014468 0.012169 1.19
## start_time_periodMorning 0.404484 0.009946 40.67
## start_time_periodNight 0.016833 0.009138 1.84
## day_typeweekend -0.418876 0.005843 -71.69
## rideable_typeelectric_bike:ride_duration_mins 0.040408 0.000517 78.13
## Pr(>|z|)
## (Intercept) <2e-16 ***
## rideable_typeelectric_bike <2e-16 ***
## ride_duration_mins <2e-16 ***
## distance_miles <2e-16 ***
## start_time_periodEarly Morning 0.944
## start_time_periodEvening <2e-16 ***
## start_time_periodLate Evening <2e-16 ***
## start_time_periodMid-Morning <2e-16 ***
## start_time_periodMidday 0.234
## start_time_periodMorning <2e-16 ***
## start_time_periodNight 0.065 .
## day_typeweekend <2e-16 ***
## rideable_typeelectric_bike:ride_duration_mins <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 888015 on 703942 degrees of freedom
## Residual deviance: 847441 on 703930 degrees of freedom
## AIC: 847467
##
## Number of Fisher Scoring iterations: 4
Logistic Regression Results
The intercept is log-odds 1.52, the log-odds of being a member when the bike is a classic, ride duration and distance are 0, it is the afternoon, and it is a weekday.
Electric bike (versus classic bike) reduces the log-odds of being a member by log-odds 0.631.
Each additional minute of ride duration decrease the log-odds of being a member by 0.062.
Each additional mile of distance ridden increases the log-odds of being a member b 0.12.
Evening, Late evening, Mid-morning, Morning increase the log-odds of being a member compared to afternoon.
Early morning, midday, and Night are not statistically significant at alpha = 0.05.
Weekend rides reduce the likelihood of being a member by log-odds 0.42.
The interaction term indicates that for each additional minute of ride duration on an electric bike, the negative effect of ride duration (see above) on the log-odds of being a member is reduced by 0.04. This means longer rides on electric bikes are less strongly associated with with casual riders on classic bikes.
exp(coef(riderLogit10))
## (Intercept)
## 4.559
## rideable_typeelectric_bike
## 0.541
## ride_duration_mins
## 0.940
## distance_miles
## 1.127
## start_time_periodEarly Morning
## 1.002
## start_time_periodEvening
## 1.098
## start_time_periodLate Evening
## 1.100
## start_time_periodMid-Morning
## 1.080
## start_time_periodMidday
## 1.015
## start_time_periodMorning
## 1.499
## start_time_periodNight
## 1.017
## day_typeweekend
## 0.658
## rideable_typeelectric_bike:ride_duration_mins
## 1.041
Odds Ratio Interpretation
When the log-odds are exponentiated, we can use the odds ratios (OR) to better understand the impact of the variables.
The intercept is 4.559 meaning the odds of being a member are 4.56 times higher than being a casual user when the bike is a classic, it is a weekday afternoon, and ride duration and distance are 0.
The odds ratio for electric bike is 0.541, meaning riders on electric bikes are 46% (1-0.541) less likely to be members compared to classic bikes.
The odds ratio for duraiton is 0.94, meaning each additional minute of ride duration decreases the odds of being a member by 6% (1-0.94).
The odds ratio for distance is 1.127, meaning each additional mile of distance increases the odds of being a member by 13%.
The odds ratio for morning rides is 1.499, meaning meaning morning rides are 50% more likely to be a member than afternoon rides.
The odds ratio for mid-morning is 1.080, meaning odds of being a member is 8% higher than in the afternoon.
The odds ratio for evening is 1.098, meaning the odds of being a member is 9.8% higher than in the afternoon.
The odds ratio for late evening is 1.10, meaning the odds of being a memberis 10% higher than in the afternoon.
The odds ratio for weekend rides is 0.568, meaning riders on weekends are 34% (1-0.658) less likely to be members.
The odds ratio for the interaction between bike type and duration is 1.041, meaning each additional minute of ride time on electric bikes increases the odds of being a memebr by 4.1%, which counteracts the negative effect of longer rides overall.
null_deviance <- 888015
residual_deviance <- 847441
delta_df <- 703942 - 703930
delta_D <- null_deviance - residual_deviance
p_value <- pchisq(delta_D, df = delta_df, lower.tail = FALSE)
cat("Change in deviance:", delta_D, "\nDegrees of freedom:", delta_df, "\nP-value:", p_value, "\n")
## Change in deviance: 40574
## Degrees of freedom: 12
## P-value: 0
Deviance Test
A Deviance Test was performed to assess model fit.
H0: The null model is better than the model with predictors
H1: The model will predictors is better/improves model fit.
This model shows a substantial reduction in deviance (40574) for 12 degrees of freedom with a p-value essentially equal to zero. We reject H0 and conclude the model with predictors fits the data significantly better than the null model.
predicted_probs <- predict(riderLogit10, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, "member", "casual")
predicted_classes <- factor(predicted_classes, levels = c("member", "casual"))
actual_classes <- factor(capital_bikeshare_clean$member_casual, levels = c("member", "casual"))
conf_matrix <- confusionMatrix(predicted_classes, actual_classes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction member casual
## member 458581 199694
## casual 16424 29244
##
## Accuracy : 0.693
## 95% CI : (0.692, 0.694)
## No Information Rate : 0.675
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.118
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.965
## Specificity : 0.128
## Pos Pred Value : 0.697
## Neg Pred Value : 0.640
## Prevalence : 0.675
## Detection Rate : 0.651
## Detection Prevalence : 0.935
## Balanced Accuracy : 0.547
##
## 'Positive' Class : member
##
accuracy <- conf_matrix$overall["Accuracy"]
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.693"
Confusion Matrix and Accuracy Tests
This model is approximately 70% accurate, which is less than the 80% target but decent considering the data for this month is moderately skewed toward members. The No Information Rate (NIR) is 0.675 or 67.5%, which is the accuracy of a naive model always predicting the majority class - this model is statistically significantly better than the NIR. However the Kappa is low, suggesting limited improvement over guessing and Mcnemar’s test indicates the model misclassifies disporportionately.Not surprisingly, this model is very sensitive (very good at identifying members), but not very specific (not good at identifying casual riders).
predicted_probs <- predict(riderLogit10, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.61, "member", "casual")
predicted_classes <- factor(predicted_classes, levels = c("member", "casual"))
actual_classes <- factor(capital_bikeshare_clean$member_casual, levels = c("member", "casual"))
conf_matrix <- confusionMatrix(predicted_classes, actual_classes)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction member casual
## member 411492 162584
## casual 63513 66354
##
## Accuracy : 0.679
## 95% CI : (0.678, 0.68)
## No Information Rate : 0.675
## P-Value [Acc > NIR] : 2.33e-13
##
## Kappa : 0.176
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.866
## Specificity : 0.290
## Pos Pred Value : 0.717
## Neg Pred Value : 0.511
## Prevalence : 0.675
## Detection Rate : 0.585
## Detection Prevalence : 0.816
## Balanced Accuracy : 0.578
##
## 'Positive' Class : member
##
accuracy <- conf_matrix$overall["Accuracy"]
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.6788"
Confusion Matrix and Accuracy Test with Increased Cutoff
To improve sensitivity and specificity, we changed the cutoff to 0.61 (increased from 0.5). This reduced the sensitivity from 0.965 to 0.866, increased the specificity from 0.128 to 0.290, and only minimally reduced the accuracy from 0.693 to 0.679. At the 0.61 cutoff, this model is still statistically better than the NIR; at 0.62 it is not, so 0.61 is the likely the best cutoff.
predicted_probs10 <- predict(riderLogit10, newdata = capital_bikeshare_clean, type = "response")
roc_curve10 <- roc(response = capital_bikeshare_clean$member_casual, predicted_probs10, levels = c("casual", "member"))
plot(roc_curve10, main = "ROC Curve for Logistic Regression Model", col = "blue", lwd = 2)
auc_value10 <- auc(roc_curve10)
print(paste("AUC (ModelMetrics):", round(auc_value10, 4)))
## [1] "AUC (ModelMetrics): 0.6337"
Reciever Operating Characteristic (ROC) Curve and Area Under the Curve (AUC)
The AUC of the model is 0.6337 which is certainly below the target of 0.8, but still better than the null model.
riderLogit10pr2 = pR2(riderLogit10)
## fitting null model for pseudo-r2
riderLogit10pr2
## llh llhNull G2 McFadden r2ML r2CU
## -4.24e+05 -4.44e+05 4.06e+04 4.57e-02 5.60e-02 7.81e-02
Measures of Pseudo R-square including McFadden’s
The McFadden’s R-square is 0.0457 or (4.57%), which indicates this model explains a small but meaningful portion of variation in outcome compared to the null model.
Riders are overall more likely to be members. Electric bikes and longer rides reduce the odds, though less so if the longer ride is on an electric bike. Longer distances increase the odds of membership, while weekends reduce the odds. Morning rides strongly increase the odds of membership.
While the decision tree was not highly accurate for predicting member type (~31%), the logistic regression model was approximately 70% accurate.This is a reasonable result given the small number of variables given.
Additionally, this dataset is a small snapshot of users and may not accurately reflect the true distribution of users over the course of the year. In order to accommodate the imbalance in the classes, we could create a subset sample that oversamples the casual riders (the minority class) and undersamples the member riders (the majority class), to create a balanced sample for the the analysis. However we did not feel the imbalance was so extreme that this would be necessary. Additionally, we suspect that if an entire year were sampled, the classes may be more balanced. The results of the study by Sadeghvazi et al. (2024) were likely different, and even opposite, from ours because they examined data from an entire year of usage.
There are many additional factors that likely affect class distribution and bike usage including time of year/season, weather, sunset times. Capital Bikeshare usage is also likely affected by geospatial factors like distance between home and a docking station, and distance to other forms of public transit. Finally, socioeconomic factors also likely have a significant impact. Future research might examine an integration of these variables into the current model for more accurate outcomes.
Bicycle-sharing systems. (2024, November 7). In Wikipedia. https://en.wikipedia.org/wiki/Bicycle-sharing_system
About Company & History. (n.d). Capital Bikeshare. Retrieved December 13, 2024, from https://ride.capitalbikeshare.com/about
Sadeghvaziri, E., Javid, R., Constantin, N. (2024). Analyzing the Influential Factors and Dynamics of Bikeshare Utilization in Washington, DC. International Conference on Transportation and Development. https://doi.org/10.1061/9780784485521.028
Tushar, M.D., & Buehler, R. (2024). Bikeshare–Metrorail Integration in Washington, D.C.: What are the Characteristics of Neighborhoods that Encourage Capital Bikeshare Trips to and from the Metrorail? Journal of the Transportation Research Board, 2678(10). https://doi.org/10.1177/03611981241233275