Mini Project #1: Fiscal Characteristics of Major US Public Transit Systems

Author

Elissa Leung

Introduction

The major United States public transit systems contribute significantly to individuals daily transportation, allowing commuters alternative and often more affordable transit options for commutes. The goal of this project is to explore different characteristics of the major transit systems in the United States, based on data from the National Transit Database. More specifically, the analysis includes data from the 2022 Fare Revenue table, the most recent Monthly Ridership table, and the 2022 Operating Expenses reports. This report intends to consider various statistics including farebox recovery, ridership, total trips, total vehicle miles traveled, total revenues and expenses for various locations and transit agencies, to analyze the performance of these transit systems over time. Ultimately, the analysis of certain metrics will assist in determining the most efficient transit system in the United States.

Data Cleaning

Before starting the analysis, the relevant data files need to be loaded and cleaned into data frames on R. Below consists of the code required to download the data files and create the relevant data frames by merging different tables. From the original data files four data frames are created: FARES, EXPENSES, TRIPS, and MILES. Following this, FARES and EXPENSES are merged into the FINANCIALS data frame and TRIPS and MILES are merged into the USAGE data frame. From this point forward, only the USAGE and FINANCIALS data frames will be necessary to conduct the analysis.

if(!require("dplyr")) install.packages("dplyr")

Loading required package: dplyr


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(dplyr)

if(!require("tidyverse")) install.packages("tidyverse")

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Let's start with Fare Revenue
library(tidyverse)
if(!file.exists("2022_fare_revenue.xlsx")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "2022_fare_revenue.xlsx" in your project
    # directory.
    download.file("http://www.transit.dot.gov/sites/fta.dot.gov/files/2024-04/2022%20Fare%20Revenue.xlsx", 
                  destfile="2022_fare_revenue.xlsx", 
                  quiet=FALSE, 
                  method="wget")
}
FARES <- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
    select(-`State/Parent NTD ID`, 
           -`Reporter Type`,
           -`Reporting Module`,
           -`TOS`,
           -`Passenger Paid Fares`,
           -`Organization Paid Fares`) |>
    filter(`Expense Type` == "Funds Earned During Period") |>
    select(-`Expense Type`) |>
    group_by(`NTD ID`,       # Sum over different `TOS` for the same `Mode`
             `Agency Name`,  # These are direct operated and sub-contracted 
             `Mode`) |>      # of the same transit modality
                             # Not a big effect in most munis (significant DO
                             # tends to get rid of sub-contractors), but we'll sum
                             # to unify different passenger experiences
    summarize(`Total Fares` = sum(`Total Fares`)) |>
    ungroup()

`summarise()` has grouped output by 'NTD ID', 'Agency Name'. You can override
using the `.groups` argument.

# Next, expenses
if(!file.exists("2022_expenses.csv")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "2022_expenses.csv" in your project
    # directory.
    download.file("https://data.transportation.gov/api/views/dkxx-zjd6/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true", 
                  destfile="2022_expenses.csv", 
                  quiet=FALSE, 
                  method="wget")
}
EXPENSES <- readr::read_csv("2022_expenses.csv") |>
    select(`NTD ID`, 
           `Agency`,
           `Total`, 
           `Mode`) |>
    mutate(`NTD ID` = as.integer(`NTD ID`)) |>
    rename(Expenses = Total) |>
    group_by(`NTD ID`, `Mode`) |>
    summarize(Expenses = sum(Expenses)) |>
    ungroup()

Rows: 3744 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Agency, City, State, NTD ID, Organization Type, Reporter Type, UZA...
dbl  (2): Report Year, UACE Code
num (10): Primary UZA Population, Agency VOMS, Mode VOMS, Vehicle Operations...
lgl  (7): Vehicle Operations Questionable, Vehicle Maintenance Questionable,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'NTD ID'. You can override using the `.groups` argument.

FINANCIALS <- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`))

# Monthly Transit Numbers
library(tidyverse)
if(!file.exists("ridership.xlsx")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "ridership.xlsx" in your project
    # directory.
    download.file("https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/July%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240903.xlsx", 
                  destfile="ridership.xlsx", 
                  quiet=FALSE, 
                  method="wget")
}
TRIPS <- readxl::read_xlsx("ridership.xlsx", sheet="UPT") |>
            filter(`Mode/Type of Service Status` == "Active") |>
            select(-`Legacy NTD ID`, 
                   -`Reporter Type`, 
                   -`Mode/Type of Service Status`, 
                   -`UACE CD`, 
                   -`TOS`) |>
            pivot_longer(-c(`NTD ID`:`3 Mode`), 
                            names_to="month", 
                            values_to="UPT") |>
            drop_na() |>
            mutate(month=my(month)) # Parse _m_onth _y_ear date specs
MILES <- readxl::read_xlsx("ridership.xlsx", sheet="VRM") |>
            filter(`Mode/Type of Service Status` == "Active") |>
            select(-`Legacy NTD ID`, 
                   -`Reporter Type`, 
                   -`Mode/Type of Service Status`, 
                   -`UACE CD`, 
                   -`TOS`) |>
            pivot_longer(-c(`NTD ID`:`3 Mode`), 
                            names_to="month", 
                            values_to="VRM") |>
            drop_na() |>
            group_by(`NTD ID`, `Agency`, `UZA Name`, 
                     `Mode`, `3 Mode`, month) |>
            summarize(VRM = sum(VRM)) |>
            ungroup() |>
            mutate(month=my(month)) # Parse _m_onth _y_ear date specs

`summarise()` has grouped output by 'NTD ID', 'Agency', 'UZA Name', 'Mode', '3
Mode'. You can override using the `.groups` argument.

USAGE <- inner_join(TRIPS, MILES) |>
    mutate(`NTD ID` = as.integer(`NTD ID`))

Joining with `by = join_by(`NTD ID`, Agency, `UZA Name`, Mode, `3 Mode`,
month)`

Note

The code below checks for possible lingering null values in the USAGE and FINANCIALS data frames before conducting analysis. An output of 0 for each of the data frames ensures that there are no null values in these data frames.

USAGE |>
  is.na() |>
  sum()

[1] 0

FINANCIALS |>
  is.na() |>
  sum()

[1] 0

Since, both of these chunks produced an output of 0, there are no null values in these data tables that may alter calculations and analysis in the future.

The USAGE table (shown below) provides transit system information including (but not limited to): agency, mode of transportation, total unlinked passenger trips, and total vehicle revenue miles.

if(!require("DT")) install.packages("DT")

Loading required package: DT

library(DT)

sample_n(USAGE, 1000) |> 
  mutate(month=as.character(month)) |> 
  arrange(`NTD ID`, month) |>
  DT::datatable(rownames = FALSE, 
                options = list(pageLength = 5))

Data Transformation

Prior to analyzing data, it is important to make data tables as clear as possible to ensure clarity and accurate interpretation of results. So, renaming the column UZA Name to metro_area clarifies that this column contains the location (city, state) of the transit systems.

USAGE <- USAGE |> 
  rename(metro_area = 'UZA Name')

Additionally, the Mode column contains some abbreviated codes, so converting these codes into understandable terms is helpful for future analysis.

Note

The code below is used help to streamline the process of converting the codes into the terms, by showing the unique codes in the Mode column. Ultimately making the process of searching for these terms in the National Transit Database (NTD) Glossary much more efficient.

unique(USAGE$Mode)

 [1] "DR" "FB" "MB" "SR" "TB" "VP" "CB" "RB" "LR" "YR" "MG" "CR" "AR" "TR" "HR"
[16] "IP" "PB" "CC"

USAGE <- USAGE |>
  mutate(Mode = case_when(
    Mode == "DR" ~ "Demand Response",
    Mode == "FB" ~ "Ferryboat",
    Mode == "MB" ~ "Bus",
    Mode == "SR" ~ "Streetcar",
    Mode == "TB" ~ "Trolleybus",
    Mode == "VP" ~ "Vanpool",
    Mode == "CB" ~ "Commuter Bus",
    Mode == "RB" ~ "Bus Rapid Transit",
    Mode == "LR" ~ "Light Rail",
    Mode == "YR" ~ "Hybrid Rail",
    Mode == "MG" ~ "Monorail/Automated Guideway Transit",
    Mode == "CR" ~ "Commuter Rail",
    Mode == "AR" ~ "Alaska Railroad",
    Mode == "TR" ~ "Aerial Tramways",
    Mode == "HR" ~ "Heavy Rail",
    Mode == "IP" ~ "Inclined Plane",
    Mode == "PB" ~ "Publico",
    Mode == "CC" ~ "Cable Car",
    TRUE ~ "Unknown"
  ))

Furthermore, to clean up the data table even more, the removal of columns NTD ID and 3 Mode could help clear unnecessary information before analysis. Renaming certain columns like UPT and VRM to unlinked_passenger_trips and vehicle_revenue_miles will also provide more clarity for interpretation. For this, a new data table USAGE_clean is created, shown below.

USAGE_clean <- USAGE |> 
  select(-`NTD ID`, -`3 Mode`) |> 
  rename(unlinked_passenger_trips = UPT, vehicle_revenue_miles = VRM)

sample_n(USAGE_clean, 1000) |> 
    mutate(month=as.character(month)) |> 
    DT::datatable(rownames = FALSE,
                  options = list(pageLength = 5))

Ridership Analysis

Preliminary Analysis:

Below are some preliminary questions to explore the USAGE_clean data table:

What transit agency had the most total VRM in our data set?
What transit mode had the most total VRM in our data set?
How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?
How much did NYC subway ridership fall between April 2019 and April 2020?

Question #1

transit_agency_max_total_VRM <- 
  USAGE_clean |>
  group_by(Agency) |>
  summarize(total_VRM = sum(vehicle_revenue_miles)) |>
  slice_max(total_VRM, n = 1) |>
  pull(Agency)

transit_agency_max_total_VRM

[1] "MTA New York City Transit"

transit_agency_max_total_VRM_dist <- 
  USAGE_clean |>
  group_by(Agency) |>
  summarize(total_VRM = sum(vehicle_revenue_miles)) |>
  slice_max(total_VRM, n = 1) |>
  pull(total_VRM)

transit_agency_max_total_VRM_dist

[1] 10832855350

The transit agency that had the most total vehicle revenue miles in the sample was MTA New York City Transit, with a total of about 10.83 billion vehicle revenue miles. Being that the MTA New York City Transit is one of the largest transit systems in the entire world and the largest in the US, it is unsurprising that this transit agency had the most total VRM in the data set.

Question #2

transit_mode_max_total_VRM <- 
  USAGE_clean |>
  group_by(Mode) |>
  summarize(total_VRM = sum(vehicle_revenue_miles)) |>
  slice_max(total_VRM, n = 1) |>
  pull(Mode)

transit_mode_max_total_VRM

[1] "Bus"

transit_mode_max_total_VRM_dist <- 
  USAGE_clean |>
  group_by(Mode) |>
  summarize(total_VRM = sum(vehicle_revenue_miles)) |>
  slice_max(total_VRM, n = 1) |>
  pull(total_VRM)

transit_mode_max_total_VRM_dist

[1] 49444494088

The transit mode with the most total vehicle revenue miles in this sample was the Bus, having a total of roughly 49.44 billion vehicle revenue miles. Buses make up a large portion of public transportation, especially in cities that don’t have heavy rail trains like NYC, so it is reasonable that buses had the most vehicle revenue miles.

Question #3

if(!require("lubridate")) install.packages("lubridate")
library(lubridate)

total_trips_NYC_subway_2024 <- USAGE_clean |>
  filter(Agency == "MTA New York City Transit", 
         Mode == "Heavy Rail", 
         year(month) == 2024) |>
  group_by(mon = month(month)) |>
  summarize(total_trips = sum(unlinked_passenger_trips))

total_trips_NYC_subway_2024

# A tibble: 7 × 2
    mon total_trips
  <dbl>       <dbl>
1     1   157917292
2     2   155722143
3     3   172235628
4     4   171458466
5     5   180458819
6     6   159891215
7     7   161133245

total_trips_NYC_subway <- total_trips_NYC_subway_2024 |>
  filter(mon == 5) |>
  pull(total_trips)

There were about 180.46 million total trips taken on the New York City Subway in May 2024. From the data table total_trips_NYC_subway_2024, it is evident that the average monthly ridership from January 2024 to April 2024, was less than the total trips in May 2024. This could be due to the season change, as the weather gets warmer from Winter into the Spring.

Question #4

ridership_drop_NYC_april <- USAGE_clean |>
  filter(Agency == "MTA New York City Transit", Mode == "Heavy Rail", month(month)==4) |>
  filter((year(month)==2019) | (year(month)==2020)) |>
  group_by(year = year(month)) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  arrange(year) |>
  summarize(ridership_drop = first(total_ridership)-last(total_ridership)) |>
  pull(ridership_drop)

ridership_drop_NYC_april

[1] 211969660

From April 2019 to April 2020, the NYC subway ridership fell by approxmately 211.97 million. The drastic drop in NYC subway ridership was likely due to the COVID-19 global pandemic which forced everyone to remain indoors and socially distance from others. So, it is understandable that such an extreme drop of 211.97 million occurred.

After analyzing the above statistics from the monthly ridership table, I wanted to find some additional information on the possible impact of COVID-19 on transit system ridership in the US.

Transit System Patterns Before and During COVID-19 Analysis

Several years ago, in 2020, the world experienced a global outbreak of COVID-19, impacting local and international economies. This event encouraged social distancing and prevented any significant travel for citizens besides essential workers. I was curious to see how our national transit systems were affected by this global event. More specifically, I wanted to explore the ridership changes for the various metro areas included in the national ridership data set. I created a data frame with the total ridership of each metro area in this time frame, then calculated the percent change of ridership from 2019 to 2020. I expected to find a decrease in ridership in all the cities from 2019 to 2020 and wanted to explore which cities were negatively impacted the most and least.

ridership_by_area_2019 <- USAGE_clean |>
  filter(year(month) == 2019) |>
  group_by(metro_area) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_by_area_2020 <- USAGE_clean |>
  filter(year(month) == 2020) |>
  group_by(metro_area) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_2019_2020 <- left_join(ridership_by_area_2019, ridership_by_area_2020, join_by(metro_area)) |>
  rename(total_ridership_2019 = total_ridership.x, total_ridership_2020 = total_ridership.y) |>
  mutate(change_in_ridership = total_ridership_2020-total_ridership_2019, percent_change = round(((change_in_ridership / total_ridership_2019) * 100), digits = 2), decrease = (percent_change < 0))

ridership_2019_2020 |> 
    DT::datatable(rownames = FALSE)

Greatest Decrease in Ridership from 2019-2020

greatest_ridership_drop_place <- ridership_2019_2020 |>
  slice_min(percent_change, n=1) |>
  pull(metro_area)

greatest_ridership_drop_percent <- ridership_2019_2020 |>
  slice_min(percent_change, n=1) |>
  pull(percent_change)

greatest_ridership_drop_place

[1] "Rome, GA"

greatest_ridership_drop_percent

[1] -91.56

After looking through the data, I found that the city with the greatest ridership drop from 2019 to 2020 was Rome, GA with a percentage change of -91.56%.

Furthermore, I wanted to explore more about the ridership in Rome, GA and analyze the breakdown of ridership by mode of transportation.

ridership_Rome_GA_2019 <- USAGE_clean |>
  filter(year(month) == 2019, metro_area == "Rome, GA") |>
  group_by(Mode) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_Rome_GA_2020 <- USAGE_clean |>
  filter(year(month) == 2020, metro_area == "Rome, GA") |>
  group_by(Mode) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_Rome_GA_2019_2020 <- left_join(ridership_Rome_GA_2019, ridership_Rome_GA_2020, join_by(Mode)) |>
  rename(total_ridership_2019 = total_ridership.x, total_ridership_2020 = total_ridership.y) |>
  mutate(change_in_ridership = total_ridership_2020-total_ridership_2019, percent_change = round(((change_in_ridership / total_ridership_2019) * 100), digits = 2), decrease = (percent_change < 0))

ridership_Rome_GA_2019_2020 |>
  DT::datatable(rownames = FALSE)

From this table above, there were only two recorded modes of transportation from Rome, GA: bus and demand response. The bus ridership changed by -92.75%, while the demand response ridership changed by -36.57%. The sharp decline in bus ridership could be due to the fear of taking public mass transit. Individuals needing to travel locally would probably have preferred utilizing personal vehicles. Overall, the decrease in ridership was typical for majority of metropolitan areas with transit systems.

Exploring Peculiar Increases in Ridership from 2019-2020

Subsequently, I was also curious to see which city was affected the least. When exploring this question, contrary to my assumption, there were a few cities that actually increased their ridership from 2019 to 2020.

positive_change <- ridership_2019_2020 |>
  filter(decrease == FALSE) |>
  arrange(desc(percent_change))

positive_change_cities <- positive_change$metro_area

positive_change_cities

[1] "Victoria, TX"       "Port St. Lucie, FL" "Las Cruces, NM"

The cities that presumably increased their ridership from 2019 to 2020 were Victoria, TX, Port St. Lucie, FL, Las Cruces, NM.

Important

Although finding an increase in ridership from 2019 to 2020 is not entirely impossible, it was probably unlikely to occur. So, it was important to do additional investigation to figure out why these calculations came out to be positive.

Upon further review of the data for Victoria, TX, it is apparent that no data was collected between January 2019 to August 2019, causing the total ridership for 2019 to be significantly lower than the totals from 2020.

tx_2019 <- USAGE_clean |>
  filter(year(month) == 2019, 
         metro_area == "Victoria, TX") |>
  group_by(month) |>
  summarize(total_UPT = sum(unlinked_passenger_trips)) |>
  ungroup()

tx_2020 <- USAGE_clean |>
  filter(year(month) == 2020, 
         metro_area == "Victoria, TX") |>
  group_by(month) |>
  summarize(total_UPT = sum(unlinked_passenger_trips)) |>
  ungroup()

tx_combined <- full_join(tx_2019, tx_2020) |>
  mutate(year = year(month),
         mon = month(month)) |>
  select(-month) |>
  pivot_wider(id_cols = c(mon),
              names_from = year,
              values_from = total_UPT) |>
  arrange(mon)

Joining with `by = join_by(month, total_UPT)`

tx_combined |>
  DT::datatable(rownames = FALSE)

tx_2019_total <- tx_combined$`2019` |>
  replace(is.na(tx_combined$`2019`), 0) |>
  sum()

tx_2019_total

[1] 107199

tx_2020_total <- tx_combined$`2020` |>
  sum()

tx_2020_total

[1] 269979

The total for Victoria, TX in 2019 was 107,199 compared to 2020 with a total of 269,979, a significant difference due to the missing data from 2019. Being that the information for 2019 is not entirely accessible, it is inconclusive whether the ridership in Victoria, TX increased or decreased from 2019 to 2020.

Contrary to Victoria, TX, the data for Las Cruces, NM was not nearly as drastic. Although, there seems to be an unexpected value for the total ridership in January 2019.

nm_2019 <- USAGE_clean |>
  filter(year(month) == 2019, 
         metro_area == "Las Cruces, NM") |>
  group_by(month) |>
  summarize(total_UPT = sum(unlinked_passenger_trips)) |>
  ungroup()

nm_2020 <- USAGE_clean |>
  filter(year(month) == 2020, 
         metro_area == "Las Cruces, NM") |>
  group_by(month) |>
  summarize(total_UPT = sum(unlinked_passenger_trips)) |>
  ungroup()

nm_combined <- full_join(nm_2019, nm_2020) |>
  mutate(year = year(month),
         mon = month(month)) |>
  select(-month) |>
  pivot_wider(id_cols = c(mon),
              names_from = year,
              values_from = total_UPT)

Joining with `by = join_by(month, total_UPT)`

nm_combined |>
  DT::datatable(rownames = FALSE)

jan_2019_nm <- nm_combined |>
  filter(mon == 1) |>
  pull(`2019`)

nm_combined_no_jan <- nm_combined |>
  filter(mon != 1)

nm_2019_min <- nm_combined_no_jan$`2019` |>
  min()

nm_2019_max <- nm_combined_no_jan$`2019` |>
  max()

From the table above, Las Cruces, NM, experienced a total ridership of 532 in January 2019, which seems highly unlikely given that the range of values for the remainder of 2019 were between 4,510 and 9,544. Observing the remaining data points, there’s a general increase in ridership from January 2019 until about January 2020. The Las Cruces, NM metro area experiences a steep ridership drop in April 2020, right around the peak of the pandemic. Overall, it seems that the ridership from January 2019 may have been undercounted, so it is unclear whether Las Cruces, NM experienced an increase or decrease in ridership from 2019 to 2020.

Lastly, for Port St. Lucie, FL, there is less of a concern for the 2019 data as the monthly ridership totals seem to fluctuate in the range of 14,000 to 18,000. However, in 2020, there is a noticeable increase in ridership in October 2020.

fl_2019 <- USAGE_clean |>
  filter(year(month) == 2019, 
         metro_area == "Port St. Lucie, FL") |>
  group_by(month) |>
  summarize(total_UPT = sum(unlinked_passenger_trips)) |>
  ungroup()

fl_2020 <- USAGE_clean |>
  filter(year(month) == 2020, 
         metro_area == "Port St. Lucie, FL") |>
  group_by(month) |>
  summarize(total_UPT = sum(unlinked_passenger_trips)) |>
  ungroup()

fl_combined <- full_join(fl_2019, fl_2020) |>
  mutate(year = year(month),
         mon = month(month)) |>
  select(-month) |>
  pivot_wider(id_cols = c(mon),
              names_from = year,
              values_from = total_UPT)

Joining with `by = join_by(month, total_UPT)`

fl_combined |>
  DT::datatable(rownames = FALSE)

sept_2020_fl <- fl_combined |>
  filter(mon == 9) |>
  pull(`2020`)

oct_2020_fl <- fl_combined |>
  filter(mon == 10) |>
  pull(`2020`)

fl_increase <- oct_2020_fl - sept_2020_fl

fl_increase_percent <- round((fl_increase / sept_2020_fl) * 100, digits = 0)

In September 2020, Port St. Lucie, FL experienced a ridership total of 6,883. In only one month, Port St. Lucie, FL increased its ridership by 43,535 (or 633%) to a total ridership in October 2020 of 50,418. While this seems a little unbelievable, it is possible that during the latter part of 2020, people were beginning to travel domestically again. Perhaps more people were fleeing bigger urban areas for smaller cities like Port St. Lucie, FL causing a spike in ridership in October 2020 and beyond. This case seems more reasonable, especially since the ridership in Port St. Lucie, FL after October 2020 maintained this high total ridership fluctuating between 46,000 and 52,000 UPT.

Although in most cases, it is unlikely that ridership would have increased from 2019 to 2020, it is possible that certain, more suburban, areas could have had an increase in ridership. One possibility is that people wanted to relocate from more densely populated cities into smaller cities, when travel became more accessible, causing an increase in ridership in cities like Port St. Lucie, FL.

New York City Transit During COVID-19

Aside from the above three cities, the large majority of US city transit systems experienced a decrease in ridership. Being that New York City has the largest transit system in the United States, the Metropolitan Transit Authority (MTA), I was curious to further analyze the changes the city’s transit system experienced from 2019 to 2020.

ridership_NYC_2019 <- USAGE_clean |>
  filter(year(month) == 2019, Agency == "MTA New York City Transit") |>
  group_by(Mode) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_NYC_2020 <- USAGE_clean |>
  filter(year(month) == 2020, Agency == "MTA New York City Transit") |>
  group_by(Mode) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_NYC_2019_2020 <- left_join(ridership_NYC_2019, ridership_NYC_2020, join_by(Mode)) |>
  rename(total_ridership_2019 = total_ridership.x, total_ridership_2020 = total_ridership.y) |>
  mutate(change_in_ridership = total_ridership_2020-total_ridership_2019, percent_change = round(((change_in_ridership / total_ridership_2019) * 100), digits = 2), decrease = (percent_change < 0))


ridership_NYC_2019_2020 |>
  DT::datatable(rownames = FALSE)

nyc_change <- round((sum(ridership_NYC_2019_2020$change_in_ridership) / sum(ridership_NYC_2019_2020$total_ridership_2019)) * 100, digits = 0)

nyc_change

[1] -56

As expected, all 5 modes of transportation (bus, bus rapid transit, commuter bus, demand response, and heavy rail) contributed to the overall -56% change in ridership in NYC from 2019 to 2020. During this time, many companies transitioned to remote work allowing employees to work from home and all students transitioned to virtual learning, avoiding travel and public transportation. Additionally, the fear of contracting and spreading the virus led more individuals who needed to travel to use personal vehicles rather than public transport. All of these factors contributed to the severe decrease in ridership in NYC’s transit system in 2020.

ridership_by_area_2020 <- USAGE_clean |>
  filter(year(month) == 2020) |>
  group_by(metro_area) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_by_area_2023 <- USAGE_clean |>
  filter(year(month) == 2023) |>
  group_by(metro_area) |>
  summarize(total_ridership = sum(unlinked_passenger_trips)) |>
  ungroup()

ridership_2020_2023 <- left_join(ridership_by_area_2020, ridership_by_area_2023, join_by(metro_area)) |>
  rename(total_ridership_2020 = total_ridership.x, total_ridership_2023 = total_ridership.y) |>
  mutate(change_in_ridership = total_ridership_2023-total_ridership_2020, percent_change = round(((change_in_ridership / total_ridership_2020) * 100), digits = 2), increase = (percent_change >= 0))

ridership_2020_2023 |>
  DT::datatable(rownames = FALSE,
                options = list(pageLength = 5))

percentage_positive <- ridership_2020_2023 |>
  drop_na() |>
  mutate(total = n()) |>
  group_by(increase, total) |>
  summarize(count = n()) |>
  mutate(percent = round((count / total)*100, digits = 0)) |>
  filter(increase == TRUE) |>
  pull(percent)

`summarise()` has grouped output by 'increase'. You can override using the
`.groups` argument.

percentage_positive

[1] 91

Overall, it was interesting to observe the varying effects that COVID-19 had on US transit systems across different metropolitan areas. Each city reacted and adjusted to the global pandemic differently, which led to different patterns of changes in their respective transit systems. Although most transit systems experienced a severe drop in ridership during 2020, it is evident that most transit systems have progressed in the right direction, with approximately 91% of cities demonstrating an increase in ridership from 2020 to 2023.

2022 Financial and Ridership Analysis (Major Transit Systems)

Since the focus of the remaining analysis is on the financials and ridership of the US major transit systems in 2022, the table USAGE must be filtered for data in 2022 and then merged with the FINANCIALS table from earlier. Additionally, to filter for major transit systems, the new table USAGE_AND_FINANCIALS is filtered to only contain data with total unlinked passenger trips greater than 400,000.

USAGE_2022_ANNUAL <- USAGE |>
  filter(year(month) == 2022) |>
  group_by(`NTD ID`, Agency, metro_area, Mode) |>
  summarize(UPT = sum(UPT), VRM = sum(VRM)) |>
  ungroup()

`summarise()` has grouped output by 'NTD ID', 'Agency', 'metro_area'. You can
override using the `.groups` argument.

FINANCIALS <- FINANCIALS |>
  mutate(Mode = case_when(
    Mode == "DR" ~ "Demand Response",
    Mode == "FB" ~ "Ferryboat",
    Mode == "MB" ~ "Bus",
    Mode == "SR" ~ "Streetcar",
    Mode == "TB" ~ "Trolleybus",
    Mode == "VP" ~ "Vanpool",
    Mode == "CB" ~ "Commuter Bus",
    Mode == "RB" ~ "Bus Rapid Transit",
    Mode == "LR" ~ "Light Rail",
    Mode == "YR" ~ "Hybrid Rail",
    Mode == "MG" ~ "Monorail/Automated Guideway Transit",
    Mode == "CR" ~ "Commuter Rail",
    Mode == "AR" ~ "Alaska Railroad",
    Mode == "TR" ~ "Aerial Tramways",
    Mode == "HR" ~ "Heavy Rail",
    Mode == "IP" ~ "Inclined Plane",
    Mode == "PB" ~ "Publico",
    Mode == "CC" ~ "Cable Car",
    TRUE ~ "Unknown"
  ))

USAGE_AND_FINANCIALS <- left_join(USAGE_2022_ANNUAL, 
           FINANCIALS, 
           join_by(`NTD ID`, Mode)) |>
    drop_na()

USAGE_AND_FINANCIALS_major_transit <- USAGE_AND_FINANCIALS |>
  filter(UPT >= 400000)

USAGE_AND_FINANCIALS_major_transit |>
  DT::datatable(rownames = FALSE,
                options = list(pageLength = 5))

Efficiency involves optimizing productivity while minimizing expense. In the below analysis, I take a look at various efficiency metrics to determine which United States transit system was the most efficient in 2022.

Efficiency Analysis:

Below are various metrics that can be used to describe the efficiency of transit systems:

Which transit system (agency and mode) had the most UPT in 2022?
Which transit system (agency and mode) had the highest farebox recovery, defined as the highest ratio of Total Fares to Expenses?
Which transit system (agency and mode) has the lowest expenses per UPT?
Which transit system (agency and mode) has the highest total fares per UPT?
Which transit system (agency and mode) has the lowest expenses per VRM?
Which transit system (agency and mode) has the highest total fares per VRM?

Question #1

transit_system_most_UPT_agency <- USAGE_AND_FINANCIALS_major_transit |>
  slice_max(UPT, n = 1) |>
  pull(Agency)

transit_system_most_UPT_agency

[1] "MTA New York City Transit"

transit_system_most_UPT_mode <- USAGE_AND_FINANCIALS_major_transit |>
  slice_max(UPT, n = 1) |>
  pull(Mode)

transit_system_most_UPT_mode

[1] "Heavy Rail"

transit_system_most_UPT <- USAGE_AND_FINANCIALS_major_transit |>
  slice_max(UPT, n = 1) |>
  pull(UPT)
  
transit_system_most_UPT

[1] 1793073801

In 2022, the transit system with the highest amount of unlinked passenger trips (UPT) of 1.79 billion was the MTA New York City Transit with the mode of transportation of Heavy Rail. Since MTA New York City Transit subway system is the largest in the US, it is clear that they would have the highest UPT among all major US transit systems.

Question #2

highest_farebox_recovery_agency <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(farebox_recovery = `Total Fares` / Expenses) |>
  slice_max(farebox_recovery, n = 1) |>
  pull(Agency)

highest_farebox_recovery_agency

[1] "Port Imperial Ferry Corporation"

highest_farebox_recovery_mode <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(farebox_recovery = `Total Fares` / Expenses) |>
  slice_max(farebox_recovery, n = 1) |>
  pull(Mode)

highest_farebox_recovery_mode

[1] "Ferryboat"

highest_farebox_recovery <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(farebox_recovery = `Total Fares` / Expenses) |>
  slice_max(farebox_recovery, n = 1) |>
  pull(farebox_recovery)

highest_farebox_recovery

[1] 1.428146

The transit system with the highest farebox recovery of 1.43 (ratio of total fares to expenses) in 2022 was Port Imperial Ferry Corporation with the mode of transportation of Ferryboat. It is interesting to see that the transit system with the highest farebox recovery is not a popular agency or mode most people would attribute “public transportation” as. However, it does show that efficient transit systems can exist anywhere in the US on any mode of transit.

Question #3

lowest_expenses_per_UPT_agency <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(expenses_per_UPT = Expenses / UPT) |>
  slice_min(expenses_per_UPT, n = 1) |>
  pull(Agency)

lowest_expenses_per_UPT_agency

[1] "North Carolina State University"

lowest_expenses_per_UPT_mode <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(expenses_per_UPT = Expenses / UPT) |>
  slice_min(expenses_per_UPT, n = 1) |>
  pull(Mode)

lowest_expenses_per_UPT_mode

[1] "Bus"

lowest_expenses_per_UPT <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(expenses_per_UPT = Expenses / UPT) |>
  slice_min(expenses_per_UPT, n = 1) |>
  pull(expenses_per_UPT)

lowest_expenses_per_UPT

[1] 1.17912

The transit system that had the lowest expenses per UPT (1.18) in 2022 was Bus from North Carolina State University. Although we do not have all the information about the cost of university transit, it is likely that such a large university is transporting sizeable amounts of students daily which could contribute to its low expense per UPT.

Question #4

highest_fares_per_UPT_agency <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(fares_per_UPT = `Total Fares` / UPT) |>
  slice_max(fares_per_UPT, n = 1) |>
  pull(Agency)

highest_fares_per_UPT_agency

[1] "Hampton Jitney, Inc."

highest_fares_per_UPT_mode <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(fares_per_UPT = `Total Fares` / UPT) |>
  slice_max(fares_per_UPT, n = 1) |>
  pull(Mode)

highest_fares_per_UPT_mode

[1] "Commuter Bus"

highest_fares_per_UPT <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(fares_per_UPT = `Total Fares` / UPT) |>
  slice_max(fares_per_UPT, n = 1) |>
  pull(fares_per_UPT)

highest_fares_per_UPT

[1] 41.29628

Hampton Jitney, Inc. with the transit mode of Commuter Bus had the highest total fares per UPT of 41.3 in 2022. Since Hampton Jitney, Inc. is a commuter bus company that provides coach bus transportation, charter bus and tour bus options, it is likely that passengers are charged higher fares, leading to a high total fare per UPT.

Question #5

lowest_expenses_per_VRM_agency <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(expenses_per_VRM = Expenses / VRM) |>
  slice_min(expenses_per_VRM, n = 1) |>
  pull(Agency)

lowest_expenses_per_VRM_agency

[1] "Metropolitan Transportation Commission"

lowest_expenses_per_VRM_mode <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(expenses_per_VRM = Expenses / VRM) |>
  slice_min(expenses_per_VRM, n = 1) |>
  pull(Mode)

lowest_expenses_per_VRM_mode

[1] "Vanpool"

lowest_expenses_per_VRM <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(expenses_per_VRM = Expenses / VRM) |>
  slice_min(expenses_per_VRM, n = 1) |>
  pull(expenses_per_VRM)

lowest_expenses_per_VRM

[1] 0.4449998

The transit system with the lowest expenses per vehicle revenue mile (VRM) of 0.44 in 2022 was Vanpool from Metropolitan Transportation Commission. Since Vanpool is a form of transportation similar to rideshares, transporting a significantly less amount of people than typical mass transit in a smaller vehicle, the expenses required to operate and maintain it is likely significantly less. Thus, it is understandable that this mode of transportation would have the lowest expenses per VRM.

Question #6

highest_fares_per_VRM_agency <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(fares_per_VRM = `Total Fares` / VRM) |>
  slice_max(fares_per_VRM, n = 1) |>
  pull(Agency)

highest_fares_per_VRM_agency

[1] "Jacksonville Transportation Authority"

highest_fares_per_VRM_mode <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(fares_per_VRM = `Total Fares` / VRM) |>
  slice_max(fares_per_VRM, n = 1) |>
  pull(Mode)

highest_fares_per_VRM_mode

[1] "Ferryboat"

highest_fares_per_VRM <- USAGE_AND_FINANCIALS_major_transit |>
  mutate(fares_per_VRM = `Total Fares` / VRM) |>
  slice_max(fares_per_VRM, n = 1) |>
  pull(fares_per_VRM)

highest_fares_per_VRM

[1] 157.7002

Lastly, Ferryboat from Jacksonville Transportation Authority had the highest total fares per vehicle revenue mile (VRM) of 157.7 in 2022. The Jacksonville Transportation Authority provides Ferryboat transportation not only for passengers but also charges extra for those who want to transport various vehicles which could contribute to the high fares per VRM.

Conclusion

Transit system efficiency is subjective and can be measured with different metrics (highest UPT, highest farebox recovery, lowest expenses per UPT, highest total fares per UPT, lowest expenses per VRM, and highest total fares per VRM). Based on the above information, using the farebox recovery ratio, I found that the most efficient transit system in the country in 2022 was the Ferryboat from the Port Imperial Ferry Corporation. This transit system was able to maximize its total fares while minimizing their expenses leading to the highest farebox recovery ratio of 1.43.

Overall, the National Transit Database provided a lot of useful and insightful data, allowing for a wide range of analysis. More specifically, it was interesting to explore historical transit data leading to thought-provoking insights as well as the various metrics to determine effective and efficient transportation in the United States. As the world continues to accelerate post-COVID, the major US public transit systems will continue to improve and create more efficient means of transportation for commuters around the nation.