igomeza

Exploratory Analysis of Air Quality in New York City (1973)

This project presents a detailed Exploratory Data Analysis (EDA) of the R airquality dataset, which contains daily measurements of air quality in New York City between May and September 1973. The main objective is to understand the distribution, seasonal trends and relationships between key variables such as ozone, solar radiation, wind speed and temperature.

🎯 Project objectives

🗃️ Dataset: airquality is a built-in R dataset of daily air quality measurements in New York. It has 153 rows and six variables: Ozone (ppb), Solar.R (solar radiation, langleys), Wind (wind speed, mph), Temp (temperature, °F), Month and Day.

Ozone data were obtained from the New York State Department of Conservation, and meteorological data were obtained from the National Weather Service.

Loading and initial inspection

install.packages("dplyr") #for data manipulation and transformation
install.packages("tidyr") #for data cleaning and restructuring
install.packages("ggplot2") #for the creation of high-quality visualizations
library(dplyr)
library(tidyr)
library(ggplot2)

data_air <- airquality #assign the dataset in a variable

dim(data_air) #dataset size

head(data_air) #first rows of the data

summary(data_air) #summary of variables in data

Key findings by variable:

Data cleaning: we checked for duplicate rows and defined a strategy for missing values (NAs). We removed every row that contains at least one NA (na.omit()) so that the correlation analysis runs on complete cases, which yields a clean data frame (data_air_clean) with 111 observations.

num_duplicated <- sum(duplicated(data_air)) #verify duplicated rows
print(num_duplicated)

This dataset has no rows with duplicate data.

colSums(is.na(data_air)) #Counting NA's by column

Removal of entire rows (na.omit()):

data_air_clean <- na.omit(data_air) # Remove all rows containing at least one NA

dim(data_air_clean) # Check dimensions of new data frame

A new dataframe called data_air_clean was created. Out of 153 original rows, we are left with 111 rows. That’s a loss of 42 rows (37 from Ozone + 5 additional rows where Solar.R had NA and Ozone did not).
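
The row loss described above can be cross-checked directly from the NA masks. The following is a minimal base R sketch; the names na_ozone, na_solar and na_breakdown are illustrative and do not appear in the original script.

```r
# Cross-check of the row loss reported above, using only base R.
# na_ozone, na_solar and na_breakdown are hypothetical names for this sketch.
na_ozone <- is.na(airquality$Ozone)
na_solar <- is.na(airquality$Solar.R)

na_breakdown <- c(
  ozone_missing      = sum(na_ozone),              # rows where Ozone is NA
  only_solar_missing = sum(na_solar & !na_ozone),  # Solar.R is NA, Ozone is present
  any_missing        = sum(na_ozone | na_solar),   # rows with an NA in either column
  complete_rows      = sum(complete.cases(airquality))  # rows kept by na.omit()
)
print(na_breakdown)
```

Only Ozone and Solar.R contain NAs in this dataset, so any_missing matches the number of rows that na.omit() drops.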

colSums(is.na(data_air_clean)) # Check again for presence of NA's to confirm cleanup
summary(data_air_clean) # Verify again the summary 

Descriptive Statistics

numeric_cols_clean <- c("Ozone", "Solar.R", "Wind", "Temp") # Define the numerical columns of interest
descriptive_stats_clean <- data_air_clean %>% # Calculate the descriptive statistics for each column of the clean dataframe.
  select(all_of(numeric_cols_clean)) %>% # Select only the numerical columns that interest us
  summarise( # Summarize each column
    # Ozone
    Ozone_Mean = mean(Ozone, na.rm = TRUE),
    Ozone_Median = median(Ozone, na.rm = TRUE),
    Ozone_SD = sd(Ozone, na.rm = TRUE),
    Ozone_N = n(),
    
    # Solar radiation
    SolarR_Mean = mean(Solar.R, na.rm = TRUE),
    SolarR_Median = median(Solar.R, na.rm = TRUE),
    SolarR_SD = sd(Solar.R, na.rm = TRUE),
    SolarR_N = n(),

    # Wind
    Wind_Mean = mean(Wind, na.rm = TRUE),
    Wind_Median = median(Wind, na.rm = TRUE),
    Wind_SD = sd(Wind, na.rm = TRUE),
    Wind_N = n(),

    # Temperature
    Temp_Mean = mean(Temp, na.rm = TRUE),
    Temp_Median = median(Temp, na.rm = TRUE),
    Temp_SD = sd(Temp, na.rm = TRUE),
    Temp_N = n()
  ) %>%
  # Use pivot_longer to transform the table from width to length
  pivot_longer(
    cols = everything(), # Select all columns
    names_to = c("Variable", ".value"), # Split the names in ‘Variable’ and the type of statistic
    names_pattern = "(.+)_(Mean|Median|SD|N)" # regex pattern to extract the variable and the statistic
  )

print(descriptive_stats_clean) # Show the resulting table
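
The reshaping step relies on the special ".value" entry in names_to, which can be easier to follow on a toy input. This sketch assumes tidyr is installed; the data frame wide_demo and its values are made up for illustration only.

```r
library(tidyr)

# Hypothetical one-row summary table with names of the form <Variable>_<Statistic>.
wide_demo <- data.frame(A_Mean = 1, A_SD = 2, B_Mean = 3, B_SD = 4)

# The first regex group becomes the "Variable" column; ".value" tells
# pivot_longer() to turn the second group (Mean, SD) into output columns.
long_demo <- pivot_longer(
  wide_demo,
  cols = everything(),
  names_to = c("Variable", ".value"),
  names_pattern = "(.+)_(Mean|SD)"
)
print(long_demo)
```

The result has one row per variable (A, B) and one column per statistic, which is the same layout the descriptive statistics table uses.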

Exploratory Data Analysis (EDA)

Ozone vs. temperature: a scatter plot showing a positive correlation; ozone tends to increase as temperature rises.

ggplot(data_air_clean, aes(x = Temp, y = Ozone)) +
  geom_point(alpha = 0.6, color = "darkblue") + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear regression line
  labs(title = "Ozone vs. Temperature",
       x = "Temperature (°F)",
       y = "Ozone (ppb)") +
  theme_minimal()

Figure 1: Ozone vs. temperature
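
The red line drawn by geom_smooth(method = "lm") corresponds to an ordinary least squares fit of Ozone on Temp, so its slope can be inspected numerically. This is a minimal sketch, not part of the original analysis; aq_complete and temp_fit are illustrative names.

```r
# Fit the same simple linear model that geom_smooth(method = "lm") draws.
# aq_complete and temp_fit are hypothetical names for this sketch.
aq_complete <- na.omit(airquality)
temp_fit <- lm(Ozone ~ Temp, data = aq_complete)

print(coef(temp_fit))               # intercept and slope (ppb per °F)
print(summary(temp_fit)$r.squared)  # share of ozone variance explained by Temp
```

A positive Temp coefficient is the numeric counterpart of the upward-sloping line in the plot.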

Ozone vs. wind: a scatter plot showing a negative correlation; ozone tends to decrease as wind speed increases, which is consistent with wind dispersing pollutants.

ggplot(data_air_clean, aes(x = Wind, y = Ozone)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Ozone vs. wind speed",
       x = "Wind speed (mph)",
       y = "Ozone (ppb)") +
  theme_minimal()

Figure 2: Ozone vs. wind speed

Ozone vs. solar radiation: a scatter plot indicating a positive relationship; more solar radiation is associated with higher ozone concentrations, since sunlight is a key factor in ozone formation.

ggplot(data_air_clean, aes(x = Solar.R, y = Ozone)) +
  geom_point(alpha = 0.6, color = "orange") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Ozone vs. Solar radiation",
       x = "Solar radiation (langleys)",
       y = "Ozone (ppb)") +
  theme_minimal()

Figure 3: Ozone vs. solar radiation
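
The direction of the three relationships shown in the scatter plots can be summarized with Pearson correlation coefficients. The sketch below uses base R on the complete cases; aq_complete and ozone_cor are illustrative names that do not appear in the original script.

```r
# Pearson correlations between Ozone and the three weather variables,
# computed on complete cases. Names are hypothetical for this sketch.
aq_complete <- na.omit(airquality)
ozone_cor <- cor(aq_complete[, c("Ozone", "Temp", "Wind", "Solar.R")])["Ozone", ]

print(round(ozone_cor, 2))  # sign gives direction, magnitude gives strength
```

Positive values for Temp and Solar.R and a negative value for Wind match the slopes of the fitted lines in the plots.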

Average ozone per month: a line graph showing a marked peak in average ozone levels during July and August.

data_air_clean %>%
  group_by(Month) %>%
  summarise(Avg_Ozone = mean(Ozone, na.rm = TRUE)) %>%
  ggplot(aes(x = Month, y = Avg_Ozone)) +
  geom_line(color = "purple", linewidth = 1.2) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(color = "purple", size = 3) +
  labs(title = "Ozone average per month",
       x = "Month",
       y = "Ozone average (ppb)") +
  scale_x_continuous(breaks = 5:9, labels = c("May", "June", "July", "August", "September")) +
  theme_minimal()

Figure 4: Average ozone per month
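
The monthly averages behind the line graph can also be printed as a table, which makes the July and August peak easy to verify. This is a minimal base R sketch; aq_complete and monthly_ozone are illustrative names.

```r
# Mean ozone per month on complete cases, sorted from highest to lowest.
# aq_complete and monthly_ozone are hypothetical names for this sketch.
aq_complete <- na.omit(airquality)
monthly_ozone <- aggregate(Ozone ~ Month, data = aq_complete, FUN = mean)
monthly_ozone <- monthly_ozone[order(monthly_ozone$Ozone, decreasing = TRUE), ]

print(monthly_ozone)  # Month is numeric: 5 = May, ..., 9 = September
```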

Monthly ozone distribution: boxplots of ozone by month, confirming higher levels and greater variability in summer, and the presence of outliers.

ggplot(data_air_clean, aes(x = factor(Month), y = Ozone, fill = factor(Month))) +
  geom_boxplot(na.rm = TRUE) +
  labs(title = "Ozone distribution by month",
       x = "Month",
       y = "Ozone (ppb)") +  
  scale_x_discrete(labels = c("May", "June", "July", "August", "September")) +
  theme_minimal() +
  guides(fill = "none")

Figure 5: Ozone distribution by month
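
The variability and outliers visible in the boxplots can be quantified per month. The sketch below uses base R; the names ozone_by_month, monthly_spread and monthly_outliers are illustrative. Note that boxplot.stats() applies the same 1.5 × IQR rule as geom_boxplot(), but it computes hinges slightly differently from the quartiles ggplot2 uses, so the flagged points may not match the plot exactly.

```r
# Spread and outliers of ozone by month, computed on complete cases.
# All names below are hypothetical for this sketch.
aq_complete <- na.omit(airquality)
ozone_by_month <- split(aq_complete$Ozone, aq_complete$Month)

# Standard deviation and interquartile range for each month (2 x 5 matrix).
monthly_spread <- sapply(ozone_by_month, function(x) c(SD = sd(x), IQR = IQR(x)))

# Points beyond 1.5 * IQR from the hinges, per month (may be empty).
monthly_outliers <- lapply(ozone_by_month, function(x) boxplot.stats(x)$out)

print(round(monthly_spread, 1))
print(monthly_outliers)
```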

📈 Conclusions and Key Findings:

📄 References:


Developed by: Gómez-Alonso, I.S. Date: June 13, 2025