---
title: "04_DataWrangling"
author: "Zofia Baranczuk"
date: "2025-07-31"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## 0. Prepare packages
Run once in the console to install the packages: `install.packages(c("dplyr", "tidyr", "readr", "here", "stringr", "lme4"))`.
```{r}
library(dplyr)    # core data manipulation
library(tidyr)    # reshaping 
library(readr)    # used if we load CSV 
library(here)     # project-safe paths (used when working in a project)
library(stringr)  # practical string ops for filtering, etc.
```


## 1. group_by() and summarise()
For each participant of the sleep study, show their mean reaction time and the standard deviation of the reaction time. 
```{r}
library(lme4) # to have access to sleepstudy data

data("sleepstudy") 
#View(sleepstudy) #to have a look at the data, but comment out when compiling.

summary_sleep <- sleepstudy %>% # the pipe operator: it takes the output of the
  # expression on its left and passes it as the first argument
  # to the function on its right.
  
  # Side note: since R 4.1, base R has its own pipe operator, |>
  # x <- c(1, 3, 4)
  # mean(x) is equivalent to x |> mean()
  
  group_by(Subject) %>% # take the data and group it by Subject
  summarise(
    mean_reaction = mean(Reaction), # then summarise by the mean reaction time
    sd_reaction = sd(Reaction), # and by the standard deviation of the reaction time
    count_by_subject = n() # number of observations in each group
  )

summary_sleep

# Task: Compute the minimum and maximum reaction time per subject.

```
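To make the grouping and piping steps easier to follow, here is a minimal sketch on a small made-up table. The names `toy`, `id` and `value` are hypothetical and not part of the sleep study data. It shows that a nested call and a piped call describe the same computation, and that several summaries can be computed in one `summarise()` call.

```{r}
library(dplyr)

# Hypothetical toy data: three measurements for each of two groups
toy <- tibble(
  id    = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

# The same computation written as a nested call and as a pipe
nested <- summarise(group_by(toy, id), mean_value = mean(value))
piped  <- toy %>%
  group_by(id) %>%
  summarise(mean_value = mean(value))

all.equal(nested, piped) # compare the two results

# Several summaries at once; n() counts the rows in each group
toy_summary <- toy %>%
  group_by(id) %>%
  summarise(
    mean_value = mean(value),
    sd_value   = sd(value),
    n_obs      = n()
  )
toy_summary
```

The result has one row per group, which is the usual shape after `group_by()` followed by `summarise()`.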



## 2. mutate()

```{r}
ms <- sleepstudy %>% 
  mutate(Reaction_rounded = round(Reaction, 0))
# mutate() adds new columns; several can be created in one call.
# Here: reaction time rounded to 0 decimal places.

head(ms)
# Task: Create a new column with z-scores of reaction time:
# (Reaction - mean(Reaction)) / sd(Reaction)
```
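As a small illustration of creating several columns in one `mutate()` call, the sketch below uses made-up data. The names `toy_rt`, `rt_ms`, `rt_s`, `is_slow` and `group_mean_ms` are hypothetical. It also shows that `mutate()` after `group_by()` computes values within each group while keeping every row.

```{r}
library(dplyr)

# Hypothetical toy data: reaction times in milliseconds
toy_rt <- tibble(
  id    = c("a", "a", "b", "b"),
  rt_ms = c(250.4, 310.9, 280.2, 420.7)
)

toy_rt2 <- toy_rt %>%
  mutate(
    rt_s    = rt_ms / 1000, # new column computed from an existing one
    is_slow = rt_s > 0.3    # can already use rt_s, created one line earlier
  ) %>%
  group_by(id) %>%
  mutate(group_mean_ms = mean(rt_ms)) %>% # mean within each id, repeated on every row
  ungroup()                               # drop the grouping again

toy_rt2
```

Unlike `summarise()`, `mutate()` does not reduce the number of rows: the table still has one row per measurement.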


## 3. filter()
Select the observations of subjects with an ID below 311 whose reaction time is below 300 ms.
Multiple conditions in `filter()`, separated by commas, are combined with "and". In R, `&` is "and" (equivalent to the comma here), `|` is "or", `!` is "not", and `xor()` is exclusive or.
```{r}
(fast <- filter(sleepstudy, as.numeric(as.character(Subject)) < 311, Reaction < 300))
# The outer parentheses make R print the result of the assignment.
# as.numeric(as.character(Subject)): Subject is a factor, so it cannot be
# compared with a number directly; convert its labels to character, then to numeric.
# Task: 1. Check what happens if we use only as.numeric(Subject).
# 2. Find rows where Reaction > 400 OR Days > 5.

```
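The logical operators described above can be tried out on a small made-up table. This is only a sketch; `toy_f`, `x` and `g` are hypothetical names.

```{r}
library(dplyr)

# Hypothetical toy data
toy_f <- tibble(
  id = 1:6,
  x  = c(5, 15, 25, 35, 45, 55),
  g  = c("a", "b", "a", "b", "a", "b")
)

both_comma  <- filter(toy_f, x > 10, g == "a")      # comma: both conditions must hold
both_amp    <- filter(toy_f, x > 10 & g == "a")     # & gives the same rows as the comma
either      <- filter(toy_f, x > 40 | g == "a")     # at least one condition holds
negated     <- filter(toy_f, !(g == "a"))           # condition does not hold
exactly_one <- filter(toy_f, xor(x > 40, g == "a")) # one condition holds, but not both

either
exactly_one
```

Comparing `either` with `exactly_one` shows the difference between `|` and `xor()`: the row where both conditions are true is kept by `|` and dropped by `xor()`.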



## 4. Pivoting
Data is often organised to facilitate some use other than analysis, for example to make data entry as easy as possible. A typical case is the wide format (one column per year or time point) versus the long format (one row per observation). `pivot_longer()` converts wide data to long.
```{r}
GDP <- read_csv(here("Data", "GDP.csv"))
GDP_recent <- GDP %>% select(`Country Code`, `Country Name`, `2022`, `2023`,`2024`)
#View(GDP_recent)

GDP_long <- GDP_recent %>% pivot_longer(c(`2022`, `2023`, `2024`), 
                                        names_to = "year", values_to = "GDP")
#View(GDP_long)
```

`pivot_wider()` is the opposite of `pivot_longer()`: it spreads the values of one column across several new columns. Let's see an example with the sleep study again.

```{r}
sleep_wide <- sleepstudy %>% 
  pivot_wider(names_from = Days, values_from = Reaction)
sleep_wide
#View(sleepstudy)
#View(sleep_wide)
```
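Because the GDP example depends on a local CSV file, the following sketch repeats both pivoting steps on a tiny made-up table that can be run anywhere. The names `wide_toy`, `long_toy`, `country` and `value` are hypothetical, and the numbers are invented.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per year
wide_toy <- tibble(
  country = c("A", "B"),
  `2022`  = c(1.0, 2.0),
  `2023`  = c(1.5, 2.5)
)

# Wide to long: the year columns become one row per country and year
long_toy <- wide_toy %>%
  pivot_longer(c(`2022`, `2023`), names_to = "year", values_to = "value")
long_toy # note: year is stored as character, because it comes from column names

# Long to wide: back to one column per year
wide_again <- long_toy %>%
  pivot_wider(names_from = year, values_from = value)

all.equal(wide_toy, wide_again) # check that the round trip restores the table
```

If the year is needed as a number in the long table, it can be converted afterwards, for example with `mutate(year = as.integer(year))`.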

## 5. Merging

Data often comes in different files (e.g. behavioral and MRI data) that need to be merged by ID. Below is an example with countries.

- `inner_join()`: intersection, keep only rows with a match in both tables
- `left_join()`: keep all rows from the left table
- `full_join()`: keep all rows from both tables (fill with NAs)
- `anti_join()`: rows in x with no match in y
```{r}
country <- read_csv(here("Data", "Country_region.csv"))

GDP_regions <- left_join(GDP_recent, country)
#View(GDP_regions) # Let's say we are interested only in the GDP of countries, not of whole regions.
# So we want to remove the rows that represent regions. We can recognise them by NA in the Region column.
GDP_regionsClean <- GDP_regions %>% filter(!is.na(Region))
#View(GDP_regionsClean)


summary_GDP <- GDP_regionsClean %>%
  drop_na(`2023`) %>%
  group_by(Region) %>%
  summarise(mean_2023 = mean(`2023`), median_2023 = median(`2023`),
            min_2023 = min(`2023`), max_2023 = max(`2023`),
            country_min = `Country Name`[which.min(`2023`)],
            country_max = `Country Name`[which.max(`2023`)],
            n_non_na = n())

summary_GDP

# Task: Use anti_join() to check if some Country Codes appear in country but not GDP_recent.

```
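The four join types listed at the start of this section can be compared side by side on two small made-up tables. This is a sketch only: `gdp_toy`, `region_toy`, `code`, `gdp` and `region` are hypothetical names and the values are invented. The key column is given explicitly with `by`, so the join does not depend on which column names the two tables happen to share.

```{r}
library(dplyr)

# Hypothetical toy tables sharing the key column `code`
gdp_toy    <- tibble(code = c("AAA", "BBB", "CCC"), gdp = c(10, 20, 30))
region_toy <- tibble(code = c("BBB", "CCC", "DDD"),
                     region = c("North", "South", "East"))

inner <- inner_join(gdp_toy, region_toy, by = "code") # only codes present in both tables
left  <- left_join(gdp_toy, region_toy, by = "code")  # all rows of gdp_toy, region NA if unmatched
full  <- full_join(gdp_toy, region_toy, by = "code")  # all codes from both tables
anti  <- anti_join(gdp_toy, region_toy, by = "code")  # rows of gdp_toy without a match

inner
left
full
anti
```

Swapping the two arguments of `anti_join()` answers the reverse question, namely which rows of the second table have no match in the first.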