---
title: "Team Payroll and the World Series"
author: "Michael Friendly"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Team Payroll and the World Series}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r nomessages, echo = FALSE}
# set some default options for chunks
knitr::opts_chunk$set(
  warning = FALSE,   # avoid warnings and messages in the output
  message = FALSE,
  collapse = TRUE,   # collapse all output into a single block
  tidy = FALSE,      # don't tidy our code-- assume we do it ourselves
  fig.height = 5,
  fig.width = 5
)
options(digits=4)    # number of digits to display in output; can override with chunk option R.options=list(digits=)
par(mar=c(3,3,1,1)+.1)

set.seed(1234)       # reproducibility
```

```{r load-packages, echo=FALSE}
library(Lahman)   # Load additional packages here 
library(ggplot2)
library(dplyr)
```

This vignette examines whether there is a relationship between total team salaries (payroll) and World Series success.
It was inspired by Nolan & Lang (2015), "Baseball: Exploring Data in a Relational Database", 
Chapter 10 in [*Data Science in R*](https://www.routledge.com/Data-Science-in-R-A-Case-Studies-Approach-to-Computational-Reasoning-and-Problem-Solving/Nolan-Lang/p/book/9781482234817). They use `SQL` on the raw Lahman files `.csv`, rather
than the Lahman package.

Here, We largely use `dplyr` for data munging and `ggplot2` for plotting.
In the process, we discover a few errors in the data sets.

## The data files

Start with loading the files we will use here.  We do some pre-processing to make them more convenient for the
analyses done later.

### The `Salaries` data

The `Salaries` data.frame contains data on all players' salaries from 1985-2016 in the latest release, v. `r packageVersion("Lahman")`,
of the `Lahman` package.  (Additions to the `Salaries` table in San Lahman's database were discontinued after 2016.)
We use the `sample_n` function to display a random sample of observations.
```{r}
data("Salaries", package="Lahman")
str(Salaries)
sample_n(Salaries, 10)
```

### The `Teams` data

The `Teams` data.frame contains a lot of information about all teams that have ever played, with a separate observation for each year.
Here, we will mainly use this to get the team name (`team`) from `teamID` and also to get the information about World Series winners.

```{r Teams-names}
data("Teams", package="Lahman")
dim(Teams)
names(Teams)
```

We are only going to use the observations from 1985 on, and a few variables, so we filter and select them now.
Keep only the levels of `teamID` in the data.

```{r Teams-filter}
Teams <- Teams %>%
  select(yearID, lgID, teamID, name, divID, Rank, WSWin, attendance) %>%
  filter(yearID >= 1985) %>%
  mutate(teamID = droplevels(teamID))

sample_n(Teams, 10)
```

### The `SeriesPost` data

Post season records go back to 1884. There are `r nrow(Lahman::SeriesPost)` observations
covering all aspects of post-season play.

```{r SeriesPost-names}
data("SeriesPost", package="Lahman")
names(SeriesPost)
```

For each year, there are number of observations for the various levels of post-season play
(Division titles, League titles, etc. A number of these designations have changed over the years, and
I don't know what they all mean.)

```{r}
table(SeriesPost$round)
```

We are interested only in the World Series (`WS`), which was first played in 1903.
We filter for the years for which we have salary data, and drop a couple of variables.
The league IDs of the winner and loser are factors, so we use `droplevels` to
include only the levels in recent history.

```{r filter-WS}
WS <- SeriesPost %>%
  filter(yearID >= 1985 & round == "WS") %>%
  select(-ties, -round) %>%
  mutate(lgIDloser = droplevels(lgIDloser),
         lgIDwinner = droplevels(lgIDwinner))
dim(WS)
sample_n(WS, 6)
```


## A first look at `Salaries`

How many players do we have in each year?
```{r}
table(Salaries$yearID)
```

What is the range of salaries, across all years?
```{r}
range(Salaries$salary)
```

And, year by year?
```{r}
Salaries %>% 
  group_by(yearID) %>% 
  summarise(min=min(salary),
            max=max(salary))
```
Hmm, there is a `salary==0` in 1993, maybe there are others.

```{r}
which(Salaries$salary==0)
```
Who are they? (We could also look up their `playerID`s in `Lahman::People`.)
```{r which-zero}
Salaries[which(Salaries$salary==0),]
```
These must be errors. Get rid of them. **Reminder**: Check further; maybe file an 
[issue](https://github.com/cdalzell/Lahman/issues)
in the Lahman package!

```{r filter-salaries}
Salaries <- Salaries %>%
    filter(salary !=0)
```


### Get team payrolls

We want to sum the `salary` for each team for each year. We might as well make it in millions.
All those zeros hurt my eyes.


```{r summarise-salaries}
payroll <- Salaries %>%
                group_by(teamID, yearID) %>%
                summarise(payroll = sum(salary)/1000000)
head(payroll)
```

### Merge team names into `payroll`

It will be more convenient to have the team names included in the `payroll` data.frame. 
The `Teams` data frame also contains the `Y/N` indicator `WSWin` for World Series winners,
so we might as well include this too.

```{r merge-teams}
payroll <- merge(payroll, Teams[,c("yearID", "teamID","name", "WSWin")], 
                 by=c("yearID", "teamID")) 
sample_n(payroll, 10)
```

Note that we could also do this using `left_join` in the `dplyr` package.  There is probably a more `tidy` way to 
subset the variables from the `Teams` data set than using `Teams[, c()]`, but, hey-- this works.

```{r eval=FALSE}
left_join(payroll, Teams[,c("yearID", "teamID","name", "WSWin")], 
          by=c("yearID", "teamID")) %>% 
  sample_n(10)
```

`WSWin` is a character variable.  Convert it to a factor.

```{r make-WSWin-factor}
payroll <- payroll %>%
  mutate(WSWin = factor(WSWin))
```

Check the values:

```{r}
table(payroll$WSWin, useNA="ifany")
```
There is something wrong here! There shouldn't be any `NA`s.  We leave this for further study, and another
**Reminder** to file an issue if we figure out what the problem is.

## Boxplots of payroll

Let's look at the distributions of payroll by year.  The observations are teams.
```{r payroll-boxplot, fig.width=7}
boxplot(payroll ~ yearID, data=payroll, ylab="Payroll ($ millions)")
```
What are the outliers?  Are there any teams that crop up repeatedly?
`car::Boxplot` makes this easy, and also returns the labels of the outliers.
We don't load the `car` package, because `car` also contains a `Salary` dataset.

```{r, payroll-Boxplot, fig.width=7}
out <- car::Boxplot(payroll ~ yearID, data=payroll,
             id=list(n=1, 
                     labels=as.character(payroll$teamID)), 
             ylab="Payroll ($ millions)")
```

Most of the outliers are the New York Yankees (`NYA`):
```{r}
table(out)
```


Payroll has obviously increased dramatically over time.  So has the variability across teams. For any modelling,
we would probably want to use `\log(payroll)`.  We might also want to look separately at the
American and National leagues.

### Correcting for inflation
For proper comparisons, we should correct for inflation. Lets do this by scaling salary
back to 1985 dollars,
The data below gives inflation rates for all subsequent years. It comes from Nolan & Lang, extended to 2015
using (https://www.in2013dollars.com/).

```{r}
inflation = c(1,    1.02, 1.06, 1.10, 1.15, 1.21, 
              1.27, 1.30, 1.34, 1.38, 1.42, 1.46, 1.49, 1.51, 1.55, 1.60,
              1.65, 1.67, 1.71, 1.76, 1.82, 1.87, 1.93, 2.00, 1.99, 2.03,
              2.09, 2.13, 2.16, 2.20, 2.20 )

inflation.df <- data.frame(year=1985:2015, inflation)

# plot inflation rate
ggplot(inflation.df, aes(y=inflation, x=year)) +
  geom_point() +
  geom_line() +
  geom_smooth(method="lm")
  
```

This is close enough to linear, that we could use the linear regression predicted
value as a simple computation of the inflation rate.
(A better way, of course, would be to use the actual inflation rate; this would entail
merging `payroll` with `inflation.df` by year, and doing the computation.)

```{r}
infl.lm <- lm(inflation ~ year, data=inflation.df)
(coefs <- coef(infl.lm))
```

Scale `payroll` by dividing by linear prediction of inflation rate, producing `payrollStd`.
```{r}
payroll <- payroll %>%
  mutate(payrollStd = payroll / (coefs[1] + coefs[2] * yearID))
```

Boxplot again, of inflation-adjusted payroll. The increase after 2000 doesn't seem so
large.

```{r, payroll-Boxplot2, fig.width=7}
car::Boxplot(payrollStd ~ yearID, data=payroll,
             id = list(labels=as.character(payroll$teamID)), 
             ylab="Payroll (1985-adjusted $ millions)")
```

## Salaries of World Series winning teams

To what extent are the World Series winners those among the highest in payroll?
A simple way to look at this is to plot the team payrolls across years, and mark
the World Series winner for each year.

This plot shows inflation-adjusted payroll on a log scale to avoid the dominating
influence of the most recent years.  We jitter the points to avoid overplotting,
and use a transparent gray color for the non-winners, red for the winner in each
year.

```{r, payroll-winners-plot, fig.width=7}
Cols <- ifelse(payroll$WSWin=='Y', "red", gray(.7, alpha=0.5))
with(payroll, {
  plot(payrollStd ~ jitter(yearID, 0.5), 
       ylab = "Payroll (inflation-adjusted $ millions)",
       ylim = c(5,125), log = "y",
       xlab = "Year",
       pch = 19, cex = 0.8, col = Cols)
})
with(payroll[payroll$WSWin == 'Y',],
     text(y = payrollStd, x = yearID, labels = teamID, pos = 3, cex = 0.8) )
```

By and large, the World Series winners tend to be in the upper portion of the payrolls for each year.


## Further analyses: Your turn

Here are some questions to provoke further analyses of these data sets.  If you find something interesting, post it in a [Github Gist](https://gist.github.com/) or forward it to Team Lahman as in a [Lahman issue](https://github.com/cdalzell/Lahman/issues).

* Our boxplots used total payroll in millions.  Would the plots be clearer if we plotted payroll on a log scale?

* Follow-up the suggestion to fit a linear model predicting `log(payroll)` from
some of the available predictors.

* Examine the trend of team salaries over time for World Series losers. Does it look the same or different than that for the winning teams?

* What kind of analysis would you do to address this statement: "For teams that make it to the World series, the one with the larger payroll is more likely to win."

* Is there any relationship between the number of team wins in a season and winning the World series?