---
title: "MDS"
output:
  rmarkdown::html_vignette:
    toc: true
    number_sections: true

bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{MDS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  fig.height = 6, fig.width = 7,
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(biplotEZ)
```

This vignette deals with biplots based on Multidimensional scaling (MDS).

# What is MDS

In general, multidimensional scaling deals with constructing a low dimensional map of $n$ samples such that the Euclidean distances between the samples match a given set of dissimilarities $\mathbf{\Delta}:n \times n$ as closely as possible. Here the focus is on Principal Coordinate Analysis (PCO), an approach based on the singular value decomposition of a matrix. However, the regression biplot provides a general structure for fitting any 2D map of samples with biplot axes.

# Regression biplot

The function `regress` accepts as arguments an object of class `biplot`. The call to the function `biplot` should contain the data set that will be used to construct the biplot axes. The second argument `Z` contains the coordinates of the samples in the MDS map. By default linear regression axes will be fitted to the plot. If $\mathbf{X}$ denote the data matrix, the directions of the biplot axes are found by solving the regression equation

$$
\mathbf{X} = \mathbf{ZH}' + \mathbf{E}.
$$
The matrix $\mathbf{H}'=(\mathbf{Z'Z})^{-1}\mathbf{Z'X}$ and calibration of the axes proceed analogous to equation (2) in the biplotEZ vignette. The coordinates of the marker $/mu$ on biplot axis $j$ is given as follows

$$
p_{\mu} =  \frac{\mu}{\mathbf{h}_{(j)}'\mathbf{h}_{(j)}}\mathbf{h}_{(j)}.
$$

```{r}
biplot(rock) |> 
  regress(Z = MASS::sammon(dist(scale(rock), method="manhattan"))$points) |> 
  plot()
```

More flexibility is provided by approximating the biplot axes using B-splines. A calibrated trajectory represented in a matrix  $\mathbf{H}:m \times 2$ snakes through the samples points $\mathbf{Z}$ such that the marker on the trajectory which is closest to a particular sample gives the predicted value of that sample for the particular variable. To retain smoothness of the trajectory $\mathbf{H}$, B-splines are used. 

```{r}
biplot(rock) |> 
  regress(Z = MASS::sammon(dist(scale(rock), method="manhattan"))$points,
          axes = "splines") |> 
  plot()
```


# Principal Coordinate Analysis (PCO) biplots

Let $\mathbf{\tilde\Delta}$ be the $n \times n$ matrix containing the values $-\frac{1}{2}\delta_{ij}^2$ where $\delta_{ij}$ represent the dissimilarity between objects $i$ and $j$. If it is possible to find a set of coordinates $\mathbf{Y}:n\times m$ where typically $m = n-1$ such that the Euclidean distances between the rows of $\mathbf{Y}$ exactly match the dissimilarities $\delta_{ij}$, the dissimilarity is known as a Euclidean embeddable distance metric. @Gower1982 shows that if the distances $\delta_{ij}$ are Euclidean embeddable, then

$$
(\mathbf{I}-\mathbf{1s}')\mathbf{\tilde\Delta} (\mathbf{I}-\mathbf{s1}')
$$

is positive semi-definite.

The Euclidean representation of the samples is obtained as $\mathbf{Y=V\Lambda}^{\frac{1}{2}}$ where $(\mathbf{I}-\mathbf{1s}')\mathbf{\tilde\Delta} (\mathbf{I}-\mathbf{s1}') = \mathbf{V \Lambda V'}$. Since the coordinates in $\mathbf{Y}$ is already referred to their principal axes with $\mathbf{Y'Y=\Lambda}$, the representation of the samples in a 2D biplot is obtained from the first two columns of $\mathbf{Y}$.

In addition to the exact biplot axis representations discussed in @UB2011 approximate axes can be obtained. Linear axes are fitted with the regression method. The variables in the data matrix $\mathbf{X}:n \times p$ can be represented as biplot axes in the PCO biplot with the sample points $\mathbf{Z=V\Lambda}^{\frac{1}{2}}\mathbf{J}_2$ according to the regression method discussed in section 2 above.

```{r}
biplot(rock, scale = TRUE) |> PCO() |> plot()
```

Using B-splines instead of linear regression provides the user with more flexibility. This is achieved by setting the argument `axes = "splines"`.

```{r}
biplot(rock, scale = TRUE) |> PCO(axes = "splines") |> plot()
```

By default the distance metric used for the analysis is Euclidean distance. The user can also specify the distance matrix `Dmat` as an $n \times n$ matrix of a `dist` object. As illustration of a metric that is Euclidean embeddable, the Clark's distance

$$
\delta_{ij}^2 = \sum_{k=1}^{p}\left(\frac{x_{ik}-x_{jk}}{x_{ik}+x_{jk}}\right)^2
$$

as defined by @GowerRoger2005 is calculated below and used for constructing a PCO biplot. Note that the metric is only defined for strictly positive values, so that the data is scaled to values between $1$ and $2$.

```{r}
Clark.dist <- function(X)
{
  n <- nrow(X)
  p <- ncol(X)
  Dmat <- matrix (0, nrow=n, ncol=n)
  for (i in 1:(n-1))
    for (j in (i+1):n)
      Dmat[i,j] <- sum(((X[i,] - X[j,])/(X[i,] + X[j,]))^2)
  sqrt(Dmat + t(Dmat))
}
my.data <- scale(rock, center=apply(rock,2,min), scale=diff(apply(rock,2,range)))+1
biplot(rock) |> PCO(Dmat = Clark.dist (rock), axes = "splines") |> plot()
```

Alternatively, the user can specify a function which computes a distance matrix or `dist` object. The Manhattan distance is not Euclidean embeddable, but the square root of the distance is. The function `sqrtManhattan` is included as an example of a function computing a Euclidean embeddable `dist` object.

```{r}
sqrtManhattan
biplot(rock, scaled = TRUE) |> PCO(dist.func = sqrtManhattan) |> plot()
```


# References