Some math. We have some data $m_1,m_2,m_3,\ldots, m_n$ that are $iid$
$F$. Suppose $x_1,x_2,\ldots, x_k$ are the *unique* $k$ quantiles of the
data. Note that $k \le n$ because if the data is discrete in some regions of
$F$, the quantiles might be the same. I use the word quantiles but actually
mean the values $x_i = argmin_x P(X \le x) \ge k/M = F_n^{-1}(k/M)$ where $M$
could 100,1000, 2000 etc.

Then define $S = { 0,1,2,\ldots, M}$ and $x^*_i = \{s \in S: x_i = F_n^{-1}(s/M)\}$. Let’s take an example.

Consider

```
x = runif(100)
```

then the 10 unique quantiles corresponding to $1/10,\ldots,10/10$ are given by
`unique(quantile(x))`

which for this case is going to be `quantile(x)`

. They are
all unique. If $x_1$ be the first quantile (corresponding to $1/10$) then $x^*_1
= \{ 1 \}$.

Consider the following data (R code)

```
x = sample(c(1,2,3),100, replace=TRUE)
```

Then, the unique quantiles are `unique(quantile(x,1:10/10))`

which is

```
[1] 1 2 3
```

and then

\begin{align*}
x^*_1 &= \{s \in 1\ldots 10 : x_1 = F^{-1}_n( s/10) \} = \{1,2,3\} \\
x^*_2 &= \{s \in 1\ldots 10 : x_2 = F^{-1}_n( s/10) \} = \{4,5,6\} \\
x^*_3 &= \{s \in 1\ldots 10 : x_3 = F^{-1}_n( s/10) \} = \{7,8,9,10\}
\end{align*}

So what is this all about? Consider a sample from the Normal(0,1) distribution, and compute it’s ECDF $F_n$, then given another sample $Y$ from the Normal(0,1) distribution, $F_n(Y)$ is going to be uniformly distributed on $[0,1]$. Hence the average of this is going to be 0.5. By comparing the expectations of $E(F_n(Y’))$ for different distributions we can compare them to the baseline distribution i.e. Normal(0,1) (in this case). Technically, the theorem is not a n&s one, so to be precise if the expectation is not half, then $Y’$ is not distributed as $F$, but if the expectation is half, well, it doesn’t imply that $Y’$ is distributed as $F$.

It is equivalent to use the indices in the set S (defined above) as opposed the probabilities from $F_n$. But how do we handle the discrete case (which will have duplicates in the quantiles)?

Once again, consider the unique quantiles $x_1,x_2,\ldots,x_k$ and add to it
$x_0 = -\infty$ and $x_{k+1}=\infty$. Then let
$G=$ `findInterval(.,all.inside=TRUE)`

where $\text{findInterval}$ is the
R function.

\begin{align}
G(y,\{x_0,x_1,\ldots,x_{k+1}\}) = j+1 \qquad x_j \le y \lt x_{j+1}
\end{align}

and

\begin{align*}
I(y) = \text{Discrete Uniform Sample of Size 1 from } x^*_{G(y) - 1 }
\end{align*}

and $x^*_0$ is the set containing only 0. Also observe $\cup_{i=1}^k x^*_i = \{1,2,\ldots,M\}$ .
So consider again the continuous distribution: take a sample, compute the unique quantiles $x_i$ and the sets $x^*_i$ . Note, that
$|x^*_i| = 1$.If we apply $I$ to a new sample $y_j$ from the same distribution,
we will obtain roughly a uniform distribution on the numbers $1,2,\ldots, M$.
Consider the discrete case, for which the cardinality of some of the $x^*_i$ is greater than
1. In this case we randomly sample from the set $I(y_j)$, so if the sample size of
$y$ is very large, we will be uniformly sampling across $\\{1,2,\ldots, M\\}$.
If the distribution of $y$ is say to the right of $x$, then the distribution of
the $I(y)$'s will be correspondingly shifted right. If there are some $y$'s
which are less than $\min x$ we will have a lot of zeroes and the distribution
of $I(y)$ will be shifted left.

At Mozilla, we have a slew of measurments, see
this website.
I was asked to create some indices of the `CYCLE_COLLECTOR*`

measures. Index
creation is varied and one method I haven’t seen is comparing the data of a new
population to the distribution of a reference population using the above
method. The typical methods are based on normalization and essentially comparing
on the standard deviation scale. Using the index method, i need not worry about
outliers, or the shape of the distributions. The approach i’m thinking of involves

- take a reference population, say WINNT, 32 bit, Firefox v35 (A)
- compute the unique quantiles of
`CYCLE_COLLECTOR_COLLECTED`

(for example) - use the above method to compute $I(y)$ for the
`CYCLE_COLLECTOR_COLLECTED`

data from another population (B) - and use some summary statistic to compare $I(y)$ for the data for B with the $I(y)$ for A ( applying $I$ to the data for A will result in a uniform distribution)

That is just for one measure, i need a way to combine all the `CYCLE_COLLECTOR*`

measures into one `CYCLE_COLLECTION`

index.

```
.assign.bucket <- function(xp, M){
## xp is a vector of numbers
## xp can be data table, in which case 'x' are the values of X and
## 'n' is their frequency
x <- as.numeric(c(-Inf,if(is(xp,"data.table"))
xp[,wtd.quantile(x,n, 1:M/M)]
else
quantile(xp, 1:(M)),Inf))
xu <- sort(unique(x))
xt <- format(x, trim=TRUE,nsmall=20)
xstar <- split(1:(M+2), xt)
xstar <- xstar[order(as.numeric(names(xstar)))]
allcont <- all( unlist(lapply(xstar, length))==1)
structure(
if(allcont)
function(y,...)
findInterval(y,xu,all.inside=TRUE) -1
else
function(y,bk=FALSE) {
X <- findInterval(y, xu,all.inside=TRUE,rightmost.closed=TRUE)
if(bk) return(X)
sapply(X, function(yp) {
z <- xstar[[ yp ]]-1
if(length(z)==1) z else sample(z,1)
})}
, src = x, xstar=xstar)
xp <- sample(c(1,2), 4000,replace=TRUE)
h <- pmethod(xp,10)
yp <- sample(c(1,2), 40000,replace=TRUE)
yps <- h(yp)
round(prop.table(table(yps))*100,1)
}
```

The above assumes we have at least two distinct $x_i$ apart from $\infty$ and $-\infty$. Compute standard errors for what ever summary statistic I use in the above steps.

As always, I like to leave people with photos. The photos below were taken with an XPan using Kodak 800 and TriX-400. The square pictures were with a hasselblad on Portra 160(?, could be 400). On the whole I’m quite a fan of Photoworks, San Francisco. Their scan quality is very good though slightly overpriced. Now does that mean i lose myself down the rabbit hole of flatbed scanning? Well, not exactly, I’ve gone and gotten myself a Pakon scanner. Which thankfully is not such a rabbit hole as i might have thought. More on that later.