class: center, middle, inverse, title-slide

# One line, two lines; red line, blue line.
### Stephen E. Lane CEBRA

---

# Acknowledgements

.logo-overlay[![](../assets/logos/CEBRALogo-01.svg)]

- Tony Arthur
- Rob Cannon
- Rose Souza Richards
- Claire McDonald
- Andrew Robinson

---
background-image: url(img/redpill-bluepill.jpeg)
background-size: cover

---
class: inverse, center, middle
background-image:

# Sampling From Multiple Lines

---
layout: true
class: left, top
background-image: url(../assets/logos/CEBRALogo-01.svg)
background-size: 215px
background-position: right top

---

# ISPM 31

A lot to be sampled should be a number of units of a single commodity identifiable by its homogeneity in factors such as:

.pull-left[
- origin
- grower
- packing facility
- species, variety, or degree of maturity
- exporter
]

.pull-right[
- area of production
- regulated pests and their characteristics
- treatment at origin
- type of processing.
]

---

# Heterogeneous Consignments

.pull-left[
![](img/VegCorn.jpg)

]

.pull-right[
![](img/800px-Corncobs.jpg)
]

.footnote[
.right[
<a rel="license" href="https://creativecommons.org/licenses/by-sa/2.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/2.0/80x15.png" /></a> [Sam Fentress](https://commons.wikimedia.org/wiki/File:Corncobs.jpg)
]
]

---

# The Regulator

Suppose the regulator is happy to apply a compliance rule across an entire consignment. Then:

- the *consignment* is compliant if the sample (across multiple lines) returns no contaminated items
- if contaminated items are found, the *whole* consignment is rejected

---
class: center, middle

# The Problem

.bigr[
How, then, do we sample from each line without making assumptions about individual line prevalences?
]

# The Solution

.bigr[
By sampling proportional to the line volumes.
]

---

# The Setup

Assume there are `$K$` lines, with `$N_{k}$` units in the `$k$`th line making a total of `$N = \sum_{k} N_{k}$` units. If the contamination rate in line `$k$` is `$p_{k}$` then the joint contamination rate, `$p$`, satisfies `$\sum_{k} N_{k} p_{k} = N \cdot p$`.

Fix the total number of units sampled, `$n$`, and choose fixed weights `$w_{k}$` such that `$\sum_k w_{k} = 1$`; the sample taken from line `$k$` is then `$n_{k} = n w_{k}$`.

The sensitivity of inspection is
$$
`\begin{align}
  S = 1 - \prod_{k=1}^{K} (1 - p_{k}) ^ {n_{k}}
\end{align}`
$$
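
As a rough illustration, the sensitivity formula can be sketched in Python. The line volumes, prevalences, and total sample size below are made-up numbers, and `sensitivity` is a hypothetical helper name rather than anything from the original analysis.

```python
def sensitivity(prevalences, sample_sizes):
    """S = 1 - prod_k (1 - p_k) ** n_k."""
    miss = 1.0
    for p_k, n_k in zip(prevalences, sample_sizes):
        miss *= (1.0 - p_k) ** n_k
    return 1.0 - miss


# Hypothetical consignment with three lines (illustrative values only).
line_sizes = [5000, 3000, 2000]           # N_k
line_prevalences = [0.001, 0.002, 0.0005]  # p_k
n = 600                                    # total units sampled

N = sum(line_sizes)
weights = [N_k / N for N_k in line_sizes]  # w_k, summing to 1
sample_sizes = [n * w_k for w_k in weights]  # n_k = n * w_k

# Joint contamination rate p, from sum_k N_k p_k = N p.
joint_p = sum(N_k * p_k for N_k, p_k in zip(line_sizes, line_prevalences)) / N

S = sensitivity(line_prevalences, sample_sizes)
print(f"joint p = {joint_p:.4f}, sensitivity = {S:.3f}")
```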

---

# The Line Prevalences

With our sample size `$n$` and weights `$w_{k}$` fixed, we can find the line prevalences that will minimise the sensitivity as
$$
`\begin{align}
  1 -  p_{k} = (1 - p) w_{k} \frac{N}{N_{k}}
\end{align}`
$$

> Note: this is true for any set of fixed weights, and should alert you to what form the `$w_{k}$` should take!

Thus, if we take our weights as proportional to the line volumes, sensitivity is *minimised* when all line prevalences are equal to the joint contamination rate `$p$`.
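
A quick numerical check of this claim, as a sketch under assumed values: with two lines sampled in proportion to their volumes, we hold the joint rate `p` fixed, vary the prevalence in line 1 over a grid, and look for the value that gives the lowest sensitivity. The volumes, `p`, `n`, and the function name are all illustrative.

```python
def sensitivity_two_lines(p1, N1, N2, p, n):
    """Sensitivity for two lines when line 1 has prevalence p1.

    Line 2's prevalence follows from the constraint N1*p1 + N2*p2 = N*p,
    and the sample is split in proportion to the line volumes.
    """
    N = N1 + N2
    p2 = (N * p - N1 * p1) / N2
    n1, n2 = n * N1 / N, n * N2 / N
    return 1.0 - (1.0 - p1) ** n1 * (1.0 - p2) ** n2


# Illustrative values only.
N1, N2, p, n = 6000, 4000, 0.001, 500

# Candidate prevalences for line 1: 0, 0.0001, ..., 0.0016
# (all of these keep the implied p2 non-negative).
grid = [i / 10000 for i in range(17)]
worst_p1 = min(grid, key=lambda p1: sensitivity_two_lines(p1, N1, N2, p, n))
print(worst_p1)  # the grid point where p1 equals the joint rate p
```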

---

# The Sample

Recall the sensitivity:
$$
`\begin{align}
  S = 1 - \prod_{k=1}^{K} (1 - p_{k}) ^ {n_{k}}
\end{align}`
$$

Substituting `$p_{k} = p$` gives the familiar sample size formula:
$$
`\begin{align}
  n = \log(1 - S) / \log(1 - p)
\end{align}`
$$

We then divide this among the lines by sampling (at random) `$n_{k} = n\frac{N_{k}}{N}$` units from line `$k$`.
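
A minimal sketch of the two steps: compute the total sample size, then allocate it in proportion to line volume. Rounding up (both the total and each line's share) is an assumption made here, not something stated on the slide, and the line volumes are hypothetical.

```python
import math


def total_sample_size(S, p):
    """n = log(1 - S) / log(1 - p), rounded up to a whole unit."""
    return math.ceil(math.log(1.0 - S) / math.log(1.0 - p))


def allocate(n, line_sizes):
    """Split n across lines in proportion to volume, rounding each share up."""
    N = sum(line_sizes)
    return [math.ceil(n * N_k / N) for N_k in line_sizes]


n = total_sample_size(S=0.95, p=0.001)
per_line = allocate(n, [5000, 3000, 2000])  # hypothetical line volumes
print(n, per_line)
```

Rounding each line up means the allocated total can slightly exceed `n`, which errs on the side of higher sensitivity.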

---
class: center, middle

# Imperfect Detection

.bigr[
What about it?
]

---
class: center, middle

# No Worries

![](img/no-worries.gif)

---
class: left, top

# Line-based Sensitivity

We can just replace the line prevalences with `$q_{k}=p_{k} s_{k}$`, where `$s_{k}$` is the probability of detecting a contaminated unit sampled from line `$k$`:
$$
`\begin{align}
  S = 1 - \prod_{k=1}^{K} (1 - q_{k}) ^ {n_{k}}
\end{align}`
$$

Everything else works out as before, with the weights and the joint contamination rate adjusted for imperfect detection:
$$
`\begin{align}
  q & = \frac{N p}{f} \\
  w_{k} & = \frac{1}{f} \frac{N_{k}}{s_{k}} \\
  \text{where } f & = \sum_{k=1}^{K} \frac{N_{k}}{s_{k}}
\end{align}`
$$
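
A sketch of the adjusted allocation, assuming `s_k` is the per-unit detection probability in line `k`. Each line's sample size is taken as `n_k = n (N_k / s_k) / f`, so lines where detection is harder get a larger share. The volumes, detection probabilities, and function names are illustrative.

```python
def adjusted_allocation(n, line_sizes, detect_probs):
    """Per-line sample sizes n_k = n * (N_k / s_k) / f."""
    scaled = [N_k / s_k for N_k, s_k in zip(line_sizes, detect_probs)]
    f = sum(scaled)
    return [n * m_k / f for m_k in scaled]


def apparent_joint_rate(p, line_sizes, detect_probs):
    """q = N * p / f, the joint rate after allowing for imperfect detection."""
    f = sum(N_k / s_k for N_k, s_k in zip(line_sizes, detect_probs))
    return sum(line_sizes) * p / f


# Hypothetical example: detection is perfect in line 1 and 50% in line 2.
line_sizes = [6000, 4000]
detect_probs = [1.0, 0.5]

n_k = adjusted_allocation(700, line_sizes, detect_probs)
q = apparent_joint_rate(0.001, line_sizes, detect_probs)
print(n_k, q)
```

Here the smaller line ends up with the larger sample because its contaminated units are only detected half the time.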

---
class: inverse, center, middle
background-image:

# Sample Size Calculations for Phytosanitary Testing of Small Lots of Seed

---

# The Problem

Only a small amount of infected material is needed to transmit viruses and diseases, so we want to be *really* sure we detect contamination if it is present.

- Common design parameters:
  - `$p=0.1\%$`
  - `$S=0.95$`

<hr>

.bigr[
Requires a sample size of approximately 3000!
]

(from our previous discussion on sensitivity)

---

# The Problem

.pull-left[
![](img/small-seeds.jpg)
]

.pull-right[
![](img/Bulk-cargo-shipping-m1.jpg)
]

.footnote[
.right[
[StockCargo](http://www.stockcargo.eu/)
]
]

---

# Focussing on the Line

What if there are only 1500 seeds in the consignment?

We commonly appeal to the hypergeometric distribution, `$X \sim \text{Hypergeometric}(m, N, n)$`:

$$
`\begin{align}
\Pr(X = x) & = \frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}} \text{, for } x \in \{0, 1, \ldots, \min\{m, n\}\}
\end{align}`
$$

where `$X$` is the number of contaminated seeds in a sample of size `$n$` drawn from a lot of size `$N$` containing `$m$` contaminated seeds.
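
For reference, a small sketch of this probability mass function using exact binomial coefficients from the standard library. The function names are hypothetical; the detection probability is simply `1 - Pr(X = 0)`.

```python
from math import comb


def hypergeom_pmf(x, m, N, n):
    """Pr(X = x): x contaminated seeds in a sample of n, drawn from a lot
    of N seeds of which m are contaminated (0 <= x <= min(m, n))."""
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)


def detection_probability(m, N, n):
    """Probability that the sample contains at least one contaminated seed."""
    return 1.0 - hypergeom_pmf(0, m, N, n)


# One contaminated seed in a lot of 1500, sampling 1425 seeds.
print(detection_probability(m=1, N=1500, n=1425))
```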

---

# Example

With design parameters and lot size:

- `$p=0.1\%$`
- `$S=0.95$`
- `$N=1500$`

we would *expect* there to be 1.5 contaminated seeds in the lot. .bigr[🤔]

<hr>

It's common to *round down* and assume there is exactly 1 contaminated seed. This changes our design prevalence: it is now `$1/1500 \approx 0.07\%$`.
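
To see how much the rounding matters, here is a sketch that searches for the smallest sample size reaching the target sensitivity under the hypergeometric model. Rounding *up* to 2 contaminated seeds is included purely for comparison; it is not part of the convention described above. Exact fractions are used to avoid floating-point trouble right at the 0.95 boundary, and `min_sample_size` is a hypothetical name.

```python
from fractions import Fraction
from math import comb


def min_sample_size(m, N, S):
    """Smallest n with 1 - Pr(X = 0) >= S when the lot of N holds m contaminated seeds."""
    target_miss = 1 - Fraction(str(S))
    for n in range(1, N + 1):
        p_miss = Fraction(comb(N - m, n), comb(N, n))  # Pr(X = 0)
        if p_miss <= target_miss:
            return n
    return N


n_round_down = min_sample_size(m=1, N=1500, S=0.95)  # 1.5 rounded down
n_round_up = min_sample_size(m=2, N=1500, S=0.95)    # 1.5 rounded up (comparison only)
print(n_round_down, n_round_up)
```

The required sample size drops sharply between the two roundings, so the choice is far from cosmetic.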

---

# Non-monotone Behaviour of the Sample Size

---
class: center, middle

# Are My Design Parameters Arbitrary... ?

![](img/jacky.gif)

What if we choose a sample size to minimise some measure of risk?

---

# Leakage

.center[
<img src="img/flow.png" height="400"></img>
]

---

# What is the Expected Leakage?

Intuitively, sometimes we select a sample that does not contain any contaminated seeds, *even though the consignment does*. This happens with probability `$\Pr{}(X=0)$`&dagger;.

Sometimes, we do actually get some contaminated seeds in our sample; this happens with probability `$\Pr{}(X>0)$`.

.footnote[

&dagger;Where `$X$` is the number of contaminated seeds in the sample, as defined previously.

]

<hr>

If we **don't detect** any contaminated seeds, then `$p \times (N - n)$` contaminated seeds will be leaked on average. If we **detect** some of the contaminated seeds, then 0 seeds are leaked (the whole consignment is rejected).

Putting these together, the average number leaked is
$$
`\begin{align}
p \times{} (N - n) \times{} \Pr{}(X=0)
\end{align}`
$$
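
A minimal sketch of the average leakage. Two assumptions are made here that the slide does not spell out: `Pr(X = 0)` is approximated by the binomial form `(1 - p) ** n`, and the leakage *rate* is taken to be the average number leaked divided by the lot size `N`.

```python
def expected_leakage(p, N, n):
    """Average number of contaminated seeds leaked: p * (N - n) * Pr(X = 0)."""
    p_miss = (1.0 - p) ** n  # assumption: binomial approximation to Pr(X = 0)
    return p * (N - n) * p_miss


def leakage_rate(p, N, n):
    """Average leakage per seed in the lot (assumed definition of the rate)."""
    return expected_leakage(p, N, n) / N


# Illustrative values: lots of 2500 seeds at a contamination rate of 0.1%.
for n in range(500, 1300, 100):
    print(n, f"{leakage_rate(0.001, 2500, n):.6f}")
```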

---

# The Average Leakage Rate

Leakage rate for lots of size `$N=2500$`, sample sizes of `$n \in \{500, 600, \ldots, 1200\}$`.

---
class: center, middle

# So What?

![](img/money.gif)

---

# Maximum Average Leakage Rate

### How to set it?

Suppose for a particular pathway there are 20 lots of 2500 seeds per year on average, and that a particular disease has a seed-to-seedling transmission rate of 19%. Then approximately `$0.19 \times 20 \times 2500=9500$` seeds per year would be capable of transmitting the disease if contaminated.

Assume that we are willing to allow at most `$c=1$` contaminated seed to actually transmit the disease per year. We would thus set the maximum average leakage rate for the pathway at `$1 / 9500 \approx 0.011\%$`.
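
The arithmetic, spelled out as a short sketch (the variable names are just for illustration):

```python
lots_per_year = 20
seeds_per_lot = 2500
transmission_rate = 0.19   # seed-to-seedling transmission
allowed_transmissions = 1  # c

seeds_per_year = lots_per_year * seeds_per_lot           # 50000
transmitting_seeds = transmission_rate * seeds_per_year  # about 9500
max_leakage_rate = allowed_transmissions / transmitting_seeds

print(f"{max_leakage_rate:.3%}")  # 0.011%
```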

.footnote[
This is clearly just one example of how a maximum average leakage rate could be set.
]

---

# Choosing a Sample Size

We can show that the maximum average leakage rate (over all contamination rates `$p$`, attained at `$p = 1/(n+1)$`) for sample size `$n$` and lot size `$N$` is
$$
`\begin{align}
  \max E(a) & = \frac{1}{n+1} \frac{N - n}{N} \left(1 - \frac{1}{n+1}\right)^{n}
\end{align}`
$$

1. Set `$n_{0} = 1$`;
2. Calculate `$\max E(a) | n_{0}$` (in the Equation above);
3. Repeat the following until `$\max E(a) | n_{k}$` is less than the allowable leakage rate:
  - Set `$n_{k} = n_{k-1} + 1$`;
  - Calculate `$\max E(a) | n_{k}$`.
4. Use `$n_{k}$` as calculated in Step 3 as the optimal sample size.
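
The steps above can be sketched as a simple loop. The bound used in the code is the worst case over `p` of `p * (N - n) / N * (1 - p) ** n`, which occurs at `p = 1 / (n + 1)`; this relies on a binomial approximation for `Pr(X = 0)` and is an assumption of this sketch. The function names are hypothetical.

```python
def max_average_leakage_rate(n, N):
    """Worst-case (over p) average leakage rate for sample size n, lot size N."""
    p_worst = 1.0 / (n + 1)  # contamination rate that maximises the leakage
    return p_worst * (N - n) / N * (1.0 - p_worst) ** n


def choose_sample_size(N, allowable_rate):
    """Increase n from 1 until the worst-case leakage rate drops below the limit."""
    n = 1
    while n < N and max_average_leakage_rate(n, N) >= allowable_rate:
        n += 1
    return n


# Lot of 2500 seeds, with the allowable rate set to exactly 1/9500
# (not the rounded 0.011%).
n_opt = choose_sample_size(N=2500, allowable_rate=1 / 9500)
print(n_opt)
```

A linear search is slow but transparent; because the bound decreases as `n` grows, a bisection search would give the same answer faster.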

---

# Back to the Example

Recall that we were willing to allow at most `$c=1$` contaminated seed to actually transmit the disease per year, which set our maximum average leakage rate at 0.011%. Assume we have an incoming lot of size 2500. Using the previous algorithm, we find that `$n = 1458$` seeds should be sampled.

---
class: center, middle

# Why is this Cool?

## Two Reasons

.bigr[
Because the focus is on the whole pathway, not just a particular consignment,

and

Because it allows us to focus on impact, not convention.
]

---

# Duality with Convention

Of course, we could always turn this back into a conventional setting. If the design contamination rate is 0.068%, then sampling 1458 seeds from a lot of 2500 has sensitivity 75.3% *for that particular lot*.

---
class: top, left

# My Details

.bigr[[CEBRA](https://cebra.unimelb.edu.au/)]

.bigr[[@stephenelane](https://twitter.com/stephenelane/)]

.bigr[[https://gtown-ds.netlify.com/](https://gtown-ds.netlify.com/)]

.bigr[[SteveLane](https://github.com/SteveLane/)]