class: center, middle, inverse, title-slide # One line, two lines; red line, blue line. ### Stephen E. Lane
CEBRA --- <!-- Time-stamp: <2018-06-01 23:48:11 (slane)> --> <!-- Set fancy box in R --> # Acknowledgements .logo-overlay[![](../assets/logos/CEBRALogo-01.svg)] - Tony Arthur - Rob Cannon - Rose Souza Richards - Claire McDonald - Andrew Robinson --- background-image: url(img/redpill-bluepill.jpeg) background-size: cover --- class: inverse, center, middle background-image: # Sampling From Multiple Lines --- layout: true class: left, top background-image: url(../assets/logos/CEBRALogo-01.svg) background-size: 215px background-position: right top --- # ISPM 31 A lot to be sampled should be a number of units of a single commodity identifiable by its homogeneity in factors such as: .pull-left[ - origin - grower - packing facility - species, variety, or degree of maturity - exporter ] .pull-right[ - area of production - regulated pests and their characteristics - treatment at origin - type of processing. ] --- # Heterogeneous Consignments .pull-left[ ![](img/VegCorn.jpg) <!-- var rugosa --> ] .pull-right[ ![](img/800px-Corncobs.jpg) <!-- var indurata --> .footnote[ .right[ <a rel="license" href="https://creativecommons.org/licenses/by-sa/2.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/2.0/80x15.png" /></a> [Sam Fentress](https://commons.wikimedia.org/wiki/File:Corncobs.jpg) ] ] ] --- # The Regulator Suppose the regulator is happy to apply a compliance rule across an entire consignment. Then: - the *consignment* is compliant if the sample (across multiple lines) returns no contaminated items - if contaminated items are found, the *whole* consignment is rejected --- class: center, middle # The Problem .bigr[ How then, do we sample from each line, without making assumptions about individual line prevalences? ] -- # The Solution .bigr[ <span style="color:#f92672">By sampling proportional to the line volumes.</span> ] --- # The Setup Assume there are `\(K\)` lines, with `\(N_{k}\)` units in the `\(k\)`<sup>th</sup> line making a total of `\(N = \sum_{k} N_{k}\)` units. If the contamination rate in line `\(k\)` is `\(p_{k}\)` then the joint contamination rate, `\(p\)`, satisfies `\(\sum_{k} N_{k} p_{k} = N \cdot p\)`. Fix the total number of units sampled `\(n\)`, and choose some fixed weights `\(w_{k}\)` such that `\(\sum_k w_{k} = 1\)` and `\(n_{k} = n w_{k}\)`. The sensitivity of inspection is $$ `\begin{align} S = 1 - \prod_{k=1}^{K} (1 - p_{k}) ^ {n_{k}} \end{align}` $$ --- # The Line Prevalences With our sample size `\(n\)` and weights `\(w_{k}\)` fixed, we can find the line prevalences that will minimise the sensitivity as $$ `\begin{align} 1 - p_{k} = (1 - p) w_{k} \frac{N}{N_{k}} \end{align}` $$ -- > Note: this is true for any set of fixed weights, and should alert you to what form the `\(w_{k}\)` should take! Thus, if we take our weights as proportional to the line volumes, sensitivity is *minimised* when all line prevalences are equal to the joint contamination rate `\(p\)`. --- # The Sample Recall the sensitivity: $$ `\begin{align} S = 1 - \prod_{k=1}^{K} (1 - p_{k}) ^ {n_{k}} \end{align}` $$ Substituting `\(p_{k} = p\)`, gives the familiar sample size formula: $$ `\begin{align} n = \log(1 - S) / \log(1 - p) \end{align}` $$ Which we divvy up by sampling (at random) `\(n_{k} = n\frac{N_{k}}{N}\)` units from line `\(k\)`. --- class: center, middle # Imperfect Detection .bigr[ What about it? ] --- class: center, middle # No Worries ![](img/no-worries.gif) --- class: left, top # Line-based Sensitivity We can just replace the line prevalences with `\(q_{k}=p_{k} s_{k}\)`: $$ `\begin{align} S = 1 - \prod_{k=1}^{K} (1 - q_{k}) ^ {n_{k}} \end{align}` $$ Everything else works out, with weights and overall sensitivity adjusted for the imperfect detection: $$ `\begin{align} q & = \frac{N p}{f} \\ w_{k} & = \frac{n}{f} \frac{N_{k}}{s_{k}} \\ \text{where } f & = \sum_{k}^{K} \frac{N_{k}}{s_{k}} \end{align}` $$ --- class: inverse, center, middle background-image: # Sample Size Calculations for Phytosanitary Testing of Small Lots of Seed --- # The Problem Minimal infected material is required to transmit viruses and diseases, so we want to be *really* sure we detect contamination if present. - Common design parameters: - `\(p=0.1\%\)` - `\(S=0.95\)` <hr> -- .bigr[ Requires a sample size of approximately <span style="color:#f92672">3000</span>! ] (from our previous discussion on sensitivity) --- # The Problem .pull-left[ ![](img/small-seeds.jpg) ] .pull-right[ ![](img/Bulk-cargo-shipping-m1.jpg) .footnote[ .right[ [StockCargo](http://www.stockcargo.eu/) ] ] ] --- # Focussing on the Line What if there's only 1500 seeds in the consignment? -- We commonly appeal to the Hypergeometric distribution, `\(X \sim \text{Hypergeometric}(x, m, N, n)\)`. $$ `\begin{align} \Pr(X = x) & = \frac{{{m \choose x}} {{{N-m}} \choose {{n-x}}}}{{{N \choose n}}} \text{, for } x \in 0, 1, \ldots, \min\{m, n\} \end{align}` $$ Where `\(X\)` is the number of contaminated seeds present in a sample of size `\(n\)`, that is drawn from a lot of size `\(N\)` containing `\(m\)` contaminated seeds. --- # Example With design parameters and lot size: - `\(p=0.1\%\)` - `\(S=0.95\)` - `\(N=1500\)` we would *expect* there to be 1.5 contaminated seeds in the lot. .bigr[🤔] <hr> -- It's common to *round down* so that we assume in fact we have 1 contaminated seed. This obviously changes our design prevalence; it is now 0.07%. --- # Non-monotone Behaviour of the Sample Size <a href="index_files/figure-html/non-monotone-1.png" data-fancybox><img src="index_files/figure-html/non-monotone-1.png" width="622px" height="350px" style="display: block; margin: auto;" /></a> --- class: center, middle # Are My Design Parameters Arbitrary... ? ![](img/jacky.gif) What if we choose a sample size to minimise some measure of risk? --- # Leakage .center[ <img src="img/flow.png" height="400"></img> ] --- # What is the Expected Leakage? Intuitively, sometimes we select a sample that does not contain any contaminated seeds, *even though the consignment does*. This happens with probability `\(\Pr{}(X=0)\)`<sup>†</sup>. Sometimes, we do actually get some contaminated seeds in our sample; this happens with probability `\(\Pr{}(X>0)\)`. .footnote[ <sup>†</sup>Where `\(X\)` is the number of contaminated seed in the sample as we defined previously. ] <hr> -- If we **don't detect** any contaminated seeds, then `\(p \times (N - n)\)` contaminated will be <span style="color:#f92672">leaked</span> on average. If we **detect** some of the contaminated seeds, then 0 seeds are leaked (the whole consignment is rejected). Putting these together, the average number leaked is $$ `\begin{align} p \times{} (N - n) \times{} \Pr{}(X=0) \end{align}` $$ --- # The Average Leakage Rate <a href="index_files/figure-html/average-leakage-1.png" data-fancybox><img src="index_files/figure-html/average-leakage-1.png" width="622px" height="350px" style="display: block; margin: auto;" /></a> Leakage rate for lots of size `\(N=2500\)`, sample sizes of `\(n \in \{500, 600, \ldots, 1200\}\)`. --- class: center, middle # So What? ![](img/money.gif) --- # Maximum Average Leakage Rate ### How to set it? Suppose for a particular pathway there are 20 lots of 2500 seeds per year on average, and that a particular disease has a seed-to-seedling transmission rate of 19%. If contaminated, approximately `\(0.19 \times 20 \times 2500=9500\)` seeds would be capable of transmitting the disease if contaminated. Assume that we are willing to allow the possibility of at most `\(c=1\)` seed contaminated, to actually transmit the disease per year. We would thus set the maximum average leakage rate for the pathway at `\(1 / 9500 = 0.011\)`%. .footnote[ This is clearly just one example of how a maximum average leakage rate could be set. ] --- # Choosing a Sample Size We can show that the maximum average leakage rate for sample size `\(n\)` and lot size `\(N\)` is $$ `\begin{align} \max E(a) & = \frac{1}{n+1} \frac{N}{N - n} \left(1 - \frac{1}{n+1}\right)^{n} \end{align}` $$ 1. Set `\(n_{0} = 1\)`; 2. Calculate `\(\max E(a) | n_{0}\)` (in the Equation above); 3. Repeat the following until `\(\max E(a) | n_{k}\)` is less than the allowable leakage rate: - Set `\(n_{k} = n_{k-1} + 1\)`; - Calculate `\(\max E(a) | n_{k}\)`. 4. Use `\(n_{k}\)` as calculated in Step 3 as the optimal sample size. --- # Back to the Example Recall we said we'd be willing to allow the possibility of at most `\(c=1\)` seed contaminated, to actually transmit the disease per year, and our maximum average leakage rate was set at 0.011%. Assume we have an incoming lot of size 2500. Using the previous algorithm, we find that `\(n = 1458\)` should be sampled. <a href="index_files/figure-html/example-leakage-1.png" data-fancybox><img src="index_files/figure-html/example-leakage-1.png" width="622px" height="350px" style="display: block; margin: auto;" /></a> --- class: center, middle # Why is this Cool? ## Two Reasons .bigr[ <span style="color:#f92672">Because the focus is on the whole pathway, not just a particular consignment,</span> and <span style="color:#f92672">Because it allows us to focus on impact, not convention.</span> ] --- # Duality with Convention Of course, we could always turn this back in to a conventional setting. If the design contamination rate is 0.068%, then sampling 1458 seeds from a lot of 2500 has sensitivity 75.3% *for that particular lot*. <a href="index_files/figure-html/example-duality-1.png" data-fancybox><img src="index_files/figure-html/example-duality-1.png" width="622px" height="350px" style="display: block; margin: auto;" /></a> --- class: top, left # My Details <i class="fa fa-building fa-2x" aria-hidden="true"></i>.bigr[[CEBRA](https://cebra.unimelb.edu.au/)] <i class="fa fa-twitter fa-2x" aria-hidden="true"></i>.bigr[[@stephenelane](https://twitter.com/stephenelane/)] <i class="fa fa-user fa-2x" aria-hidden="true"></i>.bigr[[https://gtown-ds.netlify.com/](https://gtown-ds.netlify.com/)] <i class="fa fa-github fa-2x" aria-hidden="true"></i>.bigr[[SteveLane](https://github.com/SteveLane/)] </br>