# Sampling#

## Synopsis#

• Effect on proportion

• Percentages

• Bad as percentage can have different intuition for different ranges

• Difference is subjective

• Odds Ratio

• LOR

• Most appropriate (symmetric)

• Choosing Effect Size indicator is driven by context and familiarity. Use something which makes sense to individuals/ audience

• Precision

• How precise is data?

• Sources of Errors

• Sampling Errors

• Is my sample representative of population?

• Measurement Errors

• UOM

• Improper Recording

• Random Errors

• Least important

• Only one with mathematical model behind it, so it’s used all the time in statistics classes

## Imports#

import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib as mpl
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
import pandas as pd
from IPython.display import display, display_html
import bqplot

from bqplot import LinearScale, Hist, Figure, Axis, ColorScale

from bqplot import pyplot as pltbq
# import seaborn as sns
# seed the random number generator so we all get the same results
np.random.seed(17)

# some nice colors from http://colorbrewer2.org/
COLOR1 = '#7fc97f'
COLOR2 = '#beaed4'
COLOR3 = '#fdc086'
COLOR4 = '#ffff99'
COLOR5 = '#386cb0'

mpl.rcParams['figure.figsize'] = (8.0, 9.0)

%matplotlib inline


## Part 1#

• Estimate Avg. weight of man & woman in US

• Quantify uncertainity in estimate

• Approach =>

• Simulate many exp

• Compare how results vary from one experiment to another

• Start by assuming distribution and move on show how to eliminate this assumption (solve without it)

#Weight of woman(in kg)

weight = stats.lognorm(0.23, 0, 70.8)
weight.mean(), weight.std()

(72.69764573296688, 16.944043048498038)

help(stats.lognorm)

Help on lognorm_gen in module scipy.stats._continuous_distns object:

class lognorm_gen(scipy.stats._distn_infrastructure.rv_continuous)
|  lognorm_gen(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
|
|  A lognormal continuous random variable.
|
|  %(before_notes)s
|
|  Notes
|  -----
|  The probability density function for lognorm is:
|
|  .. math::
|
|      f(x, s) = \frac{1}{s x \sqrt{2\pi}}
|                \exp\left(-\frac{\log^2(x)}{2s^2}\right)
|
|  for :math:x > 0, :math:s > 0.
|
|  lognorm takes s as a shape parameter for :math:s.
|
|  %(after_notes)s
|
|  A common parametrization for a lognormal random variable Y is in
|  terms of the mean, mu, and standard deviation, sigma, of the
|  unique normally distributed random variable X such that exp(X) = Y.
|  This parametrization corresponds to setting s = sigma and scale =
|  exp(mu).
|
|  %(example)s
|
|  Method resolution order:
|      lognorm_gen
|      scipy.stats._distn_infrastructure.rv_continuous
|      scipy.stats._distn_infrastructure.rv_generic
|      builtins.object
|
|  Methods defined here:
|
|  fit(self, data, *args, **kwds)
|      Return MLEs for shape (if applicable), location, and scale
|      parameters from data.
|
|      MLE stands for Maximum Likelihood Estimate.  Starting estimates for
|      the fit are given by input arguments; for any arguments not provided
|      with starting estimates, self._fitstart(data) is called to generate
|      such.
|
|      One can hold some parameters fixed to specific values by passing in
|      keyword arguments f0, f1, ..., fn (for shape parameters)
|      and floc and fscale (for location and scale parameters,
|      respectively).
|
|      Parameters
|      ----------
|      data : array_like
|          Data to use in calculating the MLEs.
|      arg1, arg2, arg3,... : floats, optional
|          Starting value(s) for any shape-characterizing arguments (those not
|          provided will be determined by a call to _fitstart(data)).
|          No default value.
|      kwds : floats, optional
|          - loc: initial guess of the distribution's location parameter.
|          - scale: initial guess of the distribution's scale parameter.
|
|          Special keyword arguments are recognized as holding certain
|          parameters fixed:
|
|          - f0...fn : hold respective shape parameters fixed.
|            Alternatively, shape parameters to fix can be specified by name.
|            For example, if self.shapes == "a, b", fa and fix_a
|            are equivalent to f0, and fb and fix_b are
|            equivalent to f1.
|
|          - floc : hold location parameter fixed to specified value.
|
|          - fscale : hold scale parameter fixed to specified value.
|
|          - optimizer : The optimizer to use.  The optimizer must take func,
|            and starting position as the first two arguments,
|            plus args (for extra arguments to pass to the
|            function to be optimized) and disp=0 to suppress
|            output as keyword arguments.
|
|      Returns
|      -------
|      mle_tuple : tuple of floats
|          MLEs for any shape parameters (if applicable), followed by those
|          for location and scale. For most random variables, shape statistics
|          will be returned, but there are exceptions (e.g. norm).
|
|      Notes
|      -----
|      This fit is computed by maximizing a log-likelihood function, with
|      penalty applied for samples outside of range of the distribution. The
|      returned answer is not guaranteed to be the globally optimal MLE, it
|      may only be locally optimal, or the optimization may fail altogether.
|      If the data contain any of np.nan, np.inf, or -np.inf, the fit routine
|      will throw a RuntimeError.
|
|      When the location parameter is fixed by using the floc argument,
|      this function uses explicit formulas for the maximum likelihood
|      estimation of the log-normal shape and scale parameters, so the
|      optimizer, loc and scale keyword arguments are ignored.
|
|      Examples
|      --------
|
|      Generate some data to fit: draw random variates from the beta
|      distribution
|
|      >>> from scipy.stats import beta
|      >>> a, b = 1., 2.
|      >>> x = beta.rvs(a, b, size=1000)
|
|      Now we can fit all four parameters (a, b, loc and scale):
|
|      >>> a1, b1, loc1, scale1 = beta.fit(x)
|
|      We can also use some prior knowledge about the dataset: let's keep
|      loc and scale fixed:
|
|      >>> a1, b1, loc1, scale1 = beta.fit(x, floc=0, fscale=1)
|      >>> loc1, scale1
|      (0, 1)
|
|      We can also keep shape parameters fixed by using f-keywords. To
|      keep the zero-th shape parameter a equal 1, use f0=1 or,
|      equivalently, fa=1:
|
|      >>> a1, b1, loc1, scale1 = beta.fit(x, fa=1, floc=0, fscale=1)
|      >>> a1
|      1
|
|      Not all distributions return estimates for the shape parameters.
|      norm for example just returns estimates for location and scale:
|
|      >>> from scipy.stats import norm
|      >>> x = norm.rvs(a, b, size=1000, random_state=123)
|      >>> loc1, scale1 = norm.fit(x)
|      >>> loc1, scale1
|      (0.92087172783841631, 2.0015750750324668)
|
|  ----------------------------------------------------------------------
|  Methods inherited from scipy.stats._distn_infrastructure.rv_continuous:
|
|  __init__(self, momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
|      Initialize self.  See help(type(self)) for accurate signature.
|
|  cdf(self, x, *args, **kwds)
|      Cumulative distribution function of the given RV.
|
|      Parameters
|      ----------
|      x : array_like
|          quantiles
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      cdf : ndarray
|          Cumulative distribution function evaluated at x
|
|  expect(self, func=None, args=(), loc=0, scale=1, lb=None, ub=None, conditional=False, **kwds)
|      Calculate expected value of a function with respect to the
|      distribution by numerical integration.
|
|      The expected value of a function f(x) with respect to a
|      distribution dist is defined as::
|
|                  ub
|          E[f(x)] = Integral(f(x) * dist.pdf(x)),
|                  lb
|
|      where ub and lb are arguments and x has the dist.pdf(x)
|      distribution. If the bounds lb and ub correspond to the
|      support of the distribution, e.g. [-inf, inf] in the default
|      case, then the integral is the unrestricted expectation of f(x).
|      Also, the function f(x) may be defined such that f(x) is 0
|      outside a finite interval in which case the expectation is
|      calculated within the finite range [lb, ub].
|
|      Parameters
|      ----------
|      func : callable, optional
|          Function for which integral is calculated. Takes only one argument.
|          The default is the identity mapping f(x) = x.
|      args : tuple, optional
|          Shape parameters of the distribution.
|      loc : float, optional
|          Location parameter (default=0).
|      scale : float, optional
|          Scale parameter (default=1).
|      lb, ub : scalar, optional
|          Lower and upper bound for integration. Default is set to the
|          support of the distribution.
|      conditional : bool, optional
|          If True, the integral is corrected by the conditional probability
|          of the integration interval.  The return value is the expectation
|          of the function, conditional on being in the given interval.
|          Default is False.
|
|      Additional keyword arguments are passed to the integration routine.
|
|      Returns
|      -------
|      expect : float
|          The calculated expected value.
|
|      Notes
|      -----
|      The integration behavior of this function is inherited from
|      scipy.integrate.quad. Neither this function nor
|      scipy.integrate.quad can verify whether the integral exists or is
|      finite. For example cauchy(0).mean() returns np.nan and
|      cauchy(0).expect() returns 0.0.
|
|      The function is not vectorized.
|
|      Examples
|      --------
|
|      To understand the effect of the bounds of integration consider
|
|      >>> from scipy.stats import expon
|      >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0)
|      0.6321205588285578
|
|      This is close to
|
|      >>> expon(1).cdf(2.0) - expon(1).cdf(0.0)
|      0.6321205588285577
|
|      If conditional=True
|
|      >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0, conditional=True)
|      1.0000000000000002
|
|      The slight deviation from 1 is due to numerical integration.
|
|  fit_loc_scale(self, data, *args)
|      Estimate loc and scale parameters from data using 1st and 2nd moments.
|
|      Parameters
|      ----------
|      data : array_like
|          Data to fit.
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|
|      Returns
|      -------
|      Lhat : float
|          Estimated location parameter for the data.
|      Shat : float
|          Estimated scale parameter for the data.
|
|  isf(self, q, *args, **kwds)
|      Inverse survival function (inverse of sf) at q of the given RV.
|
|      Parameters
|      ----------
|      q : array_like
|          upper tail probability
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      x : ndarray or scalar
|          Quantile corresponding to the upper tail probability q.
|
|  logcdf(self, x, *args, **kwds)
|      Log of the cumulative distribution function at x of the given RV.
|
|      Parameters
|      ----------
|      x : array_like
|          quantiles
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      logcdf : array_like
|          Log of the cumulative distribution function evaluated at x
|
|  logpdf(self, x, *args, **kwds)
|      Log of the probability density function at x of the given RV.
|
|      This uses a more numerically accurate calculation if available.
|
|      Parameters
|      ----------
|      x : array_like
|          quantiles
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      logpdf : array_like
|          Log of the probability density function evaluated at x
|
|  logsf(self, x, *args, **kwds)
|      Log of the survival function of the given RV.
|
|      Returns the log of the "survival function," defined as (1 - cdf),
|      evaluated at x.
|
|      Parameters
|      ----------
|      x : array_like
|          quantiles
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      logsf : ndarray
|          Log of the survival function evaluated at x.
|
|  nnlf(self, theta, x)
|      Return negative loglikelihood function.
|
|      Notes
|      -----
|      This is -sum(log pdf(x, theta), axis=0) where theta are the
|      parameters (including loc and scale).
|
|  pdf(self, x, *args, **kwds)
|      Probability density function at x of the given RV.
|
|      Parameters
|      ----------
|      x : array_like
|          quantiles
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      pdf : ndarray
|          Probability density function evaluated at x
|
|  ppf(self, q, *args, **kwds)
|      Percent point function (inverse of cdf) at q of the given RV.
|
|      Parameters
|      ----------
|      q : array_like
|          lower tail probability
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      x : array_like
|          quantile corresponding to the lower tail probability q.
|
|  sf(self, x, *args, **kwds)
|      Survival function (1 - cdf) at x of the given RV.
|
|      Parameters
|      ----------
|      x : array_like
|          quantiles
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      sf : array_like
|          Survival function evaluated at x
|
|  ----------------------------------------------------------------------
|  Methods inherited from scipy.stats._distn_infrastructure.rv_generic:
|
|  __call__(self, *args, **kwds)
|      Freeze the distribution for the given arguments.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution.  Should include all
|          the non-optional arguments, may include loc and scale.
|
|      Returns
|      -------
|      rv_frozen : rv_frozen instance
|          The frozen distribution.
|
|  __getstate__(self)
|
|  __setstate__(self, state)
|
|  entropy(self, *args, **kwds)
|      Differential entropy of the RV.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          Location parameter (default=0).
|      scale : array_like, optional  (continuous distributions only).
|          Scale parameter (default=1).
|
|      Notes
|      -----
|      Entropy is defined base e:
|
|      >>> drv = rv_discrete(values=((0, 1), (0.5, 0.5)))
|      >>> np.allclose(drv.entropy(), np.log(2.0))
|      True
|
|  freeze(self, *args, **kwds)
|      Freeze the distribution for the given arguments.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution.  Should include all
|          the non-optional arguments, may include loc and scale.
|
|      Returns
|      -------
|      rv_frozen : rv_frozen instance
|          The frozen distribution.
|
|  interval(self, alpha, *args, **kwds)
|      Confidence interval with equal areas around the median.
|
|      Parameters
|      ----------
|      alpha : array_like of float
|          Probability that an rv will be drawn from the returned range.
|          Each value should be in the range [0, 1].
|      arg1, arg2, ... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter, Default is 0.
|      scale : array_like, optional
|          scale parameter, Default is 1.
|
|      Returns
|      -------
|      a, b : ndarray of float
|          end-points of range that contain 100 * alpha % of the rv's
|          possible values.
|
|  mean(self, *args, **kwds)
|      Mean of the distribution.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      mean : float
|          the mean of the distribution
|
|  median(self, *args, **kwds)
|      Median of the distribution.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          Location parameter, Default is 0.
|      scale : array_like, optional
|          Scale parameter, Default is 1.
|
|      Returns
|      -------
|      median : float
|          The median of the distribution.
|
|      --------
|      rv_discrete.ppf
|          Inverse of the CDF
|
|  moment(self, n, *args, **kwds)
|      n-th order non-central moment of distribution.
|
|      Parameters
|      ----------
|      n : int, n >= 1
|          Order of moment.
|      arg1, arg2, arg3,... : float
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|  rvs(self, *args, **kwds)
|      Random variates of given type.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          Location parameter (default=0).
|      scale : array_like, optional
|          Scale parameter (default=1).
|      size : int or tuple of ints, optional
|          Defining number of random variates (default is 1).
|      random_state : {None, int, ~np.random.RandomState, ~np.random.Generator}, optional
|          If seed is None the ~np.random.RandomState singleton is used.
|          If seed is an int, a new RandomState instance is used, seeded
|          with seed.
|          If seed is already a RandomState or Generator instance,
|          then that object is used.
|          Default is None.
|
|      Returns
|      -------
|      rvs : ndarray or scalar
|          Random variates of given size.
|
|  stats(self, *args, **kwds)
|      Some statistics of the given RV.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional (continuous RVs only)
|          scale parameter (default=1)
|      moments : str, optional
|          composed of letters ['mvsk'] defining which moments to compute:
|          'm' = mean,
|          'v' = variance,
|          's' = (Fisher's) skew,
|          'k' = (Fisher's) kurtosis.
|          (default is 'mv')
|
|      Returns
|      -------
|      stats : sequence
|          of requested moments.
|
|  std(self, *args, **kwds)
|      Standard deviation of the distribution.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      std : float
|          standard deviation of the distribution
|
|  support(self, *args, **kwargs)
|      Return the support of the distribution.
|
|      Parameters
|      ----------
|      arg1, arg2, ... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter, Default is 0.
|      scale : array_like, optional
|          scale parameter, Default is 1.
|      Returns
|      -------
|      a, b : float
|          end-points of the distribution's support.
|
|  var(self, *args, **kwds)
|      Variance of the distribution.
|
|      Parameters
|      ----------
|      arg1, arg2, arg3,... : array_like
|          The shape parameter(s) for the distribution (see docstring of the
|      loc : array_like, optional
|          location parameter (default=0)
|      scale : array_like, optional
|          scale parameter (default=1)
|
|      Returns
|      -------
|      var : float
|          the variance of the distribution
|
|  ----------------------------------------------------------------------
|  Data descriptors inherited from scipy.stats._distn_infrastructure.rv_generic:
|
|  __dict__
|      dictionary for instance variables (if defined)
|
|  __weakref__
|      list of weak references to the object (if defined)
|
|  random_state
|      Get or set the RandomState object for generating random variates.
|
|      This can be either None, int, a RandomState instance, or a
|      np.random.Generator instance.
|
|      If None (or np.random), use the RandomState singleton used by np.random.
|      If already a RandomState or Generator instance, use it.
|      If an int, use a new RandomState instance seeded with seed.


Note

Log normal distribution is specified by 3 parameters

• $$\sigma$$ - Shape parameter or Standard deviation of Distribution

• $$m$$ - Scale parameter (median) shrinking or stretching the graphs

• $$\theta$$ or ($$\mu$$) - Location parameter ; specifying where it is located in map

For log normal distribution. Check following links

rv = stats.lognorm(1, 0, 50) # Don't know shape , location , median
# xs = np.linspace(1, 100, 100)
ys = rv.rvs(10000)
plt.hist(ys)

(array([9.308e+03, 5.550e+02, 8.500e+01, 3.000e+01, 1.000e+01, 3.000e+00,
4.000e+00, 3.000e+00, 0.000e+00, 2.000e+00]),
array([1.12258114e+00, 2.17935158e+02, 4.34747735e+02, 6.51560312e+02,
8.68372890e+02, 1.08518547e+03, 1.30199804e+03, 1.51881062e+03,
1.73562320e+03, 1.95243578e+03, 2.16924835e+03]),
<BarContainer object of 10 artists>)

# For women weight
weight = stats.lognorm(0.23, 0, 70.8)
weight.mean(), weight.std()

(72.69764573296688, 16.944043048498038)

xs = np.linspace(20, 160, 100)
ys = weight.pdf(xs)

plt.plot(xs, ys, linewidth=4, color=COLOR1)
plt.xlabel('weight (kg)')
plt.ylabel('PDF')
plt.show()

def make_sample(n=100):
sample = weight.rvs(n)
return sample

sample = make_sample(100)
sample.mean(), sample.std()

(73.04819598338332, 16.22111742103728)

def sample_stat(sample):
return sample.mean()

sample_stat(sample)

73.04819598338332


Simulating experiment 1000 times

def computing_sampling_distribution(n=100, iters=1000):
stats = [sample_stat(make_sample(n)) for i in range(iters)]
return np.array(stats)

sample_means = computing_sampling_distribution(n=100, iters=1000)

plt.hist(sample_means, color=COLOR2)
plt.show()

@interact
def sim(n=[10,20,50,100, 200, 1000, 10000], iters=[100, 1000, 10000]):
sample_means = computing_sampling_distribution(n, iters)
mean = sample_means.mean()
std = sample_means.std()
plt.hist(sample_means, bins=30, color=COLOR2)
plt.axvline(mean, color=COLOR4)
plt.axvline(mean -3*std, color=COLOR5)
plt.axvline(mean +3*std, color=COLOR5)

conf_int = np.percentile(sample_means, [2.5, 97.5]) # 95% confidence interval
plt.axvline(conf_int[0], color=COLOR1)
plt.axvline(conf_int[1], color=COLOR1)
plt.ylabel("count")
plt.xlabel(f"sample_means(n = {n})")
plt.title(f" iters={iters},  $\mu_s$={mean:0.4f},  $\sigma_s$={std:0.4f}")
plt.show()


## Application for Sim Monitor#


caption = widgets.Label(value='The values of slider1 and slider2 are synchronized')
sliders1, slider2 = widgets.IntSlider(description='Slider 1'),\
widgets.IntSlider(description='Slider 2')
l = widgets.link((sliders1, 'value'), (slider2, 'value'))
display(caption, sliders1, slider2)


### Simple plot Linked to output widget#

%matplotlib widget

widget_plot = widgets.Output()

with widget_plot:
widget_plot.clear_output()
ax.clear()
fig, ax = plt.subplots(1,1)
#     display(ax.figure)
plt.show()

widget_plot

ax.plot(range(1, 100))
fig.canvas.draw()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-203552d8af35> in <module>
----> 1 ax.plot(range(1, 100))
2 fig.canvas.draw()

NameError: name 'ax' is not defined

ax.clear()
ax.plot(range(1, 20))
fig.canvas.draw()

# @interact
# def update_plot(n=2):
#     ax.clear()
#     ax.plot(np.linspace(1,100,100)**n)
#     fig.canvas.draw()

widget_plot

widget_pow = widgets.IntSlider(value=2,min=0, max=5)
widget_pow

def update_plot(n=2):
ax.clear()
ax.plot(np.linspace(1,100,100)**n)
fig.canvas.draw()

def handle_change(change):
update_plot(change.new)

widget_pow.observe(handle_change, "value")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-23-cc35b1b6381a> in <module>
----> 1 widget_pow.observe(handle_change, "value")

NameError: name 'widget_pow' is not defined

widgets.VBox([widget_plot, widget_pow])

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-cb580b1392d7> in <module>
----> 1 widgets.VBox([widget_plot, widget_pow])

NameError: name 'widget_pow' is not defined


### Lognorm Application#

widgets.SelectionSlider(
options=[1,2, 5, 6, 10],
value=1,
description='n',
disabled=False,
continuous_update=False,
orientation='horizontal',
)

def sample_stat(sample, kind='mean'):
if kind =='mean':
return sample.mean()
elif kind == 'coeff_variation':
return sample.mean()/sample.std()
elif kind == 'min':
return sample.min()
elif kind == 'max':
return sample.max()
elif kind == 'median':
return np.percentile(sample, 50)
elif kind == 'p10':
return np.percentile(sample, 10)
elif kind == 'p90':
return np.percentile(sample, 90)
elif kind == 'IQR':
return np.percentile(sample, 75) - np.percentile(sample, 25)

def make_sample(rv, n=100):
sample = rv.rvs(n)
return sample

def computing_sampling_distribution(rv, n=100, iters=1000, kind='mean'):
stats = [sample_stat(make_sample(rv, n), kind) for i in range(iters)]
return np.array(stats)

sample_stat(sample, kind='mean')

fig, axes = plt.subplots(nrows=2, ncols=2)
axes

ax0, ax1, ax2, ax3 = axes.flatten()

# def figure(figsize=None):
#     'Temporary workaround for traditional figure behaviour with the ipympl widget backend'
#     fig = plt.figure()
#     if figsize:
#         w, h =  figsize
#     else:
#         w, h = plt.rcParams['figure.figsize']
#     fig.canvas.layout.height = str(h) + 'in'
#     fig.canvas.layout.width = str(w) + 'in'
#     return fig

def update_text(ax, x, y, s, align='right'):
ax.text(x, y, s,
horizontalalignment=align,
verticalalignment='top',
transform=ax.transAxes)

# class KindRange(object):
#     def __init__(self, kind, min_d, max_d):
#         self.kind = kind
#         self.min_d = min_d
#         self.max_d = max_d

class LogNormVisualizer(object):

def __init__(self, shape, loc, median, xs=np.linspace(20, 160, 100)):
plt.close('all')
self.shape = shape
self.loc = loc
self.median = median
self.w_shape = widgets.FloatText(value=shape, description='shape',disabled=False)
self.w_loc = widgets.FloatText(value=loc, description='loc',disabled=False)
self.w_median = widgets.FloatText(value=median, description='median',disabled=False)
self.w_n = widgets.SelectionSlider(
options=[10,20,50,100, 200, 1000, 10000],
value=10,
description='n',
disabled=False,
continuous_update=False,
orientation='horizontal',
)

self.w_kind = widgets.Dropdown(
options=['mean', 'coeff_variation', 'min', 'max', 'median', 'p10', 'p90', 'IQR'],
value='mean',
description='Statisitics',
disabled=False,
continuous_update=False,
orientation='horizontal',
)
self.kind = self.w_kind.value
self.w_iters = widgets.SelectionSlider(
options=[100,500, 1000, 5000, 10000],
value=100,
description='iters',
disabled=False,
continuous_update=False,
orientation='horizontal',
)
self.rv = stats.lognorm(shape, loc, median)
self.w_rv = widgets.Output()
plt.close()
with self.w_rv:
self.w_rv.clear_output()
self.fig, axes = plt.subplots(nrows=4, ncols=1)
self.ax1, self.ax2, self.ax3, self.ax4 = axes
self.ax1.set_title("Log Normal Distribution")
self.ax2.set_title("Statistics")
self.ax3.set_title("Stat_Variation_n")
self.ax3.set_title("Stat_Variation_iter")
self.fig.canvas.toolbar_visible = False
self.fig.canvas.footer_visible = False
#             self.fig.canvas.layout.width = '100%'
#             self.fig.canvas.layout.height = '6in'
#             self.fig.canvas.layout.width = '6in'
self.fig.canvas.resizable = True
self.fig.canvas.capture_scroll = True
self.fig.tight_layout()

self.range_store = {}
self.w_range_min = widgets.FloatText(value=0, description='r_min',disabled=False)
self.w_range_max = widgets.FloatText(value=100, description='r_max',disabled=False)
self.w_range_reinit = widgets.Button(disabled=False, description='ReInitialize')
self.w_rv.layout.display = 'flex-grow'
self.xs = xs
self.sample_d = None
self.r_min = None
self.r_max = None
self.df = pd.DataFrame(columns=['n', 'iters', 'kind','mean_d','std_d'])
self.update_rv()

#         self.update_rv_plot()
#         self.update_stat_plot()

def init_kind_range_store(self, kind, r_min, r_max):
self.update_range_store(kind, r_min, r_max)
self.w_range_min.value = r_min
self.w_range_max.value = r_max

def update_range_store(self, kind, r_min, r_max):
self.range_store[kind] = (r_min, r_max)

self.w_shape.observe(self.handle_shape_change, "value")
self.w_loc.observe(self.handle_loc_change, "value")
self.w_median.observe(self.handle_median_change,'value')
self.w_n.observe(self.handle_n_change, 'value')
self.w_iters.observe(self.handle_iters_change, 'value')
self.w_kind.observe(self.handle_kind_change, 'value')
self.w_range_min.observe(self.handle_w_range_min_change, 'value')
self.w_range_max.observe(self.handle_w_range_max_change, 'value')
self.w_range_reinit.on_click(self.handle_click)

#         plt.show()

def handle_click(self, b):
mean = self.sample_d.mean()
std = self.sample_d.std()
self.r_min = np.floor(mean-4*std)
self.r_max = np.ceil(mean+4*std)
self.init_kind_range_store(self.kind, self.r_min, self.r_max)
plt.show()

def reset(self):
self.update_range_store(self.kind, self.r_min, self.r_max)
self.ax2.set_xlim(self.r_min, self.r_max)
plt.show()
#         self.fig.canvas.draw()
#         plt.draw()

def handle_w_range_max_change(self, change):
self.r_min = self.w_range_min.value
self.r_max = change.new
self.reset()

def handle_w_range_min_change(self, change):
self.r_min = change.new
self.r_max = self.w_range_max.value
self.reset()

#         plt.show()

def handle_shape_change(self, change):
self.shape = change.new
self.update_rv()

def handle_loc_change(self, change):
self.loc = change.new
self.update_rv()

def handle_median_change(self, change):
self.median = change.new
self.update_rv()

def handle_n_change(self, change):
self.n = change.new
self.update_stat_plot()
self.update_var_plot()

def handle_iters_change(self, change):
self.iters = change.new
self.update_stat_plot()
self.update_var_plot()

def handle_kind_change(self, change):
self.kind = change.new
self.update_stat_plot()
self.ax3.clear()
self.update_var_plot()

def update_rv_plot(self):
#         self.ax1.set_xlim([np.random.randint(100),100])
self.ax1.clear()
self.ax1.set_title("Log Normal Distribution")
self.ys = self.rv.pdf(self.xs)
self.ax1.plot(self.xs, self.ys, linewidth=4, color=COLOR1)
update_text(self.ax1, 0.97, 0.95,  f"$\sigma$={self.shape}")
update_text(self.ax1, 0.97, 0.85,  f"$loc$={self.loc}")
update_text(self.ax1, 0.97, 0.75,  f"$\mu$={self.median}")
plt.show()

def update_stat_plot(self):
self.ax2.clear()
self.ax2.set_title(f"Statistics[{self.w_kind.value}]")
self.sample_d = computing_sampling_distribution(self.rv, self.w_n.value,
self.w_iters.value,
kind=self.w_kind.value)
mean = self.sample_d.mean()
std = self.sample_d.std()
self.r_min = np.floor(mean-4*std)
self.r_max = np.ceil(mean+4*std)
a,b = np.percentile(self.sample_d,[10, 90])
self.ax2.hist(self.sample_d, bins=30, color=COLOR2)
self.ax2.axvline(mean, color=COLOR4)
self.ax2.axvline(mean -3*std, color=COLOR5)
self.ax2.axvline(mean +3*std, color=COLOR5)
self.ax2.axvline(a, color=COLOR3)
self.ax2.axvline(b, color=COLOR3)
self.df.loc[self.df.shape[0]] = [self.w_n.value, self.w_iters.value, self.w_kind.value, mean, std]
if self.kind in self.range_store:
self.ax2.set_xlim(*self.range_store[self.kind])
pass
else:
self.init_kind_range_store(self.kind, self.r_min, self.r_max)
self.ax2.set_xlim(self.r_min, self.r_max)
update_text(self.ax2, 0.03, 0.95,  f"$\sigma_d$={std:0.2f}", align='left')
update_text(self.ax2, 0.03, 0.85,  f"$\mu_d$={mean:0.2f}", align='left')
update_text(self.ax2, 0.97, 0.95,  f"$p10_d$={a:0.2f}")
update_text(self.ax2, 0.97, 0.85,  f"$p90_d$={b:0.2f}")
plt.show()

def update_var_plot(self):
#         self.df.plot.scatter('n', 'value', ax=self.ax3)
sel = self.df[self.df.kind==self.kind]
sel.plot.scatter(x='n', y='mean_d', s=sel['std_d']*200, ax=self.ax3, logx=True)
self.ax3.set_ylim(*self.range_store[self.kind])
sel.plot.scatter(x='iters', y='mean_d', s=sel['std_d']*200, ax=self.ax4, logx=True)
self.ax4.set_ylim(*self.range_store[self.kind])
#         sns.scatterplot()

plt.show()

def update_rv(self):
self.rv = stats.lognorm(self.shape, self.loc, self.median)
#         self.w_rv.clear_output()
self.update_rv_plot()
self.update_stat_plot()
self.update_var_plot()

def view(self):
#         with self.w_rv:
#             display(self.ax.figure)
self.widget_setter_label = widgets.Label("Parameters", position='center')
self.widget_setter = widgets.VBox([self.widget_setter_label,
widgets.HBox([self.w_shape, self.w_loc, self.w_median])])
self.widget_simcontrol = widgets.HBox([self.w_n, self.w_iters, self.w_kind])
self.range_control = widgets.HBox([self.w_range_min, self.w_range_max, self.w_range_reinit])
ui = widgets.VBox([self.widget_setter,self.widget_simcontrol,self.range_control, self.w_rv])
return ui

lnv = LogNormVisualizer(0.23, 0, 70.8)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-34708152cec1> in <module>
----> 1 lnv = LogNormVisualizer(0.23, 0, 70.8)

<ipython-input-25-7235e5af6dae> in __init__(self, shape, loc, median, xs)
93         self.r_max = None
94         self.df = pd.DataFrame(columns=['n', 'iters', 'kind','mean_d','std_d'])
---> 95         self.update_rv()
97

<ipython-input-25-7235e5af6dae> in update_rv(self)
233 #         self.w_rv.clear_output()
234         self.update_rv_plot()
--> 235         self.update_stat_plot()
236         self.update_var_plot()
237

<ipython-input-25-7235e5af6dae> in update_stat_plot(self)
190         self.ax2.clear()
191         self.ax2.set_title(f"Statistics[{self.w_kind.value}]")
--> 192         self.sample_d = computing_sampling_distribution(self.rv, self.w_n.value,
193                                                        self.w_iters.value,
194                                                        kind=self.w_kind.value)

TypeError: computing_sampling_distribution() got an unexpected keyword argument 'kind'

lnv.rv.mean(), lnv.rv.std()

# df = lnv.df

# df.loc[df.shape[0]] = [1,2,3,4]; df

lnv.view()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-73bcf06dbfee> in <module>
----> 1 lnv.view()

NameError: name 'lnv' is not defined

# lnv.df[lnv.df.kind=='mean'].plot.scatter(x='n',y='std_d',  ax=lnv.ax3)
lnv.df[lnv.df.kind=='mean'].plot.scatter(x='n',y='std_d', s='mean_d', ax=lnv.ax3)
# lnv.df

np.percentile?

# plt.xlim?

w = widgets.IntText()
l = widgets.IntText()

t = 2
def handle_change(change):
t = change.new
l.value = t

w.observe(handle_change, "value")

widgets.VBox([w,l])

# widgets.Label?

# display?


#### Estimating PI Computationally#

• Area of circle => $$\pi r^2$$

• Area of square of size 2r => 4r^2

rv = stats.uniform(-1,2)

%matplotlib inline
plt.plot(np.linspace(1,10,10), rv.rvs(10))
plt.show()

@interact
def estimate_pi(n=(1,1000000)):
x = stats.uniform(-1,2)
y = stats.uniform(-1,2)
dist = np.sqrt( x.rvs(n)**2+y.rvs(n)**2)
#     print(dist[:5])
in_circle = dist <= 1
pi = sum(in_circle)*4/n
return pi
#     print(f"Estimated PI {pi} with n:{n}")

s = [estimate_pi(10000) for trials in range(1000)]

sample = np.array(s)
plt.hist(sample, bins=40)
plt.show()

sample.mean(), sample.std(), np.percentile(sample, [5,95])

(3.1407156, 0.017060112445115943, array([3.1116 , 3.16882]))


## Part 2#

What if we don’t know the underlying assumption?

• Take the Sample

• Generate Model for Population from Sample

• Calculate Sampling Statistics by running experiments on this model

sample = [1,2,3,4,5,2,1 ,4]

n = len(sample)

np.random.choice(sample,
n,
replace=True) # New sample from original sample , after this you can run experiment and calculate sampling statistics

array([4, 1, 1, 3, 2, 5, 5, 5])


### Resampling#

class Resampler(object):
def __init__(self, sample, xlim=None, iters=1000):
self.sample = sample
self.n = len(sample)
self.iters = iters

def resample(self):
new_sample = np.random.choice(self.sample, self.n, replace=True)
return new_sample

def sample_stat(self, sample):
return sample.mean()

def computing_sampling_distribution(self):
stats = [self.sample_stat(self.resample()) for i in range(self.iters)]
return np.array(stats)

def plot_summary_statistics_distribution(self):
fig, ax = plt.subplots(1)
stats = self.computing_sampling_distribution()

mean = stats.mean()
SE = stats.std()

conf_int = np.percentile(stats, [2.5, 97.5]) # 95% confidence interval
plt.axvline(mean, color=COLOR1)
plt.axvline(conf_int[0], color=COLOR1)
plt.axvline(conf_int[1], color=COLOR1)
#         plt.xlim(mean-4std, )
plt.hist(stats, color=COLOR4)
plt.show()

# s = np.random.random([1, 100]).flatten()
s = weight.rvs(10)
rsample = Resampler(s, iters=1000)
rsample.sample.shape

(10,)

# stat = rsample.computing_sampling_distribution()
# stat.shape

rsample.plot_summary_statistics_distribution()

 np.random.choice([1,2,3],3, replace=True)

array([3, 1, 3])

plt.plot(s)

[<matplotlib.lines.Line2D at 0x7feaabb10430>]

xs = np.linspace(20, 160, 100)
plt.plot(weight.pdf(xs))

[<matplotlib.lines.Line2D at 0x7feaabc22d60>]

type(weight)

scipy.stats._distn_infrastructure.rv_frozen

type(sample)

list

class StdResampler(Resampler):
def sample_stat(self, sample):
return sample.std()

s = weight.rvs(1000)
rsample = StdResampler(s, iters=1000)
rsample.sample.shape

(1000,)

rsample.plot_summary_statistics_distribution()


## Part 3#

In this section we will implement above for multiple group problems

mu1, sig1 = 178, 7.7
male_height = stats.norm(mu1, sig1); male_height

<scipy.stats._distn_infrastructure.rv_frozen at 0x7feaab9c58e0>

mu2, sig2 = 163, 7.3
female_height = stats.norm(mu2, sig2); female_height

<scipy.stats._distn_infrastructure.rv_frozen at 0x7feaab9ec4c0>

n = 1000
male_sample = male_height.rvs(1000)
female_sample = female_height.rvs(1000)
male_mean = male_height.mean()
female_mean = female_height.mean()
male_std = male_height.std()
female_std = female_height.std()

difference = (mu1 - mu2)

relative_difference_by_male = difference/mu1*100
relative_difference_by_female = difference/mu2*100
relative_difference_by_male, relative_difference_by_female

(8.426966292134832, 9.202453987730062)

thresh = (male_mean*female_std+female_mean*male_std)/(male_std+female_std); thresh

170.3

male_overlap = sum((male_sample < thresh))/ len(male_sample)
male_overlap

0.163

female_overlap = sum((female_sample > thresh))/ len(female_sample);
female_overlap

0.139

overlap = (male_overlap + female_overlap)/2; overlap

0.15100000000000002

prob_superiority = sum(male_sample > female_sample)/(len(male_sample)+len(female_sample))

prob_superiority = (male_sample > female_sample).mean()
prob_superiority

0.929

def overlap(grp1_sample, grp2_sample):
"""
grp1: Control
grp2: Treatment
"""

#     grp1_sample = grp1.rvs(1000)
#     grp2_sample = grp2.rvs(1000)
m1 = grp1_sample.mean()  #grp1.mean()
m2 = grp2_sample.mean()  #grp2.mean()
s1 = grp1_sample.std()  #grp1.std()
s2 = grp2_sample.std()  #grp1.std()
thresh = (m1*s2+m2*s1)/(s1+s2)

grp1_overlap = sum(grp1_sample<thresh)/ len(grp1_sample)
grp2_overlap = sum(grp2_sample>thresh)/ len(grp2_sample)
misclassification_rate = (grp1_overlap+grp2_overlap)/2
return misclassification_rate


grp1_sample = male_height.rvs(1000)
grp2_sample = female_height.rvs(1000)

overlap(grp1_sample, grp2_sample)

0.166

def prob_superior(grp1_sample, grp2_sample):
# Assumes same size
assert len(grp1_sample) == len(grp2_sample)
return(grp1_sample>grp2_sample).mean()

prob_superior(grp1_sample, grp2_sample)

0.932

grp1_sample.var()

58.99469052969996

def cohen_effect(grp1_sample, grp2_sample):
diff = grp1_sample.mean() - grp2_sample.mean()
var1, var2 = grp1_sample.var(), grp2_sample.var()
n1, n2 = len(grp1_sample), len(grp2_sample)
pooled_var = (n1*var1+n2*var2)/(n1+n2)
d = diff/np.sqrt(pooled_var)
return d

cohen_effect(grp1_sample, grp2_sample)

1.9942476570475194

def color_data(inp, CI):
a, b = CI
t1 = inp< a
t2= inp>b
color_data = (t1+t2).astype(np.int)
return color_data

class MultiGrpResampler(object):
def __init__(self, sample1, sample2, xlim=None, iters=1000, summary_stat='cohen'):
self.sample1 = sample1
self.sample2 = sample2
self.xlim = xlim
self.iters = iters
self.summary_stat = summary_stat

def resample(self):
new_sample1 = np.random.choice(self.sample1, len(self.sample1), replace=True)
new_sample2 = np.random.choice(self.sample2, len(self.sample2), replace=True)
return new_sample1, new_sample2

def calc_summary_stat(self, sample1, sample2):
if self.summary_stat == 'cohen':
return cohen_effect(sample1, sample2)
if self.summary_stat == 'overlap':
return overlap(sample1, sample2)
if self.summary_stat == 'prob_superiority':
return prob_superior(sample1, sample2)

def compute_sampling_distribution(self):
summary_stats = [self.calc_summary_stat(*self.resample()) for i in range(self.iters)]
return np.array(summary_stats)

def sampling_distribution(self):
summary_stats = self.compute_sampling_distribution()

mean = summary_stats.mean()
std = summary_stats.std()
CI = np.percentile(summary_stats, [5, 95]) # 90% confidence interval

return  summary_stats, mean, std, CI

def plot_sampling_distribution(self):
summary_stats, mean, std, CI = self.sampling_distribution()
bins = 30
#         hist_data = np.histogram(summary_stats, bins)[1]
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
y,edges = np.histogram(summary_stats, bins)
centers = 0.5*(edges[1:]+ edges[:-1])
cdata = np.array(col_sc.colors)[color_data(centers, CI)].tolist()
ax_x = Axis(scale=x_sc, tick_format='0.3f')
ax_y = Axis(scale=y_sc, orientation='vertical')
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=bins, colors=cdata)
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0,
title=f'Sampling Distribution {self.summary_stat}' )
#         print(fig.marks[0].colors)

return fig


sample1 = male_height.rvs(1000)
sample2 = female_height.rvs(1000)

rsampl = MultiGrpResampler(sample1, sample2,)
summary_stats, mean, std, CI = rsampl.sampling_distribution()
mean, std, CI

(1.988103230980105, 0.05500478400582647, array([1.89740944, 2.07562714]))

def color_data(inp, CI):
a, b = CI
t1 = inp< a
t2= inp>b
color_data = (t1+t2).astype(np.int)
return color_data

x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])

# hist = Hist(data=summary_stats, scales={'summary_statistics': x_sc, 'count':y_sc})

hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc})

len(hist.midpoints)

0

ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
ax_x

Axis(scale=LinearScale(), tick_format='0.2f')

color_data(hist.midpoints, CI)

array([], dtype=int64)

np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()

[]

hist.bins = 30
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
hist.colors

[]

# Figure(marks=[hist], axes=[ax_x, ax_y], padding=0, title='Sampling Distribution' )

vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_mean, vline_a, vline_b

(Lines(colors=['orangered'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.12200944, 2.12200944]), y=array([0, 1])),
Lines(colors=['steelblue'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.02652263, 2.02652263]), y=array([0, 1])),
Lines(colors=['steelblue'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.21439821, 2.21439821]), y=array([0, 1])))

Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, title='Sampling Distribution 2' )


### Function Call#

bins=30
y,edges = np.histogram(summary_stats, bins)
centers = 0.5*(edges[1:]+ edges[:-1])
cdata = np.array(col_sc.colors)[color_data(centers, CI)].tolist()
cdata

['Red',
'Red',
'Red',
'Red',
'Red',
'Red',
'Red',
'Red',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'MediumSeaGreen',
'Red',
'Red',
'Red',
'Red',
'Red',
'Red',
'Red',
'Red']

n = 100000
sample1 = male_height.rvs(n)
sample2 = female_height.rvs(1000)

rsampl = MultiGrpResampler(sample1, sample2,summary_stat='cohen')
fig = rsampl.plot_sampling_distribution()
# fig.marks[0].colors = cdata
fig

fig.marks[0]

Hist(bins=30, colors=['Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red'], interactions={'hover': 'tooltip'}, scales={'sample': LinearScale(), 'count': LinearScale(), 'color': ColorScale(colors=['MediumSeaGreen', 'Red'])}, scales_metadata={'sample': {'orientation': 'horizontal', 'dimension': 'x'}, 'count': {'orientation': 'vertical', 'dimension': 'y'}}, tooltip_style={'opacity': 0.9})

fig.marks[0].bins

30

fig

summary_stat = 'cohen'
summary_stats, mean, std, CI = rsampl.sampling_distribution()
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=30)

hist.colors

['steelblue']

vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0,
title=f'Sampling Distribution {summary_stat}' )

fig

hist.bins = 30
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
hist.colors

[]

fig

summary_stat = 'cohen'
summary_stats, mean, std, CI = rsampl.sampling_distribution()
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=30)
# hist.bins = 30
with hist.hold_sync():
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0,
title=f'Sampling Distribution {summary_stat}' )
fig

np.histogram([12,3,4,5, 19], bins=3)

(array([3, 1, 1]), array([ 3.        ,  8.33333333, 13.66666667, 19.        ]))

np.histogram?