Sampling#

Synopsis#

  • Effect on proportion

    • Percentages

      • Bad as percentage can have different intuition for different ranges

      • Difference is subjective

    • Odds Ratio

    • LOR

      • Most appropriate (symmetric)

  • Choosing Effect Size indicator is driven by context and familiarity. Use something which makes sense to individuals/ audience

  • Precision

    • How precise is data?

    • Sources of Errors

      • Sampling Errors

        • Is my sample representative of population?

      • Measurement Errors

        • UOM

        • Improper Recording

      • Random Errors

        • Least important

        • Only one with mathematical model behind it, so it’s used all the time in statistics classes

Imports#

import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib as mpl
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
import pandas as pd
from IPython.display import display, display_html
import bqplot

from bqplot import LinearScale, Hist, Figure, Axis, ColorScale

from bqplot import pyplot as pltbq
# import seaborn as sns
# seed the random number generator so we all get the same results
np.random.seed(17)

# some nice colors from http://colorbrewer2.org/
COLOR1 = '#7fc97f'
COLOR2 = '#beaed4'
COLOR3 = '#fdc086'
COLOR4 = '#ffff99'
COLOR5 = '#386cb0'

mpl.rcParams['figure.figsize'] = (8.0, 9.0)
%matplotlib inline

Part 1#

  • Estimate Avg. weight of man & woman in US

  • Quantify uncertainity in estimate

  • Approach =>

    • Simulate many exp

    • Compare how results vary from one experiment to another

  • Start by assuming distribution and move on show how to eliminate this assumption (solve without it)

#Weight of woman(in kg)

weight = stats.lognorm(0.23, 0, 70.8)
weight.mean(), weight.std()
(72.69764573296688, 16.944043048498038)
help(stats.lognorm)
Help on lognorm_gen in module scipy.stats._continuous_distns object:

class lognorm_gen(scipy.stats._distn_infrastructure.rv_continuous)
 |  lognorm_gen(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
 |  
 |  A lognormal continuous random variable.
 |  
 |  %(before_notes)s
 |  
 |  Notes
 |  -----
 |  The probability density function for `lognorm` is:
 |  
 |  .. math::
 |  
 |      f(x, s) = \frac{1}{s x \sqrt{2\pi}}
 |                \exp\left(-\frac{\log^2(x)}{2s^2}\right)
 |  
 |  for :math:`x > 0`, :math:`s > 0`.
 |  
 |  `lognorm` takes ``s`` as a shape parameter for :math:`s`.
 |  
 |  %(after_notes)s
 |  
 |  A common parametrization for a lognormal random variable ``Y`` is in
 |  terms of the mean, ``mu``, and standard deviation, ``sigma``, of the
 |  unique normally distributed random variable ``X`` such that exp(X) = Y.
 |  This parametrization corresponds to setting ``s = sigma`` and ``scale =
 |  exp(mu)``.
 |  
 |  %(example)s
 |  
 |  Method resolution order:
 |      lognorm_gen
 |      scipy.stats._distn_infrastructure.rv_continuous
 |      scipy.stats._distn_infrastructure.rv_generic
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  fit(self, data, *args, **kwds)
 |      Return MLEs for shape (if applicable), location, and scale
 |      parameters from data.
 |      
 |      MLE stands for Maximum Likelihood Estimate.  Starting estimates for
 |      the fit are given by input arguments; for any arguments not provided
 |      with starting estimates, ``self._fitstart(data)`` is called to generate
 |      such.
 |      
 |      One can hold some parameters fixed to specific values by passing in
 |      keyword arguments ``f0``, ``f1``, ..., ``fn`` (for shape parameters)
 |      and ``floc`` and ``fscale`` (for location and scale parameters,
 |      respectively).
 |      
 |      Parameters
 |      ----------
 |      data : array_like
 |          Data to use in calculating the MLEs.
 |      arg1, arg2, arg3,... : floats, optional
 |          Starting value(s) for any shape-characterizing arguments (those not
 |          provided will be determined by a call to ``_fitstart(data)``).
 |          No default value.
 |      kwds : floats, optional
 |          - `loc`: initial guess of the distribution's location parameter.
 |          - `scale`: initial guess of the distribution's scale parameter.
 |      
 |          Special keyword arguments are recognized as holding certain
 |          parameters fixed:
 |      
 |          - f0...fn : hold respective shape parameters fixed.
 |            Alternatively, shape parameters to fix can be specified by name.
 |            For example, if ``self.shapes == "a, b"``, ``fa`` and ``fix_a``
 |            are equivalent to ``f0``, and ``fb`` and ``fix_b`` are
 |            equivalent to ``f1``.
 |      
 |          - floc : hold location parameter fixed to specified value.
 |      
 |          - fscale : hold scale parameter fixed to specified value.
 |      
 |          - optimizer : The optimizer to use.  The optimizer must take ``func``,
 |            and starting position as the first two arguments,
 |            plus ``args`` (for extra arguments to pass to the
 |            function to be optimized) and ``disp=0`` to suppress
 |            output as keyword arguments.
 |      
 |      Returns
 |      -------
 |      mle_tuple : tuple of floats
 |          MLEs for any shape parameters (if applicable), followed by those
 |          for location and scale. For most random variables, shape statistics
 |          will be returned, but there are exceptions (e.g. ``norm``).
 |      
 |      Notes
 |      -----
 |      This fit is computed by maximizing a log-likelihood function, with
 |      penalty applied for samples outside of range of the distribution. The
 |      returned answer is not guaranteed to be the globally optimal MLE, it
 |      may only be locally optimal, or the optimization may fail altogether.
 |      If the data contain any of np.nan, np.inf, or -np.inf, the fit routine
 |      will throw a RuntimeError.
 |      
 |      When the location parameter is fixed by using the `floc` argument,
 |      this function uses explicit formulas for the maximum likelihood
 |      estimation of the log-normal shape and scale parameters, so the
 |      `optimizer`, `loc` and `scale` keyword arguments are ignored.
 |      
 |      Examples
 |      --------
 |      
 |      Generate some data to fit: draw random variates from the `beta`
 |      distribution
 |      
 |      >>> from scipy.stats import beta
 |      >>> a, b = 1., 2.
 |      >>> x = beta.rvs(a, b, size=1000)
 |      
 |      Now we can fit all four parameters (``a``, ``b``, ``loc`` and ``scale``):
 |      
 |      >>> a1, b1, loc1, scale1 = beta.fit(x)
 |      
 |      We can also use some prior knowledge about the dataset: let's keep
 |      ``loc`` and ``scale`` fixed:
 |      
 |      >>> a1, b1, loc1, scale1 = beta.fit(x, floc=0, fscale=1)
 |      >>> loc1, scale1
 |      (0, 1)
 |      
 |      We can also keep shape parameters fixed by using ``f``-keywords. To
 |      keep the zero-th shape parameter ``a`` equal 1, use ``f0=1`` or,
 |      equivalently, ``fa=1``:
 |      
 |      >>> a1, b1, loc1, scale1 = beta.fit(x, fa=1, floc=0, fscale=1)
 |      >>> a1
 |      1
 |      
 |      Not all distributions return estimates for the shape parameters.
 |      ``norm`` for example just returns estimates for location and scale:
 |      
 |      >>> from scipy.stats import norm
 |      >>> x = norm.rvs(a, b, size=1000, random_state=123)
 |      >>> loc1, scale1 = norm.fit(x)
 |      >>> loc1, scale1
 |      (0.92087172783841631, 2.0015750750324668)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from scipy.stats._distn_infrastructure.rv_continuous:
 |  
 |  __init__(self, momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  cdf(self, x, *args, **kwds)
 |      Cumulative distribution function of the given RV.
 |      
 |      Parameters
 |      ----------
 |      x : array_like
 |          quantiles
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      cdf : ndarray
 |          Cumulative distribution function evaluated at `x`
 |  
 |  expect(self, func=None, args=(), loc=0, scale=1, lb=None, ub=None, conditional=False, **kwds)
 |      Calculate expected value of a function with respect to the
 |      distribution by numerical integration.
 |      
 |      The expected value of a function ``f(x)`` with respect to a
 |      distribution ``dist`` is defined as::
 |      
 |                  ub
 |          E[f(x)] = Integral(f(x) * dist.pdf(x)),
 |                  lb
 |      
 |      where ``ub`` and ``lb`` are arguments and ``x`` has the ``dist.pdf(x)``
 |      distribution. If the bounds ``lb`` and ``ub`` correspond to the
 |      support of the distribution, e.g. ``[-inf, inf]`` in the default
 |      case, then the integral is the unrestricted expectation of ``f(x)``.
 |      Also, the function ``f(x)`` may be defined such that ``f(x)`` is ``0``
 |      outside a finite interval in which case the expectation is
 |      calculated within the finite range ``[lb, ub]``.
 |      
 |      Parameters
 |      ----------
 |      func : callable, optional
 |          Function for which integral is calculated. Takes only one argument.
 |          The default is the identity mapping f(x) = x.
 |      args : tuple, optional
 |          Shape parameters of the distribution.
 |      loc : float, optional
 |          Location parameter (default=0).
 |      scale : float, optional
 |          Scale parameter (default=1).
 |      lb, ub : scalar, optional
 |          Lower and upper bound for integration. Default is set to the
 |          support of the distribution.
 |      conditional : bool, optional
 |          If True, the integral is corrected by the conditional probability
 |          of the integration interval.  The return value is the expectation
 |          of the function, conditional on being in the given interval.
 |          Default is False.
 |      
 |      Additional keyword arguments are passed to the integration routine.
 |      
 |      Returns
 |      -------
 |      expect : float
 |          The calculated expected value.
 |      
 |      Notes
 |      -----
 |      The integration behavior of this function is inherited from
 |      `scipy.integrate.quad`. Neither this function nor
 |      `scipy.integrate.quad` can verify whether the integral exists or is
 |      finite. For example ``cauchy(0).mean()`` returns ``np.nan`` and
 |      ``cauchy(0).expect()`` returns ``0.0``.
 |      
 |      The function is not vectorized.
 |      
 |      Examples
 |      --------
 |      
 |      To understand the effect of the bounds of integration consider
 |      
 |      >>> from scipy.stats import expon
 |      >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0)
 |      0.6321205588285578
 |      
 |      This is close to
 |      
 |      >>> expon(1).cdf(2.0) - expon(1).cdf(0.0)
 |      0.6321205588285577
 |      
 |      If ``conditional=True``
 |      
 |      >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0, conditional=True)
 |      1.0000000000000002
 |      
 |      The slight deviation from 1 is due to numerical integration.
 |  
 |  fit_loc_scale(self, data, *args)
 |      Estimate loc and scale parameters from data using 1st and 2nd moments.
 |      
 |      Parameters
 |      ----------
 |      data : array_like
 |          Data to fit.
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information).
 |      
 |      Returns
 |      -------
 |      Lhat : float
 |          Estimated location parameter for the data.
 |      Shat : float
 |          Estimated scale parameter for the data.
 |  
 |  isf(self, q, *args, **kwds)
 |      Inverse survival function (inverse of `sf`) at q of the given RV.
 |      
 |      Parameters
 |      ----------
 |      q : array_like
 |          upper tail probability
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      x : ndarray or scalar
 |          Quantile corresponding to the upper tail probability q.
 |  
 |  logcdf(self, x, *args, **kwds)
 |      Log of the cumulative distribution function at x of the given RV.
 |      
 |      Parameters
 |      ----------
 |      x : array_like
 |          quantiles
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      logcdf : array_like
 |          Log of the cumulative distribution function evaluated at x
 |  
 |  logpdf(self, x, *args, **kwds)
 |      Log of the probability density function at x of the given RV.
 |      
 |      This uses a more numerically accurate calculation if available.
 |      
 |      Parameters
 |      ----------
 |      x : array_like
 |          quantiles
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      logpdf : array_like
 |          Log of the probability density function evaluated at x
 |  
 |  logsf(self, x, *args, **kwds)
 |      Log of the survival function of the given RV.
 |      
 |      Returns the log of the "survival function," defined as (1 - `cdf`),
 |      evaluated at `x`.
 |      
 |      Parameters
 |      ----------
 |      x : array_like
 |          quantiles
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      logsf : ndarray
 |          Log of the survival function evaluated at `x`.
 |  
 |  nnlf(self, theta, x)
 |      Return negative loglikelihood function.
 |      
 |      Notes
 |      -----
 |      This is ``-sum(log pdf(x, theta), axis=0)`` where `theta` are the
 |      parameters (including loc and scale).
 |  
 |  pdf(self, x, *args, **kwds)
 |      Probability density function at x of the given RV.
 |      
 |      Parameters
 |      ----------
 |      x : array_like
 |          quantiles
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      pdf : ndarray
 |          Probability density function evaluated at x
 |  
 |  ppf(self, q, *args, **kwds)
 |      Percent point function (inverse of `cdf`) at q of the given RV.
 |      
 |      Parameters
 |      ----------
 |      q : array_like
 |          lower tail probability
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      x : array_like
 |          quantile corresponding to the lower tail probability q.
 |  
 |  sf(self, x, *args, **kwds)
 |      Survival function (1 - `cdf`) at x of the given RV.
 |      
 |      Parameters
 |      ----------
 |      x : array_like
 |          quantiles
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      sf : array_like
 |          Survival function evaluated at x
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from scipy.stats._distn_infrastructure.rv_generic:
 |  
 |  __call__(self, *args, **kwds)
 |      Freeze the distribution for the given arguments.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution.  Should include all
 |          the non-optional arguments, may include ``loc`` and ``scale``.
 |      
 |      Returns
 |      -------
 |      rv_frozen : rv_frozen instance
 |          The frozen distribution.
 |  
 |  __getstate__(self)
 |  
 |  __setstate__(self, state)
 |  
 |  entropy(self, *args, **kwds)
 |      Differential entropy of the RV.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information).
 |      loc : array_like, optional
 |          Location parameter (default=0).
 |      scale : array_like, optional  (continuous distributions only).
 |          Scale parameter (default=1).
 |      
 |      Notes
 |      -----
 |      Entropy is defined base `e`:
 |      
 |      >>> drv = rv_discrete(values=((0, 1), (0.5, 0.5)))
 |      >>> np.allclose(drv.entropy(), np.log(2.0))
 |      True
 |  
 |  freeze(self, *args, **kwds)
 |      Freeze the distribution for the given arguments.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution.  Should include all
 |          the non-optional arguments, may include ``loc`` and ``scale``.
 |      
 |      Returns
 |      -------
 |      rv_frozen : rv_frozen instance
 |          The frozen distribution.
 |  
 |  interval(self, alpha, *args, **kwds)
 |      Confidence interval with equal areas around the median.
 |      
 |      Parameters
 |      ----------
 |      alpha : array_like of float
 |          Probability that an rv will be drawn from the returned range.
 |          Each value should be in the range [0, 1].
 |      arg1, arg2, ... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information).
 |      loc : array_like, optional
 |          location parameter, Default is 0.
 |      scale : array_like, optional
 |          scale parameter, Default is 1.
 |      
 |      Returns
 |      -------
 |      a, b : ndarray of float
 |          end-points of range that contain ``100 * alpha %`` of the rv's
 |          possible values.
 |  
 |  mean(self, *args, **kwds)
 |      Mean of the distribution.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      mean : float
 |          the mean of the distribution
 |  
 |  median(self, *args, **kwds)
 |      Median of the distribution.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          Location parameter, Default is 0.
 |      scale : array_like, optional
 |          Scale parameter, Default is 1.
 |      
 |      Returns
 |      -------
 |      median : float
 |          The median of the distribution.
 |      
 |      See Also
 |      --------
 |      rv_discrete.ppf
 |          Inverse of the CDF
 |  
 |  moment(self, n, *args, **kwds)
 |      n-th order non-central moment of distribution.
 |      
 |      Parameters
 |      ----------
 |      n : int, n >= 1
 |          Order of moment.
 |      arg1, arg2, arg3,... : float
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information).
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |  
 |  rvs(self, *args, **kwds)
 |      Random variates of given type.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information).
 |      loc : array_like, optional
 |          Location parameter (default=0).
 |      scale : array_like, optional
 |          Scale parameter (default=1).
 |      size : int or tuple of ints, optional
 |          Defining number of random variates (default is 1).
 |      random_state : {None, int, `~np.random.RandomState`, `~np.random.Generator`}, optional
 |          If `seed` is `None` the `~np.random.RandomState` singleton is used.
 |          If `seed` is an int, a new ``RandomState`` instance is used, seeded
 |          with seed.
 |          If `seed` is already a ``RandomState`` or ``Generator`` instance,
 |          then that object is used.
 |          Default is None.
 |      
 |      Returns
 |      -------
 |      rvs : ndarray or scalar
 |          Random variates of given `size`.
 |  
 |  stats(self, *args, **kwds)
 |      Some statistics of the given RV.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional (continuous RVs only)
 |          scale parameter (default=1)
 |      moments : str, optional
 |          composed of letters ['mvsk'] defining which moments to compute:
 |          'm' = mean,
 |          'v' = variance,
 |          's' = (Fisher's) skew,
 |          'k' = (Fisher's) kurtosis.
 |          (default is 'mv')
 |      
 |      Returns
 |      -------
 |      stats : sequence
 |          of requested moments.
 |  
 |  std(self, *args, **kwds)
 |      Standard deviation of the distribution.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      std : float
 |          standard deviation of the distribution
 |  
 |  support(self, *args, **kwargs)
 |      Return the support of the distribution.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, ... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information).
 |      loc : array_like, optional
 |          location parameter, Default is 0.
 |      scale : array_like, optional
 |          scale parameter, Default is 1.
 |      Returns
 |      -------
 |      a, b : float
 |          end-points of the distribution's support.
 |  
 |  var(self, *args, **kwds)
 |      Variance of the distribution.
 |      
 |      Parameters
 |      ----------
 |      arg1, arg2, arg3,... : array_like
 |          The shape parameter(s) for the distribution (see docstring of the
 |          instance object for more information)
 |      loc : array_like, optional
 |          location parameter (default=0)
 |      scale : array_like, optional
 |          scale parameter (default=1)
 |      
 |      Returns
 |      -------
 |      var : float
 |          the variance of the distribution
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from scipy.stats._distn_infrastructure.rv_generic:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  random_state
 |      Get or set the RandomState object for generating random variates.
 |      
 |      This can be either None, int, a RandomState instance, or a
 |      np.random.Generator instance.
 |      
 |      If None (or np.random), use the RandomState singleton used by np.random.
 |      If already a RandomState or Generator instance, use it.
 |      If an int, use a new RandomState instance seeded with seed.

Note

Log normal distribution is specified by 3 parameters

  • \(\sigma\) - Shape parameter or Standard deviation of Distribution

  • \(m\) - Scale parameter (median) shrinking or stretching the graphs

  • \(\theta\) or (\(\mu\)) - Location parameter ; specifying where it is located in map

For log normal distribution. Check following links

rv = stats.lognorm(1, 0, 50) # Don't know shape , location , median
# xs = np.linspace(1, 100, 100)
ys = rv.rvs(10000)
plt.hist(ys)
(array([9.308e+03, 5.550e+02, 8.500e+01, 3.000e+01, 1.000e+01, 3.000e+00,
        4.000e+00, 3.000e+00, 0.000e+00, 2.000e+00]),
 array([1.12258114e+00, 2.17935158e+02, 4.34747735e+02, 6.51560312e+02,
        8.68372890e+02, 1.08518547e+03, 1.30199804e+03, 1.51881062e+03,
        1.73562320e+03, 1.95243578e+03, 2.16924835e+03]),
 <BarContainer object of 10 artists>)
../../_images/02_sampling_11_1.png
# For women weight
weight = stats.lognorm(0.23, 0, 70.8)
weight.mean(), weight.std()
(72.69764573296688, 16.944043048498038)
xs = np.linspace(20, 160, 100)
ys = weight.pdf(xs)

plt.plot(xs, ys, linewidth=4, color=COLOR1)
plt.xlabel('weight (kg)')
plt.ylabel('PDF')
plt.show()
../../_images/02_sampling_13_0.png
def make_sample(n=100):
    sample = weight.rvs(n)
    return sample
sample = make_sample(100)
sample.mean(), sample.std()
(73.04819598338332, 16.22111742103728)
def sample_stat(sample):
    return sample.mean()
sample_stat(sample)
73.04819598338332

Simulating experiment 1000 times

def computing_sampling_distribution(n=100, iters=1000):
    stats = [sample_stat(make_sample(n)) for i in range(iters)]
    return np.array(stats)
sample_means = computing_sampling_distribution(n=100, iters=1000)
plt.hist(sample_means, color=COLOR2)
plt.show()
../../_images/02_sampling_21_0.png
@interact
def sim(n=[10,20,50,100, 200, 1000, 10000], iters=[100, 1000, 10000]):
    sample_means = computing_sampling_distribution(n, iters)
    mean = sample_means.mean()
    std = sample_means.std()
    plt.hist(sample_means, bins=30, color=COLOR2)
    plt.axvline(mean, color=COLOR4)
    plt.axvline(mean -3*std, color=COLOR5)
    plt.axvline(mean +3*std, color=COLOR5)
    
    conf_int = np.percentile(sample_means, [2.5, 97.5]) # 95% confidence interval
    plt.axvline(conf_int[0], color=COLOR1)
    plt.axvline(conf_int[1], color=COLOR1)
    plt.ylabel("count")
    plt.xlabel(f"sample_means(n = {n})")
    plt.title(f" iters={iters},  $\mu_s$={mean:0.4f},  $\sigma_s$={std:0.4f}")
    plt.show()

Application for Sim Monitor#


caption = widgets.Label(value='The values of slider1 and slider2 are synchronized')
sliders1, slider2 = widgets.IntSlider(description='Slider 1'),\
                    widgets.IntSlider(description='Slider 2')
l = widgets.link((sliders1, 'value'), (slider2, 'value'))
display(caption, sliders1, slider2)

Simple plot Linked to output widget#

%matplotlib widget
widget_plot = widgets.Output()
with widget_plot:
    widget_plot.clear_output()
    ax.clear()
    fig, ax = plt.subplots(1,1)
#     display(ax.figure)
    plt.show()
widget_plot
ax.plot(range(1, 100))
fig.canvas.draw()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-203552d8af35> in <module>
----> 1 ax.plot(range(1, 100))
      2 fig.canvas.draw()

NameError: name 'ax' is not defined
ax.clear()
ax.plot(range(1, 20))
fig.canvas.draw()
# @interact
# def update_plot(n=2):
#     ax.clear()
#     ax.plot(np.linspace(1,100,100)**n)
#     fig.canvas.draw()
widget_plot
widget_pow = widgets.IntSlider(value=2,min=0, max=5)
widget_pow
def update_plot(n=2):
    ax.clear()
    ax.plot(np.linspace(1,100,100)**n)
    fig.canvas.draw()

def handle_change(change):
    update_plot(change.new)
widget_pow.observe(handle_change, "value")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-23-cc35b1b6381a> in <module>
----> 1 widget_pow.observe(handle_change, "value")

NameError: name 'widget_pow' is not defined
widgets.VBox([widget_plot, widget_pow])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-cb580b1392d7> in <module>
----> 1 widgets.VBox([widget_plot, widget_pow])

NameError: name 'widget_pow' is not defined

Lognorm Application#

widgets.SelectionSlider(
    options=[1,2, 5, 6, 10],
    value=1,
    description='n',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True
)
def sample_stat(sample, kind='mean'):
    if kind =='mean':
        return sample.mean()
    elif kind == 'coeff_variation':
        return sample.mean()/sample.std()
    elif kind == 'min':
        return sample.min()
    elif kind == 'max':
        return sample.max()
    elif kind == 'median':
        return np.percentile(sample, 50)
    elif kind == 'p10':
        return np.percentile(sample, 10) 
    elif kind == 'p90':
        return np.percentile(sample, 90)
    elif kind == 'IQR':
        return np.percentile(sample, 75) - np.percentile(sample, 25)
    
def make_sample(rv, n=100):
    sample = rv.rvs(n)
    return sample

def computing_sampling_distribution(rv, n=100, iters=1000, kind='mean'):
    stats = [sample_stat(make_sample(rv, n), kind) for i in range(iters)]
    return np.array(stats)
sample_stat(sample, kind='mean')
fig, axes = plt.subplots(nrows=2, ncols=2)
axes
ax0, ax1, ax2, ax3 = axes.flatten()
# def figure(figsize=None):
#     'Temporary workaround for traditional figure behaviour with the ipympl widget backend'
#     fig = plt.figure()
#     if figsize:
#         w, h =  figsize
#     else:
#         w, h = plt.rcParams['figure.figsize']
#     fig.canvas.layout.height = str(h) + 'in'
#     fig.canvas.layout.width = str(w) + 'in'
#     return fig


def update_text(ax, x, y, s, align='right'):
    ax.text(x, y, s,
            horizontalalignment=align,
            verticalalignment='top',
            transform=ax.transAxes)
    
# class KindRange(object):
#     def __init__(self, kind, min_d, max_d):
#         self.kind = kind
#         self.min_d = min_d
#         self.max_d = max_d

class LogNormVisualizer(object):
    
    def __init__(self, shape, loc, median, xs=np.linspace(20, 160, 100)):
        plt.close('all')
        self.shape = shape
        self.loc = loc
        self.median = median
        self.w_shape = widgets.FloatText(value=shape, description='shape',disabled=False)
        self.w_loc = widgets.FloatText(value=loc, description='loc',disabled=False)
        self.w_median = widgets.FloatText(value=median, description='median',disabled=False)
        self.w_n = widgets.SelectionSlider(
                                options=[10,20,50,100, 200, 1000, 10000],
                                value=10,
                                description='n',
                                disabled=False,
                                continuous_update=False,
                                orientation='horizontal',
                                readout=True
                            )
        
        self.w_kind = widgets.Dropdown(
                                    options=['mean', 'coeff_variation', 'min', 'max', 'median', 'p10', 'p90', 'IQR'],
                                    value='mean',
                                    description='Statisitics',
                                    disabled=False,
                                    continuous_update=False,
                                    orientation='horizontal',
                                    readout=True
                                )
        self.kind = self.w_kind.value
        self.w_iters = widgets.SelectionSlider(
                                options=[100,500, 1000, 5000, 10000],
                                value=100,
                                description='iters',
                                disabled=False,
                                continuous_update=False,
                                orientation='horizontal',
                                readout=True
                            )
        self.rv = stats.lognorm(shape, loc, median)
        self.w_rv = widgets.Output()
        plt.close()
        with self.w_rv:
            self.w_rv.clear_output()
            self.fig, axes = plt.subplots(nrows=4, ncols=1)
            self.ax1, self.ax2, self.ax3, self.ax4 = axes
            self.ax1.set_title("Log Normal Distribution")
            self.ax2.set_title("Statistics")
            self.ax3.set_title("Stat_Variation_n")
            self.ax3.set_title("Stat_Variation_iter")
            self.fig.canvas.toolbar_visible = False
            self.fig.canvas.header_visible = False
            self.fig.canvas.footer_visible = False
#             self.fig.canvas.layout.width = '100%'
#             self.fig.canvas.layout.height = '6in'
#             self.fig.canvas.layout.width = '6in'
            self.fig.canvas.resizable = True
            self.fig.canvas.capture_scroll = True
            self.fig.tight_layout()
            
        self.range_store = {}
        self.w_range_min = widgets.FloatText(value=0, description='r_min',disabled=False)
        self.w_range_max = widgets.FloatText(value=100, description='r_max',disabled=False)
        self.w_range_reinit = widgets.Button(disabled=False, description='ReInitialize')
        self.w_rv.layout.display = 'flex-grow'
        self.xs = xs
        self.sample_d = None
        self.r_min = None
        self.r_max = None
        self.df = pd.DataFrame(columns=['n', 'iters', 'kind','mean_d','std_d'])
        self.update_rv()
        self.link_widgets()
        
#         self.update_rv_plot()
#         self.update_stat_plot()
        
    def init_kind_range_store(self, kind, r_min, r_max):
        self.update_range_store(kind, r_min, r_max)
        self.w_range_min.value = r_min
        self.w_range_max.value = r_max
            
    def update_range_store(self, kind, r_min, r_max):
        self.range_store[kind] = (r_min, r_max)
    
    def link_widgets(self):
        self.w_shape.observe(self.handle_shape_change, "value")
        self.w_loc.observe(self.handle_loc_change, "value")
        self.w_median.observe(self.handle_median_change,'value')
        self.w_n.observe(self.handle_n_change, 'value')
        self.w_iters.observe(self.handle_iters_change, 'value')
        self.w_kind.observe(self.handle_kind_change, 'value')
        self.w_range_min.observe(self.handle_w_range_min_change, 'value')
        self.w_range_max.observe(self.handle_w_range_max_change, 'value')
        self.w_range_reinit.on_click(self.handle_click)
        
#         plt.show()
    
    
    def handle_click(self, b):
        mean = self.sample_d.mean()
        std = self.sample_d.std()
        self.r_min = np.floor(mean-4*std)
        self.r_max = np.ceil(mean+4*std)
        self.init_kind_range_store(self.kind, self.r_min, self.r_max)
        plt.show()
        
    def reset(self):
        self.update_range_store(self.kind, self.r_min, self.r_max)
        self.ax2.set_xlim(self.r_min, self.r_max)
        plt.show()
#         self.fig.canvas.draw()
#         plt.draw()
        
    def handle_w_range_max_change(self, change):
        self.r_min = self.w_range_min.value
        self.r_max = change.new
        self.reset()
        
    def handle_w_range_min_change(self, change):
        self.r_min = change.new
        self.r_max = self.w_range_max.value
        self.reset()

#         plt.show()
    
    def handle_shape_change(self, change):
        self.shape = change.new
        self.update_rv()
        
    def handle_loc_change(self, change):
        self.loc = change.new
        self.update_rv()
        
    def handle_median_change(self, change):
        self.median = change.new
        self.update_rv()
        
    def handle_n_change(self, change):
        self.n = change.new
        self.update_stat_plot()
        self.update_var_plot()
        
    def handle_iters_change(self, change):
        self.iters = change.new
        self.update_stat_plot()
        self.update_var_plot()
        
    def handle_kind_change(self, change):
        self.kind = change.new
        self.update_stat_plot()
        self.ax3.clear()
        self.update_var_plot()
        
    def update_rv_plot(self):
#         self.ax1.set_xlim([np.random.randint(100),100])
        self.ax1.clear()
        self.ax1.set_title("Log Normal Distribution")
        self.ys = self.rv.pdf(self.xs)
        self.ax1.plot(self.xs, self.ys, linewidth=4, color=COLOR1)
        update_text(self.ax1, 0.97, 0.95,  f"$\sigma$={self.shape}")
        update_text(self.ax1, 0.97, 0.85,  f"$loc$={self.loc}")
        update_text(self.ax1, 0.97, 0.75,  f"$\mu$={self.median}")
        plt.show()
        
    def update_stat_plot(self):
        self.ax2.clear()
        self.ax2.set_title(f"Statistics[{self.w_kind.value}]")
        self.sample_d = computing_sampling_distribution(self.rv, self.w_n.value, 
                                                       self.w_iters.value, 
                                                       kind=self.w_kind.value)
        mean = self.sample_d.mean()
        std = self.sample_d.std()
        self.r_min = np.floor(mean-4*std)
        self.r_max = np.ceil(mean+4*std)
        a,b = np.percentile(self.sample_d,[10, 90])
        self.ax2.hist(self.sample_d, bins=30, color=COLOR2)
        self.ax2.axvline(mean, color=COLOR4)
        self.ax2.axvline(mean -3*std, color=COLOR5)
        self.ax2.axvline(mean +3*std, color=COLOR5)
        self.ax2.axvline(a, color=COLOR3)
        self.ax2.axvline(b, color=COLOR3)
        self.df.loc[self.df.shape[0]] = [self.w_n.value, self.w_iters.value, self.w_kind.value, mean, std]
        if self.kind in self.range_store:
            self.ax2.set_xlim(*self.range_store[self.kind])
            pass
        else:
            self.init_kind_range_store(self.kind, self.r_min, self.r_max)
            self.ax2.set_xlim(self.r_min, self.r_max)
        update_text(self.ax2, 0.03, 0.95,  f"$\sigma_d$={std:0.2f}", align='left')
        update_text(self.ax2, 0.03, 0.85,  f"$\mu_d$={mean:0.2f}", align='left')
        update_text(self.ax2, 0.97, 0.95,  f"$p10_d$={a:0.2f}")
        update_text(self.ax2, 0.97, 0.85,  f"$p90_d$={b:0.2f}")
        plt.show()
        
        
    def update_var_plot(self):
#         self.df.plot.scatter('n', 'value', ax=self.ax3)
        sel = self.df[self.df.kind==self.kind]
        sel.plot.scatter(x='n', y='mean_d', s=sel['std_d']*200, ax=self.ax3, logx=True)
        self.ax3.set_ylim(*self.range_store[self.kind])
        sel.plot.scatter(x='iters', y='mean_d', s=sel['std_d']*200, ax=self.ax4, logx=True)
        self.ax4.set_ylim(*self.range_store[self.kind])
#         sns.scatterplot()
        
        plt.show()
        
    def update_rv(self):
        self.rv = stats.lognorm(self.shape, self.loc, self.median)
#         self.w_rv.clear_output()
        self.update_rv_plot()
        self.update_stat_plot()
        self.update_var_plot()
        
    def view(self):
#         with self.w_rv:
#             display(self.ax.figure)
        self.widget_setter_label = widgets.Label("Parameters", position='center')
        self.widget_setter = widgets.VBox([self.widget_setter_label,
                                           widgets.HBox([self.w_shape, self.w_loc, self.w_median])])
        self.widget_simcontrol = widgets.HBox([self.w_n, self.w_iters, self.w_kind])
        self.range_control = widgets.HBox([self.w_range_min, self.w_range_max, self.w_range_reinit])
        ui = widgets.VBox([self.widget_setter,self.widget_simcontrol,self.range_control, self.w_rv])
        return ui
lnv = LogNormVisualizer(0.23, 0, 70.8)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-34708152cec1> in <module>
----> 1 lnv = LogNormVisualizer(0.23, 0, 70.8)

<ipython-input-25-7235e5af6dae> in __init__(self, shape, loc, median, xs)
     93         self.r_max = None
     94         self.df = pd.DataFrame(columns=['n', 'iters', 'kind','mean_d','std_d'])
---> 95         self.update_rv()
     96         self.link_widgets()
     97 

<ipython-input-25-7235e5af6dae> in update_rv(self)
    233 #         self.w_rv.clear_output()
    234         self.update_rv_plot()
--> 235         self.update_stat_plot()
    236         self.update_var_plot()
    237 

<ipython-input-25-7235e5af6dae> in update_stat_plot(self)
    190         self.ax2.clear()
    191         self.ax2.set_title(f"Statistics[{self.w_kind.value}]")
--> 192         self.sample_d = computing_sampling_distribution(self.rv, self.w_n.value, 
    193                                                        self.w_iters.value,
    194                                                        kind=self.w_kind.value)

TypeError: computing_sampling_distribution() got an unexpected keyword argument 'kind'
../../_images/02_sampling_46_1.png
lnv.rv.mean(), lnv.rv.std()
# df = lnv.df
# df.loc[df.shape[0]] = [1,2,3,4]; df
lnv.view()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-73bcf06dbfee> in <module>
----> 1 lnv.view()

NameError: name 'lnv' is not defined
# lnv.df[lnv.df.kind=='mean'].plot.scatter(x='n',y='std_d',  ax=lnv.ax3)
lnv.df[lnv.df.kind=='mean'].plot.scatter(x='n',y='std_d', s='mean_d', ax=lnv.ax3)
# lnv.df
np.percentile?
# plt.xlim?
w = widgets.IntText()
l = widgets.IntText()
t = 2
def handle_change(change):
    t = change.new
    l.value = t
    
w.observe(handle_change, "value")
widgets.VBox([w,l])
# widgets.Label?
# display?

Estimating PI Computationally#

  • Area of circle => \(\pi r^2\)

  • Area of square of size 2r => 4r^2

rv = stats.uniform(-1,2)
%matplotlib inline
plt.plot(np.linspace(1,10,10), rv.rvs(10))
plt.show()
../../_images/02_sampling_63_0.png
@interact
def estimate_pi(n=(1,1000000)):
    x = stats.uniform(-1,2)
    y = stats.uniform(-1,2)
    dist = np.sqrt( x.rvs(n)**2+y.rvs(n)**2)
#     print(dist[:5])
    in_circle = dist <= 1
    pi = sum(in_circle)*4/n
    return pi
#     print(f"Estimated PI {pi} with n:{n}")
s = [estimate_pi(10000) for trials in range(1000)]
sample = np.array(s)
plt.hist(sample, bins=40)
plt.show()
../../_images/02_sampling_66_0.png
sample.mean(), sample.std(), np.percentile(sample, [5,95])
(3.1407156, 0.017060112445115943, array([3.1116 , 3.16882]))

Part 2#

What if we don’t know the underlying assumption?

  • Take the Sample

  • Generate Model for Population from Sample

  • Calculate Sampling Statistics by running experiments on this model

sample = [1,2,3,4,5,2,1 ,4]
n = len(sample)
np.random.choice(sample, 
                 n, 
                 replace=True) # New sample from original sample , after this you can run experiment and calculate sampling statistics
array([4, 1, 1, 3, 2, 5, 5, 5])

Resampling#

class Resampler(object):
    def __init__(self, sample, xlim=None, iters=1000):
        self.sample = sample
        self.n = len(sample)
        self.iters = iters 
    
    def resample(self):
        new_sample = np.random.choice(self.sample, self.n, replace=True)
        return new_sample
    
    
    def sample_stat(self, sample):
        return sample.mean()
    
    def computing_sampling_distribution(self):
        stats = [self.sample_stat(self.resample()) for i in range(self.iters)]
        return np.array(stats)
    
    def plot_summary_statistics_distribution(self):
        fig, ax = plt.subplots(1)
        stats = self.computing_sampling_distribution()
        
        mean = stats.mean()
        SE = stats.std()
        
        conf_int = np.percentile(stats, [2.5, 97.5]) # 95% confidence interval
        plt.axvline(mean, color=COLOR1)
        plt.axvline(conf_int[0], color=COLOR1)
        plt.axvline(conf_int[1], color=COLOR1)
#         plt.xlim(mean-4std, )
        plt.hist(stats, color=COLOR4)
        plt.show()
# s = np.random.random([1, 100]).flatten()
s = weight.rvs(10)
rsample = Resampler(s, iters=1000)
rsample.sample.shape
(10,)
# stat = rsample.computing_sampling_distribution()
# stat.shape
rsample.plot_summary_statistics_distribution()
../../_images/02_sampling_78_0.png
 np.random.choice([1,2,3],3, replace=True)
array([3, 1, 3])
plt.plot(s)
[<matplotlib.lines.Line2D at 0x7feaabb10430>]
../../_images/02_sampling_80_1.png
xs = np.linspace(20, 160, 100)
plt.plot(weight.pdf(xs))
[<matplotlib.lines.Line2D at 0x7feaabc22d60>]
../../_images/02_sampling_82_1.png
type(weight)
scipy.stats._distn_infrastructure.rv_frozen
type(sample)
list
class StdResampler(Resampler):
    def sample_stat(self, sample):
        return sample.std()
s = weight.rvs(1000)
rsample = StdResampler(s, iters=1000)
rsample.sample.shape
(1000,)
rsample.plot_summary_statistics_distribution()
../../_images/02_sampling_87_0.png

Part 3#

In this section we will implement above for multiple group problems

mu1, sig1 = 178, 7.7
male_height = stats.norm(mu1, sig1); male_height
<scipy.stats._distn_infrastructure.rv_frozen at 0x7feaab9c58e0>
mu2, sig2 = 163, 7.3
female_height = stats.norm(mu2, sig2); female_height
<scipy.stats._distn_infrastructure.rv_frozen at 0x7feaab9ec4c0>
n = 1000
male_sample = male_height.rvs(1000)
female_sample = female_height.rvs(1000)
male_mean = male_height.mean()
female_mean = female_height.mean()
male_std = male_height.std()
female_std = female_height.std()
difference = (mu1 - mu2)

relative_difference_by_male = difference/mu1*100
relative_difference_by_female = difference/mu2*100
relative_difference_by_male, relative_difference_by_female
(8.426966292134832, 9.202453987730062)
thresh = (male_mean*female_std+female_mean*male_std)/(male_std+female_std); thresh
170.3
male_overlap = sum((male_sample < thresh))/ len(male_sample)
male_overlap 
0.163
female_overlap = sum((female_sample > thresh))/ len(female_sample);
female_overlap
0.139
overlap = (male_overlap + female_overlap)/2; overlap
0.15100000000000002
prob_superiority = sum(male_sample > female_sample)/(len(male_sample)+len(female_sample))
prob_superiority = (male_sample > female_sample).mean()
prob_superiority
0.929
def overlap(grp1_sample, grp2_sample):
    """
        grp1: Control
        grp2: Treatment
    """
    
#     grp1_sample = grp1.rvs(1000)
#     grp2_sample = grp2.rvs(1000)
    m1 = grp1_sample.mean()  #grp1.mean()
    m2 = grp2_sample.mean()  #grp2.mean()
    s1 = grp1_sample.std()  #grp1.std()
    s2 = grp2_sample.std()  #grp1.std()
    thresh = (m1*s2+m2*s1)/(s1+s2)
    
    grp1_overlap = sum(grp1_sample<thresh)/ len(grp1_sample)
    grp2_overlap = sum(grp2_sample>thresh)/ len(grp2_sample)
    misclassification_rate = (grp1_overlap+grp2_overlap)/2
    return misclassification_rate
    
grp1_sample = male_height.rvs(1000)
grp2_sample = female_height.rvs(1000)

overlap(grp1_sample, grp2_sample)
0.166
def prob_superior(grp1_sample, grp2_sample):
    # Assumes same size 
    assert len(grp1_sample) == len(grp2_sample)
    return(grp1_sample>grp2_sample).mean()
prob_superior(grp1_sample, grp2_sample)
0.932
grp1_sample.var()
58.99469052969996
def cohen_effect(grp1_sample, grp2_sample):
    diff = grp1_sample.mean() - grp2_sample.mean()
    var1, var2 = grp1_sample.var(), grp2_sample.var()
    n1, n2 = len(grp1_sample), len(grp2_sample)
    pooled_var = (n1*var1+n2*var2)/(n1+n2)
    d = diff/np.sqrt(pooled_var)
    return d
cohen_effect(grp1_sample, grp2_sample)
1.9942476570475194
def color_data(inp, CI):
    a, b = CI
    t1 = inp< a
    t2= inp>b                                       
    color_data = (t1+t2).astype(np.int)
    return color_data

class MultiGrpResampler(object):
    def __init__(self, sample1, sample2, xlim=None, iters=1000, summary_stat='cohen'):
        self.sample1 = sample1
        self.sample2 = sample2
        self.xlim = xlim
        self.iters = iters
        self.summary_stat = summary_stat
    
    def resample(self):
        new_sample1 = np.random.choice(self.sample1, len(self.sample1), replace=True)
        new_sample2 = np.random.choice(self.sample2, len(self.sample2), replace=True)
        return new_sample1, new_sample2
    
    def calc_summary_stat(self, sample1, sample2):
        if self.summary_stat == 'cohen':
            return cohen_effect(sample1, sample2)
        if self.summary_stat == 'overlap':
            return overlap(sample1, sample2)
        if self.summary_stat == 'prob_superiority':
            return prob_superior(sample1, sample2)
        
        
    def compute_sampling_distribution(self):
        summary_stats = [self.calc_summary_stat(*self.resample()) for i in range(self.iters)]
        return np.array(summary_stats)
    
    
    def sampling_distribution(self):
        summary_stats = self.compute_sampling_distribution()
        
        mean = summary_stats.mean()
        std = summary_stats.std()
        CI = np.percentile(summary_stats, [5, 95]) # 90% confidence interval
        
        return  summary_stats, mean, std, CI
    
    
    def plot_sampling_distribution(self):
        summary_stats, mean, std, CI = self.sampling_distribution()
        bins = 30
#         hist_data = np.histogram(summary_stats, bins)[1]
        x_sc = LinearScale()
        y_sc = LinearScale()
        col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
        y,edges = np.histogram(summary_stats, bins)
        centers = 0.5*(edges[1:]+ edges[:-1])
        cdata = np.array(col_sc.colors)[color_data(centers, CI)].tolist()
        ax_x = Axis(scale=x_sc, tick_format='0.3f')
        ax_y = Axis(scale=y_sc, orientation='vertical')
        vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
        vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
        vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
        hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=bins, colors=cdata)
        fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, 
                     title=f'Sampling Distribution {self.summary_stat}' )
#         print(fig.marks[0].colors)
        
        return fig
        
        
        
    
sample1 = male_height.rvs(1000)
sample2 = female_height.rvs(1000)

rsampl = MultiGrpResampler(sample1, sample2,)
summary_stats, mean, std, CI = rsampl.sampling_distribution()
mean, std, CI
(1.988103230980105, 0.05500478400582647, array([1.89740944, 2.07562714]))
def color_data(inp, CI):
    a, b = CI
    t1 = inp< a
    t2= inp>b                                       
    color_data = (t1+t2).astype(np.int)
    return color_data
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
# hist = Hist(data=summary_stats, scales={'summary_statistics': x_sc, 'count':y_sc})

hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc})
len(hist.midpoints)
0
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
ax_x
Axis(scale=LinearScale(), tick_format='0.2f')
color_data(hist.midpoints, CI)
array([], dtype=int64)
np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
[]
hist.bins = 30
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
hist.colors
[]
# Figure(marks=[hist], axes=[ax_x, ax_y], padding=0, title='Sampling Distribution' )
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_mean, vline_a, vline_b
(Lines(colors=['orangered'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.12200944, 2.12200944]), y=array([0, 1])),
 Lines(colors=['steelblue'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.02652263, 2.02652263]), y=array([0, 1])),
 Lines(colors=['steelblue'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.21439821, 2.21439821]), y=array([0, 1])))
Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, title='Sampling Distribution 2' )

Function Call#

bins=30
y,edges = np.histogram(summary_stats, bins)
centers = 0.5*(edges[1:]+ edges[:-1])
cdata = np.array(col_sc.colors)[color_data(centers, CI)].tolist()
cdata
['Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red']
n = 100000
sample1 = male_height.rvs(n)
sample2 = female_height.rvs(1000)

rsampl = MultiGrpResampler(sample1, sample2,summary_stat='cohen')
fig = rsampl.plot_sampling_distribution()
# fig.marks[0].colors = cdata
fig
fig.marks[0]
Hist(bins=30, colors=['Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red'], interactions={'hover': 'tooltip'}, scales={'sample': LinearScale(), 'count': LinearScale(), 'color': ColorScale(colors=['MediumSeaGreen', 'Red'])}, scales_metadata={'sample': {'orientation': 'horizontal', 'dimension': 'x'}, 'count': {'orientation': 'vertical', 'dimension': 'y'}}, tooltip_style={'opacity': 0.9})
fig.marks[0].bins
30
fig
summary_stat = 'cohen'
summary_stats, mean, std, CI = rsampl.sampling_distribution()
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=30)
hist.colors
['steelblue']
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, 
             title=f'Sampling Distribution {summary_stat}' )
fig
hist.bins = 30
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
hist.colors
[]
fig
summary_stat = 'cohen'
summary_stats, mean, std, CI = rsampl.sampling_distribution()
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=30)
# hist.bins = 30
with hist.hold_sync():
    hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, 
             title=f'Sampling Distribution {summary_stat}' )
fig
np.histogram([12,3,4,5, 19], bins=3)
(array([3, 1, 1]), array([ 3.        ,  8.33333333, 13.66666667, 19.        ]))
np.histogram?