Getting Started
To install KernelDensityEstimation.jl, it is recommended that you first add the jmert/Registry.jl package registry, which lets you install (and depend on) the package just like any other Julia package from the default General registry.
pkg> registry add https://github.com/jmert/Registry.jl
pkg> add KernelDensityEstimation
Univariate Densities
Simple kernel density estimate
For the following example, we'll use a small sample of Gaussian deviates:
using Distributions
x = rand(Normal(3, 0.1), 250)
The key interface of this package is the kde function. In its simplest incantation, you provide a vector of data and it returns a kernel density object (in the form of a UnivariateKDE structure).
using KernelDensityEstimation
K = kde(x)
The density estimate $f(x)$ is given at locations K.x (as a StepRangeLen) with density values K.f. For instance, the mean and standard deviation of the distribution are:
μ1 = step(K.x) * sum(@. K.f * K.x)
μ2 = step(K.x) * sum(@. K.f * K.x^2)
(; mean = μ1, std = sqrt(μ2 - μ1^2))
(mean = 2.9986807890073575, std = 0.11013630728702606)
These agree well with the known underlying parameters $(\mu = 3, \sigma = 0.1)$.
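As a cross-check, the same summary statistics can be computed directly from the raw samples (using Statistics from the standard library); they should closely match the values integrated from the density estimate:
using Statistics
# direct sample statistics, for comparison with the KDE-derived values above
(; mean = mean(x), std = std(x))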
Visualizing the density estimate (see Extensions — Makie.jl), we see a fair level of consistency between the density estimate and the known underlying model.
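As a minimal sketch of such a visualization (assuming a Makie backend such as CairoMakie is installed; the Makie.jl extension may also provide dedicated plotting recipes), the curve can be drawn directly from the K.x grid and K.f values and compared against the true model:
using CairoMakie
# draw the density estimate over its evaluation grid
fig, ax, _ = lines(K.x, K.f; label = "KDE")
# overlay the known underlying model for comparison
lines!(ax, K.x, pdf.(Normal(3, 0.1), K.x); label = "Normal(3, 0.1)")
axislegend(ax)
fig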
Densities of weighted samples
In some cases, the data to be analyzed is a weighted sample (represented as a vector of data values and a corresponding vector of weight factors). For instance, importance sampling of an MCMC chain results in non-uniform weights that must then be accounted for when deriving a density estimate.
Take the following toy example with a target parameter $v$ and a nuisance parameter $p$ that are correlated, where a uniform prior was assumed for $p$:
# correlation coefficient and nuisance parameter
ρ, p = 0.85, randn(500)
# correlated target parameter
v = ρ .* p .+ sqrt(1 - ρ^2) .* randn.()
Now suppose that you have reason to update your prior on $p$, believing now that positive values are twice as likely as negative ones. If the method of generating $v$ is expensive, and because the change in prior is not extreme, it may be efficient and acceptable to instead importance sample the existing values by reweighting the samples by the ratio of the priors:
\[\begin{align*}
P_1(p) &\propto 1 &
P_2(p) &\propto \begin{cases} 1 & p < 0 \\ 2 & p \ge 0 \end{cases}
\end{align*}\]
P1(z) = 1.0
P2(z) = z ≥ 0 ? 2.0 : 1.0
weights = P2.(p) ./ P1.(p)
We then simply provide these weights as a keyword argument in the call to kde:
K1 = kde(v)
K2 = kde(v; weights)
As expected, this shifts the resultant density estimate to the right, toward more positive values.
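To quantify the shift, we can compute the mean of each density estimate from its grid and values, in the same way as before; the weighted estimate should have a noticeably larger (more positive) mean:
# means of the unweighted and weighted density estimates
μ_unweighted = step(K1.x) * sum(@. K1.f * K1.x)
μ_weighted = step(K2.x) * sum(@. K2.f * K2.x)
(; μ_unweighted, μ_weighted)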
The effective sample size (UnivariateKDEInfo.neffective) is calculated from the weights using Kish's definition. Both of the bandwidth estimators (SilvermanBandwidth and ISJBandwidth) use this definition in scaling the bandwidth with the (effective) sample size.
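For reference, Kish's effective sample size is $N_\mathrm{eff} = \left(\sum_i w_i\right)^2 / \sum_i w_i^2$, which can be computed directly from the weights above (a sketch of the definition only, not necessarily the package's internal implementation):
# Kish's effective sample size: (Σ wᵢ)² / Σ wᵢ²
n_eff = sum(weights)^2 / sum(abs2, weights)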
Bivariate Densities
When two or more parameters are of interest, bivariate densities are commonly used to visualize correlations among pairs of parameters.
As an example of such correlations, take the following simulation which generates noisy realizations of data around a known line and uses ordinary least-squares regression on each realization to obtain the slope and intercept of the best-fit line. We then visualize the distribution of the recovered fit parameters accumulated over realizations.
Nsamp = 1_000
Npts = 10
# simulation
x = range(5, 15, length = Npts)
D = [x.^0 x.^1] # design matrix
yobs = (0.2 .* x .+ 3.0) .+ rand(Normal(0, 2.2), Npts, Nsamp)
# fitting
params = (D'D) \ (D'yobs)
offsets = @view params[1, :]
slopes = @view params[2, :]
Each of the vectors offsets and slopes contains one recovered value of its parameter per simulation realization, and the two are correlated across realizations. We can produce the 1D density estimates as above, and we can create the 2D density estimate by simply providing both vectors (with the order of arguments corresponding to each successive dimension of the output density).
K_slope = kde(slopes)
K_offset = kde(offsets)
K_both = kde(slopes, offsets)
From the 2D density, we clearly see the (expected) correlation between the slope and offset parameters: a relatively low slope results in a higher offset and vice versa.
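As a quick numerical complement to the visual impression, the strength of the (anti-)correlation can be computed directly from the samples with the Pearson correlation coefficient (using Statistics from the standard library):
using Statistics
# Pearson correlation between recovered slopes and offsets;
# expected to be strongly negative for this simulation
cor(slopes, offsets)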