Histograms and density estimation

It is a common practice in the statistical analysis of data to group the samples into bins or categories.  The goal is usually to extract some structure that is useful and often hidden from us when looking at direct plots of the sample set.

For instance, a time series plot of the change in stock price may appear erratic and contain no order, but observing the frequency of certain price-changes may help us realize that on average there is no change as well as give us an idea as to the practical bounds to how much the price will change.

The purpose of this note is to speak a little to the role of histograms in statistics and probability.   While a variety of complications arise when using them for the estimation of a probability density function, they serve as a very practical (and efficient) tool in exploratory data analysis and provide a grounding for other forms of density estimation.

Histogram Construction

Constructing a histogram amounts to answering a few basic questions:

  • For a given sample, which bin does it belong to?
  • How many samples are in a given bin?
  • What is the numerical range or volume of a given bin?

From here forward, we consider data samples that lie in some region of the real number line (e.g. integers, fractions, and roots of integers).  We’ll represent a sample of numbers by the vector \mathbf{x}, say \mathbf{x}=(1.2, 3.002, 1, 1, 2).

[Figure: basic histogram]

The diagram outlines some relevant quantities of a histogram.  The minimum and maximum values of the sample (1 and 3.002 in the example above) help define the range that we will distribute our bins in.  We represent the bin width as h.  It is not required that a histogram have fixed-width bins, but this is one of the most commonly used and useful formulations.

The starting offset (0 \leq b < h) is also a useful quantity.  For instance, if the first bin starts at \min(\mathbf{x}) (i.e. b=0), any number falling in the range (\min(\mathbf{x}) \leq x < \min(\mathbf{x})+h) contributes to the count of elements in that bin.  For h=0.3, the range of the first bin in the example above is (1 \leq x < 1.3).  The two 1s and 1.2 fall in the first bin.

Suppose we shift the bins a little to the left (i.e. b=0.1).  The range of the first bin is now (0.9 \leq x < 1.2).  The two 1s fall in the first bin while 1.2 falls in the second.  So both the bin width and the offset affect our view of the histogram.

The total number of bins in a histogram can be represented as:

m=\left \lceil \frac{\max{(\mathbf{x})}-\min{(\mathbf{x})}}{h} \right \rceil + 1

where \lceil x \rceil denotes the ceiling of x (the smallest integer not less than x) and \lfloor x \rfloor the floor (the largest integer not greater than x).  We add 1 to the traditional bin count to accommodate our offset b.
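
For the example sample with h=0.3, this gives m=\left \lceil (3.002-1)/0.3 \right \rceil + 1 = 7 + 1 = 8 bins.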

[Figure: integer bins in a histogram]

When constructing a histogram on a computer, we desire a function \text{bin}(x) that maps a sample to an appropriate bin index k \in \{0,1,\ldots,m-1\}.  Assuming that we begin the bins at 0, the bin index can be determined by dividing the sample by the bin width: \lfloor x/h \rfloor.  For h=0.3, it’s easy to see that numbers less than 0.3 fall in bin 0, numbers between 0.3 and 0.6 fall in bin 1, and so on.  Accounting for the minimum and the offset, we have:

\text{bin}(x) = \left \lfloor \frac{x-\min{(\mathbf{x})}+b}{h} \right \rfloor

We also desire a function that will allow us to compute the start of the interval for a given bin index:

\text{interval}(k) = kh + \min{(\mathbf{x})}-b
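
As a concrete illustration, here is a minimal Python sketch of these formulas, using the example sample with h=0.3 and b=0.1; the function and variable names (bin_of, interval, counts) are just illustrative.

    import math

    x = [1.2, 3.002, 1, 1, 2]   # the example sample
    h = 0.3                     # bin width
    b = 0.1                     # starting offset, 0 <= b < h
    lo = min(x)

    # total number of bins: m = ceil((max(x) - min(x)) / h) + 1
    m = math.ceil((max(x) - lo) / h) + 1

    def bin_of(xi):
        # bin(x) = floor((x - min(x) + b) / h)
        return math.floor((xi - lo + b) / h)

    def interval(k):
        # interval(k) = k*h + min(x) - b, the left edge of bin k
        return k * h + lo - b

    # accumulate the counts n_k
    counts = [0] * m
    for xi in x:
        counts[bin_of(xi)] += 1

    print(m, counts)            # 8 bins; the two 1s land in bin 0, 1.2 in bin 1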

The formulation above allows us to experiment with the two essential qualities of a histogram (bin width and offset).  It should also be straightforward to see how this simple approach generalizes to multi-dimensional histograms with equal bin widths; we then consider the bin volume more generally (e.g. in the three-dimensional case, the volume is v=h^3).
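
A rough sketch of the multi-dimensional case, assuming for illustration a common bin width h and offset b along every axis (the names are again hypothetical):

    import math

    # one bin index per coordinate; the bin volume is h**d when every
    # axis shares the same width h (e.g. v = h**3 in three dimensions)
    def bin_of_nd(point, mins, h, b=0.0):
        return tuple(math.floor((xi - mi + b) / h) for xi, mi in zip(point, mins))

    print(bin_of_nd((1.0, 2.5), mins=(0.0, 0.0), h=0.5))   # -> (2, 5)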

In the terminology of computer science, these are constant-time procedures.  If the bin widths are not fixed but instead set arbitrarily by the user, a more complicated method is required; one can use a specialized function or a binary search to find the appropriate bin in this case [1].  Histograms can also accept a variety of types (e.g. integers, single- or double-precision floating-point numbers) as inputs, and if the compiler knows this information ahead of time, it can optimize the code even further while providing a general interface of accumulators [2].
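
For variable-width bins, a minimal sketch of the binary-search approach using Python’s bisect module; the bin edges here are made up purely for illustration.

    import bisect

    edges = [0.0, 0.5, 1.5, 4.0, 10.0]           # m+1 sorted bin boundaries (assumed)

    def bin_index(x, edges):
        # bisect_right performs the O(log m) binary search described in [1];
        # subtracting 1 turns the insertion point into a 0-based bin index
        k = bisect.bisect_right(edges, x) - 1
        if k < 0 or k >= len(edges) - 1:
            raise ValueError("sample outside histogram range")
        return k

    print(bin_index(2.0, edges))                 # -> 2, i.e. the bin [1.5, 4.0)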

Histograms as density estimates

We now turn to a more mathematical perspective on histograms.  It is natural to treat a histogram as an estimate of a probability density function.  Once armed with the density function, a variety of tools from probability and information theory are at our disposal, but we must take care that the estimate of the density function holds true to the data sample, and ask whether we should be estimating densities in the first place to solve the problem at hand.

To get a taste for what this is like, we will derive the maximum likelihood estimator for a mathematical step function that represents the histogram.

[Figure: step-function view of a histogram]

Here, the function has m parameters: \mathbf{c}=(c_0,c_1,\ldots,c_{m-1}).  Provided our data sample and a given bin k, what is the optimal height, c_k^*, of the step function?

To find an estimate of the density function p_H(x; \mathbf{c}) we express the likelihood of our sample of data given the model parameters as:

\mathcal{L}(\mathbf{x}|\mathbf{c}) = \displaystyle \prod_{i=0}^{n-1}p_H(x_i; \mathbf{c}) = \displaystyle \prod_{k=0}^{m-1} c_k^{n_k}

Here, n_k represents the number of observed samples in bin k.  In other words, we’d like to find the c_k^* that maximizes the likelihood of observing n_k samples in bin k.  This optimization is subject to the density function constraint:

\displaystyle \sum_{k=0}^{m-1} c_k v_k = 1

where v_k is the volume of bin k.

Constrained optimizations of this form can be solved using the method of Lagrange multipliers.  In this case, we have a multidimensional function:

f(\mathbf{c}) = \log \mathcal{L}(\mathbf{x}|\mathbf{c}) = \displaystyle \sum_{k=0}^{m-1} \log c_k^{n_k} = \displaystyle \sum_{k=0}^{m-1} n_k\log c_k

subject to the constraint that:

g(\mathbf{c})= \displaystyle \sum_{k=0}^{m-1} c_k v_k - 1 = 0

Note that we took the log of the likelihood function so we can deal with sums rather than products.  This will come in handy when taking derivatives in a moment and is a common practice.

The gradient of a multidimensional function is a vector field in which each point (in our case, a parameter vector \mathbf{c}) has an associated vector pointing in the direction of steepest ascent on the multidimensional surface [3].  Our goal is to find where the gradient of f(\mathbf{c}) is parallel to the gradient of g(\mathbf{c}).  This condition is expressed as:

\nabla f(\mathbf{c}) = \lambda \nabla g(\mathbf{c})

And we can express this combined optimization with:

\nabla \Lambda(\mathbf{c}, \lambda) = \nabla f(\mathbf{c}) - \lambda \nabla g(\mathbf{c}) = 0

where \Lambda(\mathbf{c}, \lambda) = f(\mathbf{c}) - \lambda g(\mathbf{c}) is called the “Lagrangian”.  Computing the derivative with respect to each c_k:

\frac{\partial \Lambda(\mathbf{c}, \lambda)}{\partial c_k} = \frac{n_k}{c_k} - \lambda v_k = 0

From this, we can see that

\frac{1}{\lambda}n_k = c_kv_k

Summing both sides over k and using \sum_{k=0}^{m-1} c_k v_k = 1 and \sum_{k=0}^{m-1} n_k = n gives n/\lambda = 1, so \lambda=n.  Thus our optimal height and estimate of the density is:

c_k^*= \hat{p}_H(x) = \frac{n_k}{nv_k}, \; \text{for all } x \text{ in bin } k
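
As a quick sanity check, a short Python sketch that builds this estimate from the counts of the equal-width example above (so v_k = h) and verifies that it satisfies the density constraint:

    h = 0.3
    counts = [2, 1, 0, 1, 0, 0, 0, 1]   # the n_k from the earlier sketch
    n = sum(counts)                     # n = 5

    # optimal heights c_k* = n_k / (n * v_k), with v_k = h here
    density = [n_k / (n * h) for n_k in counts]

    # the estimate integrates to one: sum_k c_k* v_k = 1
    print(sum(c * h for c in density))  # -> 1.0 (up to floating-point error)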

While this estimator is quite an intuitive result, it is important to observe that it holds only in the special case of step functions with predefined bin volumes.  What if, for example, we wanted to find the best estimator for a piecewise linear function or a spline between bins?  Or what if we also want to find out what the bin volumes v_k should be?  These questions motivate a general discussion of estimating the probability density function from the data sample \mathbf{x}, and such “data-driven” approaches are dubbed non-parametric [4].

Similar issues as those brought up in density estimation also arise in the estimation of mutual information [5].

—–

[1] Given a sample, start at the center-most bin.  If the sample lies above that bin’s range, repeat the procedure on the right half; if it lies below, repeat on the left half.  Recurse until the desired bin is found.  The running time of this scales like \log_2(m).  The GNU Scientific Library provides utilities in this spirit.

[2] See the C++ Boost Accumulator library for more information on this.

[3] For a more detailed tour of why this is, as well as an intuitive geometrical interpretation via contour plots, consult an introductory applied math text.  Setting f=c, we can study the function’s “surface contours”.  By definition, a small displacement \mathbf{x} along a contour does not change the value of f, therefore \nabla f \cdot \mathbf{x} = 0.  Because of this, \nabla f must be normal to \mathbf{x}.

[4] Scott DW, Sain SR. “Multidimensional density estimation”. Handbook of Statistics: Data Mining and Computational Statistics 23. 297-306 (2004). Gentle JE. “Nonparametric estimation of probability density functions”. Elements of Computational Statistics. 205-231 (2002).

[5] Paninski L. “Estimation of entropy and mutual information”. Neural Computation 15. 1191-1253 (2003). Fraser AM, Swinney HL. “Independent coordinates for strange attractors from mutual information”. Physical Review A 33. 1134-1140 (1986).

3 comments

  1. Erik · November 17, 2008

    I like the new blag.

    …a time series plot of the change in stock price may appear erratic and contain no order, but observing the frequency of certain price-changes may help us realize that on average there is no change

    …further depressing the poor investors. Seriously, Geet, can’t you find a statistical method that will fool us into thinking we’re making money?

  2. nequitans · November 17, 2008

    Ha, I should have qualified that statement with a phrase like “daily change ratio”.

    [Scott DW, Sain SR. “Multidimensional density estimation”. Handbook of Statistics: Data Mining and Computational Statistics 23. 297-306 (2004).]

  3. nequitans · December 3, 2008

    Update: \mathcal{L}(\mathbf{c}|\mathbf{x}) was changed to \mathcal{L}(\mathbf{x}|\mathbf{c}) . The original notation implies the “chance of the model parameters given the data” and what we’re interested in is maximizing “likelihood of our observed data given some model parameters”. We therefore always keep our data set fixed and change the parameters so that the overall probability is maximized. This notational difference may seem pedantic, but it is important from the perspective of statistical theory. The latter notation doesn’t require a “prior” probability. Unfortunately, I think some literature can be a little ambiguous on the matter. I like to think of the likelihood itself as a conditional probability, but some literature (especially those that aren’t particularly statistically-bent) will not treat the function like that and place less importance on “what’s given what”.
