Luk Arbuckle

Centroid estimation in discrete high-dimensional spaces

In estimation on 1 June 2008 at 11:15 pm

A point estimate of a parameter is a single number intended to be as close as possible to the true value of the parameter. It’s unlikely to be exactly equal to the parameter it’s trying to estimate, even when it is chosen as the single most probable solution, but it’s an important starting point for constructing a confidence interval.

The method of maximum likelihood is a general method of obtaining single point estimators with three desirable properties (at least in low-dimensional continuous spaces):

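  • consistency (the estimator converges in probability to the true parameter value)
  • normality (the estimator is approximately normally distributed about the true parameter value)
  • efficiency (the smallest variance attainable in the limit)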

But these properties only hold asymptotically (i.e., they’re properties that exist in the limit), and only properly for continuous variables—they are not achieved for high-dimensional discrete unknowns.
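To make the asymptotic nature of these properties concrete, here is a minimal sketch of consistency in the simplest low-dimensional case. It assumes a Bernoulli model, where the maximum likelihood estimate of the success probability is just the sample mean. The function names, the true value 0.3, and the seed are illustrative choices, not anything taken from the article.

```python
import random


def bernoulli_mle(samples):
    """Maximum likelihood estimate of a Bernoulli success probability.

    For 0/1 outcomes the likelihood is maximized by the sample mean.
    """
    return sum(samples) / len(samples)


def simulate(p_true, n, seed=0):
    """Draw n Bernoulli(p_true) outcomes with a fixed seed (illustrative)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_true else 0 for _ in range(n)]


p_true = 0.3  # assumed "true" parameter for the demonstration
for n in (10, 1_000, 100_000):
    estimate = bernoulli_mle(simulate(p_true, n))
    # The error tends to shrink as n grows, which is what consistency promises.
    print(f"n={n:>6}  estimate={estimate:.4f}  error={abs(estimate - p_true):.4f}")
```

With only ten observations the estimate can be well off; the guarantee is about what happens as the sample size grows, not about any particular finite sample.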

A discrete high-dimensional sample space is partitioned into so many parts that a single point estimator will likely have very low probability. In previous work it was hoped that the single point estimator would be surrounded by similar solutions that would together form a greater “probability mass”. Examples exist, however, that demonstrate that this is not always the case.
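A toy construction shows how the most probable point can be isolated. This is a sketch under made-up assumptions, not an example from the article: the sample space is all binary vectors of length 10, the all-ones vector gets probability 0.10, and the remaining 0.90 is spread evenly over the 56 vectors with at most two ones.

```python
from itertools import product


def hamming(x, y):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def build_distribution(n=10, mode_mass=0.10, radius=2):
    """Hypothetical distribution: an isolated mode at all-ones, with the
    rest of the mass spread uniformly near all-zeros (requires n > 2 * radius)."""
    near_zero = [x for x in product((0, 1), repeat=n) if sum(x) <= radius]
    dist = {x: (1 - mode_mass) / len(near_zero) for x in near_zero}
    dist[(1,) * n] = mode_mass
    return dist


def mass_within(dist, center, radius):
    """Total probability of points within a Hamming ball around center."""
    return sum(p for x, p in dist.items() if hamming(x, center) <= radius)


dist = build_distribution()
mode = max(dist, key=dist.get)  # the single most probable point
zeros = (0,) * 10

print("mode:", mode, "probability:", round(dist[mode], 3))
print("mass within distance 2 of the mode:", round(mass_within(dist, mode, 2), 3))
print("mass within distance 2 of all-zeros:", round(mass_within(dist, zeros, 2), 3))
```

Here the mode is the all-ones vector, yet it has no probable neighbours: nine tenths of the probability sits in a small neighbourhood of the all-zeros vector, on the opposite side of the space.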

Where’s the point?
In a publicly available article published in the Proceedings of the National Academy of Sciences, researchers Luis Carvalho and Charles Lawrence at Brown University discuss a class of “centroid” estimators they developed to be more representative of the information contained in discrete high-dimensional sample spaces. The authors prove that the centroid estimator minimizes the expected difference between the parameter and the estimate (for important loss functions), and that it is the closest point to the mean.
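For intuition only, here is a sketch of the simplest case I can think of: unconstrained binary vectors with Hamming distance as the loss. In that setting the expected loss splits coordinate by coordinate, so the minimizer sets each coordinate to 1 when its marginal probability exceeds one half, which is also the binary point closest to the mean vector. The five-point distribution below is invented for illustration, and the sketch ignores the constrained solution spaces that the article is really concerned with.

```python
from itertools import product

# Hypothetical distribution over binary vectors of length 3 (sums to 1).
posterior = {
    (1, 1, 1): 0.30,
    (0, 0, 0): 0.28,
    (1, 0, 0): 0.22,
    (0, 1, 0): 0.10,
    (0, 0, 1): 0.10,
}


def hamming(x, y):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def expected_hamming_loss(candidate, dist):
    """Average Hamming distance from candidate to a draw from dist."""
    return sum(p * hamming(candidate, x) for x, p in dist.items())


def marginals(dist):
    """Probability that each coordinate equals 1 (the mean vector)."""
    n = len(next(iter(dist)))
    return [sum(p for x, p in dist.items() if x[i] == 1) for i in range(n)]


def centroid(dist):
    """Threshold each marginal at one half (unconstrained binary case only)."""
    return tuple(1 if m > 0.5 else 0 for m in marginals(dist))


mode = max(posterior, key=posterior.get)
center = centroid(posterior)
# Brute-force check over all 8 candidates.
best = min(product((0, 1), repeat=3),
           key=lambda c: expected_hamming_loss(c, posterior))

print("mean vector:", [round(m, 2) for m in marginals(posterior)])
print("mode:", mode, "loss:", round(expected_hamming_loss(mode, posterior), 2))
print("centroid:", center, "loss:", round(expected_hamming_loss(center, posterior), 2))
print("brute-force minimizer:", best)
```

In this toy case the most probable point, (1, 1, 1), is not the point with the smallest expected Hamming loss; the thresholded mean, (1, 0, 0), is, and the brute-force search agrees.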

The authors highlight published results that suggest these alternative estimators offer improved representation of data in practice, and provide some interesting examples from computational biology. They warn the reader, however, that only a few applications have been studied so far, and feasibility has not been shown for all cases.

In the concluding remarks the authors share some insight into future challenges:

Rapid improvements in data acquisition technologies promise to continue to dramatically increase the pool of data in many fields. Although these data will be of great benefit, they also have opened a new universe of high-dimensional inference and prediction problems that will likely provide major data analytic challenges in the coming decades. Among these is the development of point estimators in discrete spaces that are the focus of the centroid estimators developed here.

But the more general point estimation challenge is to find one or a small number of feasible solutions among the many in the ensemble that is by some appropriate measure representative of the full ensemble and suitable for the data structural features of the solution space. These new high-dimensional data and unknowns will also almost certainly force a reexamination of extant approaches to interval estimation, hypothesis tests, and predictive inference.

  1. I read this article, but I don’t understand Theorem 3, which says that the centroid estimator is the closest point to the mean.
    Thanks for a possible response.

  2. I believe they are using mean and center of mass interchangeably.
