19.4 Active learning
In active learning, the goal is to identify the true predictive mapping by querying as few points as possible. There are three variants.
In query synthesis, the algorithm gets to choose any input $\mathbf{x}$ and ask for its corresponding output $y$.
In pool-based active learning, there is a large but fixed set of unlabeled points, and the algorithm gets to ask for the label of one or more of these points.
Finally, in stream-based active learning, incoming data arrives continuously, and the algorithm must choose whether or not to request a label for the current input.
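The pool-based setting can be sketched as a simple loop: score every remaining unlabeled point with an acquisition function, query the best one, and repeat. The snippet below is a minimal illustration, not a production recipe; the sine `oracle` and the distance-based `acquisition` score are toy assumptions standing in for a real labeler and a real utility such as the ones discussed in the rest of this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D pool of candidate inputs; the "oracle" labels are a noisy sine
# (an assumption for illustration -- any labeling function would do).
pool = np.linspace(0, 1, 100)
oracle = lambda x: np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal()

def acquisition(x, labeled_x):
    # Toy score: distance to the nearest labeled point (favors unexplored inputs).
    if len(labeled_x) == 0:
        return 1.0
    return min(abs(x - xl) for xl in labeled_x)

labeled_x, labeled_y = [], []
unlabeled = set(range(len(pool)))
for _ in range(10):  # query budget
    # Score every remaining pool point and query the most useful one.
    best = max(unlabeled, key=lambda i: acquisition(pool[i], labeled_x))
    labeled_x.append(pool[best])
    labeled_y.append(oracle(pool[best]))  # ask the oracle for this label only
    unlabeled.remove(best)
```

After the loop, `labeled_x` holds the 10 queried inputs; in a real system the model would be refit on the labeled set after each query.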
There are various closely related problems. In Bayesian optimization, the goal is to estimate the location of the global optimum in as few queries as possible; typically we fit a surrogate (response surface) model to the queries made so far, to decide which question to ask next.
In experiment design, the goal is to estimate the posterior $p(\boldsymbol{\theta}|\mathcal{D})$ using as little data as possible (this can be thought of as an unsupervised form of active learning).
In this section, we discuss the pool-based approach.
19.4.1 Decision-theoretic approach
We define the utility of querying $\mathbf{x}$ in terms of the value of information. Specifically, the utility of issuing query $\mathbf{x}$ is
$$U(\mathbf{x}) \triangleq \mathbb{E}_{p(y|\mathbf{x},\mathcal{D})}\left[ R(\mathcal{D}) - R(\mathcal{D} \cup \{(\mathbf{x}, y)\}) \right]$$
where $R(\mathcal{D}) = \min_{a} \mathbb{E}_{p(\boldsymbol{\theta}|\mathcal{D})}\left[\ell(\boldsymbol{\theta}, a)\right]$ is the posterior expected loss of taking some future action $a$ given the data $\mathcal{D}$ observed so far.
Unfortunately, evaluating $U(\mathbf{x})$ for each $\mathbf{x}$ is quite expensive, since for each possible response $y$ we might observe, we have to update our beliefs given $(\mathbf{x}, y)$ to see what effect it might have on our future decisions (similar to look-ahead search applied to belief states).
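To make the value-of-information computation concrete, the following sketch works through a tiny discrete example (all numbers are assumptions for illustration): two hypotheses for $\boldsymbol{\theta}$, two candidate queries, and 0-1 loss on the action of guessing $\boldsymbol{\theta}$. The risk $R(\mathcal{D})$ is then one minus the maximum posterior mass, and $U(\mathbf{x})$ can be evaluated exactly by enumerating both possible responses $y$.

```python
import numpy as np

# Two hypotheses about theta, uniform prior (toy setup, assumed for illustration).
prior = np.array([0.5, 0.5])
# p(y=1 | x, theta): rows index candidate queries x, columns index theta.
p_y1 = np.array([[0.9, 0.1],   # x=0 separates the two hypotheses
                 [0.5, 0.5]])  # x=1 is uninformative

def bayes_risk(post):
    # 0-1 loss with action = guess theta  =>  risk is 1 - max posterior mass.
    return 1.0 - post.max()

def utility(x, prior):
    # U(x) = E_{p(y|x,D)}[ R(D) - R(D + (x, y)) ], enumerated over y.
    u = 0.0
    for y in (0, 1):
        p_y_given_theta = p_y1[x] if y == 1 else 1.0 - p_y1[x]
        p_y = float(p_y_given_theta @ prior)   # predictive p(y | x, D)
        post = p_y_given_theta * prior / p_y   # Bayes update p(theta | D, x, y)
        u += p_y * (bayes_risk(prior) - bayes_risk(post))
    return u

print(utility(0, prior))  # ~0.4: the separating query reduces expected loss
print(utility(1, prior))  # 0.0: the uninformative query has no value
```

Even in this two-hypothesis toy, each candidate query requires one posterior update per possible response, which is the nested look-ahead cost mentioned above.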
19.4.2 Information-theoretic approach
In the information-theoretic approach, we avoid using a task-specific loss function, and instead focus on learning our model as well as we can.
It has been proposed to define the utility of querying $\mathbf{x}$ in terms of the information gain about the parameters $\boldsymbol{\theta}$, i.e., the reduction in posterior entropy:
$$U(\mathbf{x}) \triangleq \mathbb{H}(p(\boldsymbol{\theta}|\mathcal{D})) - \mathbb{E}_{p(y|\mathbf{x},\mathcal{D})}\left[\mathbb{H}(p(\boldsymbol{\theta}|\mathcal{D},\mathbf{x},y))\right]$$
Using the symmetry of the mutual information, we can rewrite this as
$$U(\mathbf{x}) = \mathbb{I}(\boldsymbol{\theta}; y \mid \mathcal{D}, \mathbf{x}) = \mathbb{H}(p(y|\mathbf{x},\mathcal{D})) - \mathbb{E}_{p(\boldsymbol{\theta}|\mathcal{D})}\left[\mathbb{H}(p(y|\mathbf{x},\boldsymbol{\theta}))\right]$$
The advantage of this mutual-information form is that we now only have to reason about the uncertainty of the predictive distribution over the outputs $y$, and not over the parameters $\boldsymbol{\theta}$.
Note that the entropy-reduction objective is identical to the expected change in the posterior over the parameters:
$$U'(\mathbf{x}) \triangleq \mathbb{E}_{p(y|\mathbf{x},\mathcal{D})}\left[ D_{\mathbb{KL}}\big( p(\boldsymbol{\theta}|\mathcal{D},\mathbf{x},y) \,\|\, p(\boldsymbol{\theta}|\mathcal{D}) \big) \right]$$
The mutual-information form of the objective has an interesting interpretation: the first term prefers examples $\mathbf{x}$ for which there is uncertainty in the predicted label. Using just this first term as a selection criterion is called maximum entropy sampling.
However, maximizing predictive entropy alone can have problems with examples whose labels are inherently ambiguous or mislabeled. The second term discourages this behavior, since it prefers examples whose labels are fairly certain once we know $\boldsymbol{\theta}$. In other words, this criterion selects examples for which the predictions conditioned on $\boldsymbol{\theta}$ are individually confident, but highly diverse across plausible values of $\boldsymbol{\theta}$. This approach is called Bayesian active learning by disagreement (BALD).
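The mutual-information objective is straightforward to estimate from posterior samples of $\boldsymbol{\theta}$ (e.g., an ensemble, or MC dropout): the score is the entropy of the averaged predictive distribution minus the average entropy of the per-sample predictions. A minimal sketch for binary classification, with toy probabilities chosen to contrast the behaviors discussed above:

```python
import numpy as np

def entropy(p):
    # Entropy of Bernoulli probabilities, elementwise; clipped for safety at 0/1.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald_score(probs):
    """probs[s, i] = p(y=1 | x_i, theta_s) for posterior samples theta_s.

    BALD = H[mean_s p] - mean_s H[p]: predictive entropy minus expected
    conditional entropy, i.e., the mutual information between y and theta.
    """
    return entropy(probs.mean(axis=0)) - entropy(probs).mean(axis=0)

# Three candidate inputs, scored with 2 posterior samples (toy numbers):
#   x0: both samples confident and agreeing  -> low entropy, zero BALD
#   x1: both samples maximally uncertain     -> high entropy, but zero BALD
#   x2: confident but disagreeing samples    -> high BALD (what BALD prefers)
probs = np.array([[0.95, 0.5, 0.95],
                  [0.95, 0.5, 0.05]])
print(bald_score(probs))
```

Note that maximum entropy sampling would rank $x_1$ and $x_2$ equally, whereas BALD correctly assigns $x_1$ (irreducible label noise) a score of zero.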
This method can be used to train classifiers in domains where expert labels are hard to acquire, such as medical imaging.
19.4.3 Batch active learning
So far, we have assumed a greedy strategy, in which we select a single example $\mathbf{x}$. Sometimes we have a budget to select a batch of $B$ samples, $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_B\}$. In this case, the information gain criterion becomes
$$U(X) \triangleq \mathbb{H}(p(\boldsymbol{\theta}|\mathcal{D})) - \mathbb{E}_{p(Y|X,\mathcal{D})}\left[\mathbb{H}(p(\boldsymbol{\theta}|Y, X, \mathcal{D}))\right]$$
where $Y = (y_1, \ldots, y_B)$ are the corresponding labels.
Unfortunately, optimizing this is NP-hard in the horizon length $B$.
Fortunately, the greedy strategy is near-optimal under certain conditions, as we now explain.
First note that, for any given $X$, the information gain function
$$f(Y) \triangleq \mathbb{H}(p(\boldsymbol{\theta}|\mathcal{D})) - \mathbb{E}_{p(Y|X,\mathcal{D})}\left[\mathbb{H}(p(\boldsymbol{\theta}|Y, X, \mathcal{D}))\right]$$
maps a set of labels $Y$ to a scalar.
It is clear that $f(\emptyset) = 0$, and that $f$ is non-decreasing, i.e., $f(Y^{\text{small}}) \leq f(Y^{\text{large}})$ whenever $Y^{\text{small}} \subseteq Y^{\text{large}}$, due to the "more information never hurts" principle.
It has also been shown that $f$ is submodular, i.e., it exhibits diminishing returns: adding a label to a small set helps at least as much as adding it to a superset.
As a consequence, a sequential greedy approach is within a constant factor ($1 - 1/e$) of optimal. If we combine this greedy technique with the BALD objective, we get a method known as BatchBALD.
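The greedy batch selection can be sketched as follows, under the simplifying assumptions (made here for illustration) of a small discrete hypothesis space and labels that are conditionally independent given $\boldsymbol{\theta}$, so the expected information gain of a candidate batch can be enumerated exactly; the pool probabilities are random toy values.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K, N, B = 4, 8, 3                  # hypotheses, pool size, batch budget
prior = np.full(K, 1.0 / K)
p1 = rng.uniform(0.05, 0.95, size=(N, K))  # toy p(y_i=1 | x_i, theta_k)

def joint_info_gain(S):
    # f(S) = H(theta) - E_Y[ H(theta | Y_S) ], enumerating all 2^|S| label vectors.
    H = lambda p: -np.sum(p * np.log(np.clip(p, 1e-12, None)))
    expected_post_entropy = 0.0
    for ys in itertools.product((0, 1), repeat=len(S)):
        # p(Y = ys | theta_k), labels conditionally independent given theta
        like = np.prod([p1[i] if y else 1 - p1[i]
                        for i, y in zip(S, ys)], axis=0)
        p_ys = float(like @ prior)
        expected_post_entropy += p_ys * H(like * prior / p_ys)
    return H(prior) - expected_post_entropy

# Greedy selection: repeatedly add the point with the best *joint* gain.
# Near-optimal because f is monotone submodular.
batch = []
for _ in range(B):
    rest = [i for i in range(N) if i not in batch]
    batch.append(max(rest, key=lambda i: joint_info_gain(batch + [i])))
```

Scoring the joint gain $f$ of the batch, rather than summing individual BALD scores, is what prevents the batch from filling up with near-duplicate points; this is the key design choice behind BatchBALD.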