Using Unreliable Labels with Active Learning for Efficient Model Training
Nicholas Deas, D. Hudson Smith
(ndeas@clemson.edu) (dane2@clemson.edu)
We examine how a novel workflow combining clustering and active learning, guided by user feedback, can minimize the manual labeling required to topic-model an unlabeled text corpus.
How can we leverage unreliable labels to improve the convergence of models trained under the Active Learning framework?
Active Learning strategically samples new points for labeling based on their predicted informativeness, lessening the total amount of manual labeling required of the researcher or user.
Query Strategies
Query strategies are the metrics by which Active Learning selects the most informative samples for training.
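As a concrete sketch of one such metric, the snippet below ranks an instance pool by predictive entropy, a standard uncertainty-based query strategy (the function names and the toy probabilities are our own illustration, not taken from the poster):

```python
import numpy as np

def entropy_scores(probs):
    """Predictive entropy of each row of class probabilities; higher = less certain."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def query_most_informative(probs, k):
    """Indices of the k pool samples with the highest predictive entropy."""
    return np.argsort(entropy_scores(probs))[::-1][:k]

probs = np.array([
    [0.90, 0.05, 0.03, 0.02],   # confident prediction -> low entropy
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain -> high entropy
])
```

Querying `k=1` from this toy pool would select the second, maximally uncertain sample.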
Example of Unreliable Label Instance Pool
Unreliable Label Incorporation Methods
Using the sampling metrics above, incorporation methods determine how newly queried instances are chosen and included in training.
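One generic way to incorporate queried samples is to overwrite their unreliable labels with the expert's labels after each query round; the sketch below assumes this replace-on-query scheme (the exact incorporation rule used on the poster may differ):

```python
import numpy as np

def incorporate_expert_labels(labels, is_expert, queried_idx, expert_labels):
    """After a query round, overwrite the unreliable labels of the queried
    instances with expert labels and mark those instances as reliable."""
    labels = labels.copy()
    is_expert = is_expert.copy()
    labels[queried_idx] = expert_labels
    is_expert[queried_idx] = True
    return labels, is_expert
```

Keeping the `is_expert` mask alongside the labels lets later training steps weight the two label sets differently.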
Dataset: AG News
Description: Text corpus of news article titles concatenated with their descriptions, drawn from over 2,000 news sources
Classes: 4 primary classes (Business, Science/Tech, World, Sports)
Full Size: 30,000 samples per class
Subset Size: 2,500 samples per class
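A class-balanced subset like the 2,500-per-class one above can be drawn with a simple stratified sampler; this helper is our own sketch, not the poster's preprocessing code:

```python
import numpy as np

def stratified_subset(labels, per_class, seed=0):
    """Return indices drawing an equal number of samples per class,
    without replacement, for a class-balanced training subset."""
    rng = np.random.default_rng(seed)
    idx = [rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
           for c in np.unique(labels)]
    return np.concatenate(idx)
```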
Our methods differ from standard Active Learning primarily in their use of unreliable labels, which provide the model with basic but imperfect information from the start.
Testing Metric
AUROC – a threshold-independent measure of performance, used here for model comparisons
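For a binary comparison, AUROC can be computed without choosing any threshold as the probability that a randomly chosen positive outranks a randomly chosen negative; this rank-statistic implementation is a minimal illustration (multiclass evaluation would additionally need one-vs-rest averaging):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability a random positive scores above a random
    negative; ties count as half. Equivalent to the area under the ROC curve."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranker scores 1.0, a random one about 0.5, regardless of where any decision threshold is placed.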
Ratio Method – Sample Querying and Incorporation
Active Learning vs. Random Sampling
In this case, Active Learning strategies alone do not tend to improve model training or performance.
Without alteration, no model using standard Active Learning alone approached either of the two theoretical performance maximums.
Training Strategies
Alpha Weighting
Alpha – the weight of the expert (true) label set relative to the unreliable label set, where alpha/(2-alpha) gives the relative total weight of the true label set versus the unreliable label set
Because these values scale with training set size, alpha tends to produce extremely large or small per-sample weights, so many models did not have time to fully converge
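The alpha/(2-alpha) relation can be turned into per-sample weights as sketched below; the normalization (giving the unreliable set unit total weight and scaling the expert set) is our reading of the poster's definition, not a confirmed implementation detail:

```python
import numpy as np

def alpha_sample_weights(n_expert, n_unreliable, alpha):
    """Per-sample weights giving the expert label set alpha/(2-alpha) times
    the total weight of the unreliable label set.
    NOTE: the exact normalization is an assumption for illustration."""
    w_unreliable = 1.0 / n_unreliable               # unreliable set: unit total weight
    w_expert = (alpha / (2.0 - alpha)) / n_expert   # expert set scaled by alpha/(2-alpha)
    return np.full(n_expert, w_expert), np.full(n_unreliable, w_unreliable)
```

Because the set totals are fixed while set sizes change, the per-sample weights shrink or blow up as the expert set grows, which is consistent with the convergence issue noted above.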
Ratio Sampling
The Ratio Method does not improve the performance of entropy-based models, but it greatly improves the performance of KL-Divergence-based models.
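The KL-Divergence score underlying those models can be computed row-wise as below; which two distributions are compared (e.g., the model's prediction versus the unreliable label distribution) is our assumption for illustration:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis of two probability arrays.
    Clipping avoids log(0) for zero-probability entries."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)
```

The divergence is zero only when the two distributions agree, so it grows precisely where a model's prediction contradicts its reference distribution.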
Active Learning strategies tend to sample based on the informativeness of the instance pool, which can bias representation toward instances near cluster borders.
IBM Watson in the Watt CI
Clemson University