Active Learning: Smart Data Labeling with Machine Learning
October 10, 2021
Author: Tan Jamie
A typical supervised learning task requires sourcing a great deal of data to build complex models and boost predictive power. These desirable outcomes, however, will not be realized without accurately labelled data. In this article, I discuss the intuition behind active learning and the implementation of a supervised learning model coupled with an active learning algorithm to label data. Active learning optimizes the labelling process by combining the advantages of manual and automatic labelling. Arguably, it is a competitive strategy for producing reliable labels with minimal manpower.
Data Labelling
There are generally two approaches to labelling:
1. Manual Labelling
Label tags are attached to data points by a human: an in-house labeller, crowdsourced workers, or outsourced personnel. Given the massive volume of data, however, manual labelling can be time-consuming, costly, and difficult to coordinate.
2. Automatic Labelling
A separate ML model can be trained to understand raw data and output appropriate label tags. Relevant learning algorithms include supervised learning, unsupervised learning, and reinforcement learning. Admittedly, some manual work is still required to label a small portion of the data set if a supervised learning model is used. In general, automatic labelling significantly lessens the manual workload, but the quality of the output labels depends heavily on the labelling model's performance, which can be difficult to improve.
Active Learning Intuition
Active learning is an algorithm that prioritizes informative unlabelled training samples. These samples are extracted for labelling by a user (yes, a human). Subsequently, they are added back into the data set to retrain the labelling model. Priority scores for each data point are recalculated accordingly, and the cycle repeats until satisfactory model performance is achieved. The final model can then be applied to the remaining unlabelled data. Ultimately, we obtain a fully labelled training data set ready for use!
By intelligently selecting useful samples to label, active learning drastically reduces the manual work required. Furthermore, the curious nature of the algorithm helps produce a robust final classification model that provides labels for new unlabelled data. In this way, active learning attains the best of both manual and automatic labelling.
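The loop above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the project's code: the synthetic data stands in for real posts, the hidden labels play the role of the human oracle, and least confidence serves as the priority score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy pool of samples; y is "hidden" and only revealed when the oracle is asked
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# seed the labelled set with five samples from each class
labelled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(500) if i not in labelled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):  # five active-learning rounds
    model.fit(X[labelled], y[labelled])
    # priority score: least confidence = 1 - highest predicted probability
    probs = model.predict_proba(X[pool])
    priority = 1 - probs.max(axis=1)
    # "ask the human" to label the 10 most uncertain samples
    queried = [pool[i] for i in np.argsort(priority)[-10:]]
    labelled += queried
    pool = [i for i in pool if i not in queried]

print(len(labelled))  # 60 samples labelled after 5 rounds of 10
```

Each round retrains on everything labelled so far and re-scores the remaining pool, mirroring the cycle described above.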
Remarks:
1. Question! How can we quantify the informativeness of a data point and its priority score?
There are many ways! For instance, through calculating confidence of label prediction, margin, entropy, uncertainty, etc. More about these concepts can be found in Learn Faster with Smarter Data Labelling and Active Learning in Machine Learning.
2. Transfer learning applied here can further improve the labelling model’s performance!
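To illustrate the first remark, the common informativeness measures can all be computed from a model's predicted class probabilities. The probability vectors below are toy values:

```python
import numpy as np

def least_confidence(probs):
    # higher score = the top prediction is less confident = more informative
    return 1 - np.max(probs)

def margin(probs):
    # smaller gap between the two most likely classes = more informative
    top2 = np.sort(probs)[-2:]
    return top2[1] - top2[0]

def entropy(probs):
    # higher entropy = prediction spread across classes = more informative
    return -np.sum(probs * np.log(probs + 1e-12))

confident = np.array([0.9, 0.05, 0.05])
uncertain = np.array([0.4, 0.35, 0.25])
assert least_confidence(uncertain) > least_confidence(confident)
assert margin(uncertain) < margin(confident)
assert entropy(uncertain) > entropy(confident)
```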
Implementing Active Learning with Superintendent
The code illustrations in this section are drawn from a project brought together by the Omdena Singapore Chapter, titled Exploring the Impact of Covid-19 on Mental Health in Singapore using NLP. The project kicked off with scraping textual data from multiple sources such as Reddit, Twitter, and YouTube. Labelling came next. One of the goals of this project was to build a topic classifier, so our labels should be categories of some sort.
With more than 80K posts and limited time and manpower, it was unrealistic to manually label every post. Many of us in the team favored automatic labelling, so we attempted topic modelling using Latent Dirichlet Allocation (LDA). Unfortunately, the outcome of that endeavour was not meaningful.
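For context, topic modelling with LDA in scikit-learn looks roughly like the sketch below. The four-post corpus and the topic count are illustrative, not the project's actual data or settings:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus standing in for the scraped posts
posts = [
    "lockdown stress and anxiety at home",
    "vaccine rollout schedule announced today",
    "feeling anxious and stressed during lockdown",
    "new vaccine centres open this week",
]

# LDA models raw term counts, so vectorize with counts (not TF-IDF) first
counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)  # (4, 2): one topic distribution per post
```

The catch, as we found, is that the resulting topics are unlabelled word distributions that may not map onto meaningful categories.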
We wanted some certainty about the label tags while still completing the data labelling efficiently. Research led us to active learning, which we pursued. Keeping our goal in mind, we identified “Social Issue Discussion”, “Personal Stories”, “Advocacy”, “Factual Queries/Sharing” and “Others” as label tags. Ultimately, the labels, together with the textual data from posts scraped from social media, can train a topic classifier to categorize new posts into different groups.
Superintendent
We decided to use Superintendent after coming across the author’s presentation at PyData London 2019. It is a library that implements active learning on ML data and has many useful features that simplify code implementation. It also offers many illustrative code examples in its documentation, which we referenced heavily. This made the development process more beginner friendly.
Apart from Superintendent, other software can be brought in via Docker Compose to make the labelling workflow more cohesive. This includes:
- Voila – converts Jupyter notebooks into standalone applications. We used this simple front-end interface for labellers to read post contents (our data) then provide suitable tags.
- Adminer – a database management tool that was helpful for accessing the constantly updating data set, priority scores, etc.
A look into our data set for Active Learning
Description of the data set and its fields:
- title – Reddit post titles.
- selftext – Raw body text of Reddit posts, which human labellers read to comprehend post contents during labelling.
- clean_selftext – selftext that has undergone a cycle of data cleaning. This column contains inputs to train the labelling model.
Step 1: Customize TensorFlow & Voila Images
Docker images are files containing all the code required for a software program (e.g. Nginx, Redis, etc.) to run inside a Docker container. They act like templates and are fundamental blocks that people build on top of. In our case, more dependencies were required, so additional installation was performed on top of the original Docker images like so:
tensorflow.Dockerfile; the corresponding voila.Dockerfile is available on GitHub (source: Omdena)
# create a layer with TensorFlow Docker image and set it as the base image
FROM tensorflow/tensorflow:latest-py3

# install additional dependencies needed
RUN pip install superintendent
RUN pip install sqlalchemy==1.3.24

RUN mkdir /app
WORKDIR /app

# configure container
ENTRYPOINT ["python"]

# specify default command to run within an executing container
CMD ["app.py"]
We encountered:
- sqlalchemy.exc.ArgumentError, which was raised due to known clashes between SQLAlchemy v1.3 and v1.4. This was resolved by pinning v1.3 instead of v1.4 (note that pip installs the latest version by default if you do not specify one).
Step 2: Create docker-compose.yml
docker-compose.yml pulls the specified images and initializes them as containers. It also enables further customization, including volumes, which are important for data to persist. For example, ./postgres-data:/var/lib/postgresql/data ensures data stored in the Docker container at /var/lib/postgresql/data is similarly present in the local directory ./postgres-data. In our code, volumes were also used to make input data accessible in the container and to store trained labelling models, as seen below.
# declare the docker-compose version
version: "3.1"

# specify all services needed
services:
  db:
    image: postgres
    restart: always
    environment: &environment
      POSTGRES_USER: superintendent
      POSTGRES_PASSWORD: superintendent
      POSTGRES_DB: labeling
      POSTGRES_HOST_AUTH_METHOD: trust
    volumes:
      - ./postgres-data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  orchestrator:
    build:
      context: .
      dockerfile: tensorflow.Dockerfile
    restart: always
    depends_on:
      - "db"
    environment: *environment
    entrypoint: python /app/orchestrate.py
    volumes:
      - ./orchestrate.py:/app/orchestrate.py
      - ./aldata.csv:/app/aldata.csv
      - ./models:/app/models
Shortened docker-compose.yml, full code is available on GitHub (source: Omdena)
We encountered:
- Error: Database is uninitialized and superuser password is not specified, which required us to set a password for the database. However, that did not resolve the error, so POSTGRES_HOST_AUTH_METHOD: trust was added. This allows connections to the database without any authentication. Consequently, it makes the database insecure, so it is not recommended where sensitive data is concerned.
- Syntax errors, which were debugged with the help of a YAML Validator.
Step 3: Build orchestrate.py
orchestrate.py acts as the brain of the web application, and thus runs repeatedly until docker-compose is terminated. It sets up a connection with the database, then reads and stores the intended input data. When all the required components are running, relevant samples are identified for labelling using active learning strategies. After users provide a suitable label tag, it is updated in the database. Subsequently, these labelled data points are pre-processed before they are fed to the model for training. The model is then evaluated to inform users of its reliability. To adapt the above functions to our task, we added some features to our code, as described below.
Code data pre-processor (where relevant)
- title and clean_selftext columns were combined to be used as inputs for training the labelling model.
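A minimal sketch of that pre-processing step with pandas; the rows are made up, but the column names follow the data set description above:

```python
import pandas as pd

# toy rows mirroring the title and clean_selftext fields
df = pd.DataFrame({
    "title": ["Feeling burnt out", "Where to get tested?"],
    "clean_selftext": ["wfh has been exhausting", "need a test before travel"],
})

# combine post title and cleaned body text into a single training input
df["model_input"] = df["title"] + " " + df["clean_selftext"]
print(df["model_input"].iloc[0])  # Feeling burnt out wfh has been exhausting
```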
Build the labelling model
- 2 models were built for experimentation, namely, a baseline logistic regression model (using sklearn) and a simple neural network model (using Keras).
- A pipeline was created so data is vectorized before it is fed to the models.
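A sketch of such a vectorize-then-classify pipeline, in the spirit of our scikit-learn baseline; the example texts and the choice of TF-IDF vectorizer are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# one pipeline object the active-learning loop can re-fit on
# every newly labelled batch: vectorize, then classify
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

texts = [
    "discussion about new covid rules",
    "my personal lockdown story",
    "sign our petition for better support",
    "my own story about quarantine",
]
labels = [
    "Social Issue Discussion",
    "Personal Stories",
    "Advocacy",
    "Personal Stories",
]
baseline.fit(texts, labels)
print(baseline.predict(["a story from my lockdown"]))
```

Bundling the vectorizer into the pipeline means retraining on new labels never leaks vocabulary from unlabelled data.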
Establish a method to evaluate the model
- Our evaluation function calculates the cross validation score for labelling models using scikit-learn’s cross_val_score().
- The function was also edited to load newly trained models into the volume created every time these models were evaluated. Conveniently, the saved model files are named “<timestamp when the model is saved> <cross validation score>.pkl” for easy identification and tracking.
Function to evaluate trained models; full code is available on GitHub (source: Omdena)
import os
import pickle
import time

from sklearn.model_selection import cross_val_score

def evaluate_model(model, x, y):
    # calculate cross validation score
    score = cross_val_score(model, x, y, scoring="accuracy", cv=3)
    # create directory /app/models if it is not already present
    if not os.path.isdir("/app/models/"):
        os.mkdir("/app/models")
    # design filename: "<timestamp when the model is saved> <cross validation score>"
    save_time = time.asctime(time.localtime(time.time()))
    filename = str(save_time) + " {}".format(score)
    # save trained model at the specified path using the filename designed above
    filepath = "/app/models/{}".format(filename)
    pickle.dump(model, open(filepath, "wb"))
    return score
Connect code to the database previously set up in the YAML file
- Utilized the psycopg2 driver for PostgreSQL. Other Database Management Systems (DBMS) can also be used with the application; however, the drivers and database URL formats will vary. More about this in the SQLAlchemy documentation.
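As a sketch, SQLAlchemy database URLs follow a dialect+driver://user:password@host:port/database scheme; the values below mirror the docker-compose environment above, and swapping the driver segment is what changes when another DBMS is used:

```python
# build a SQLAlchemy-style database URL for PostgreSQL via psycopg2
user, password, host, port, db = "superintendent", "superintendent", "db", 5432, "labeling"
url = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}"
print(url)  # postgresql+psycopg2://superintendent:superintendent@db:5432/labeling
```

Note that the host is "db", the service name given to PostgreSQL in docker-compose.yml, since containers on the same Compose network address each other by service name.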
Create a widget to coordinate the active learning
Read in data
Configure a setting to train at some time interval you specify
Step 4: Design voila-interface.ipynb
The Jupyter notebook is used to style the front-end interface of the web application. For our task, we made a simple interface that displays post contents (the combined title and selftext), label tag options, and a progress bar.
Step 5: Run code on the local machine
- Ensure all files required are present in the same folder, and on the same directory level
- Open a command prompt and navigate into the folder
- Run docker-compose up
- With this, our web application became accessible at the ports specified in the YAML file. For instance, if the voila notebook uses port 8686, the front-end interface will run on http://localhost:8686. Take note to use HTTP, not HTTPS.
Deploying Code for Mass Labelling
Cool! We managed to run our scripts using docker-compose on a local machine. But really, it was not useful yet, because the application was not accessible to others for distributed labelling. Thus, this section records a step-by-step illustration of how our code was deployed to AWS EC2 using a free-tier account.
Step 1: Launch an AWS EC2 Instance
Step 2: Connect to the EC2 Instance
In a terminal, navigate to the directory containing the .pem key pair file. Then connect to the instance using the commands below. For Mac:
- chmod 400 omdena-sg-sample.pem to change access permission
- ssh -i omdena-sg-sample.pem ubuntu@<public-DNS> to connect
- omdena-sg-sample.pem is our key pair file
- ubuntu@<public-DNS> combines the default Ubuntu user with our instance’s public DNS
For Windows:
We will need PuTTY. It works like the SSH command but is developed specifically for Windows. Instructions to set up PuTTY are linked in the references below.
Step 3: Install Docker & Docker Compose on Ubuntu
For Docker and Docker Compose to run in Ubuntu, they have to be installed. The steps to do so are extensively explained and illustrated in the articles on DigitalOcean below.
Docker and Docker Compose are installed correctly if sudo docker ps returns an empty table and docker-compose --version returns the installed version of docker-compose.
Step 4: Clone Code from GitHub on Ubuntu
- Ensure code files are uploaded onto a new repository, and in the same level directory
- git clone https://github.com/<username>/<repo name>.git to clone
- Navigate into the code folder using cd <repo name>
Step 5: Finally…
Run sudo docker-compose up! This will take some time; give it a few minutes. After that, the web application can be accessed by replacing localhost with the public DNS provided by the EC2 instance.
Remarks
- If the latency of the app is high (it responds slowly), consider rebooting the EC2 instance
- Remember to terminate the instance on AWS when the application is no longer in use
Closing
Overall, the experience consisted of researching active learning, implementing it using Superintendent and Docker Compose, and finally deploying it. In the end, the team managed to run the active learning program and successfully label some data. However, this was not fully implemented in the project, primarily due to tight deadlines; the team also had difficulty agreeing on which label tags to use.
The complete source code on GitHub: https://github.com/OmdenaAI/omdena-singapore-covid-health/tree/main/src/tasks/task-3-1-data-labelling
Future Enhancements
- Research more about suitable label tags that minimize biases induced.
- Explore other models to label data points, such as SVM, RNN, and BERT.
- Study possible remedies for the effects of imbalanced data during model training.
- Try out other resources available online like Label Studio, modAL, libact, and ALiPy.
*** *** ***
A big thank you to Omdena for the opportunity to work on the Singapore Local Chapter project. Also, a shout-out to all the collaborators who contributed to the project and to my fellow enthusiastic teammates, Rhey Ann Magcalas and Chan Jiong Jiet, who were a great help in exploring active learning.
*** *** ***
References
- YAML Validator. (2021, May 22). Code Beautify. https://codebeautify.org/yaml-validator
- Engine Configuration — SQLAlchemy 1.4 Documentation. (2021, August 18). SqlAlchemy. https://docs.sqlalchemy.org/en/14/core/engines.html
- Freyberg, J. F. (2018, October 25). Superintendent documentation — superintendent 0.5.3 documentation. Superintendent. https://superintendent.readthedocs.io/en/latest/index.html
- Heidi, E. (2021, August 16). How To Install and Use Docker Compose on Ubuntu 20.04. DigitalOcean. https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-compose-on-ubuntu-20-04
- Hogan, B. (2021, August 13). How To Install and Use Docker on Ubuntu 18.04. DigitalOcean. https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-18-04
- Jan Freyberg: Active learning in the interactive python environment | PyData London 2019. (2019, July 18). [Video]. YouTube. https://www.youtube.com/watch?v=W2bJH0iXTKc
- Liubimov, N. (2020, January 28). Learn faster with smarter data labeling – Towards Data Science. Medium. https://towardsdatascience.com/learn-faster-with-smarter-data-labeling-15d0272614c4
- Mishra, A. (2019). Machine Learning in the AWS Cloud: Add Intelligence to Applications with Amazon SageMaker and Amazon Rekognition. Amazon. https://aws.amazon.com/sagemaker/groundtruth/what-is-data-labeling/
- Overview of Docker Compose. (2021, August 27). Docker Documentation. https://docs.docker.com/compose/
- Free Images: rope, leather, string, metal, black, label, design, tag, fashion accessory. (2021). PxHere. https://pxhere.com/en/photo/902909 [Accessed 29 August 2021].
- Solaguren-Beascoa, A., PhD. (2020, April 4). Active Learning in Machine Learning – Towards Data Science. Medium. https://towardsdatascience.com/active-learning-in-machine-learning-525e61be16e5
- Stud, P. (2021, May 17). How to use PuTTY to connect to your Amazon EC2 instance? Medium. https://positive-stud.medium.com/how-to-use-putty-to-connect-to-your-amazon-ec2-instance-96468ce592c1