Unemployment is an important factor that impacts individuals, communities, regions, and the nation's economy. During the pandemic, Nigeria's unemployment rate rose from 27.1 percent in the second quarter of 2020 to 33.3 percent in the fourth quarter, while the underemployment rate stood at 28.7 percent, bringing the reported share of Nigerians who are unemployed or underemployed to 61.9 percent.
Through this project, Omdena Nigeria (Lagos Chapter) aims to create a machine learning model that matches National Youth Service Corps (NYSC) graduates with job opportunities suited to their career path and recommends employable job skills.
Figure 1 describes the pipeline for the Omdena Nigeria "Solving Unemployment in Nigeria using Artificial Intelligence" project. The steps (data gathering, data pre-processing, data visualization, and machine learning) are briefly explained in the following sections.
The first step in the project was to gather relevant data to create a machine learning model and recommend appropriate job skills for the unemployed. Below are the steps and process of data gathering, also known as data collection.
The uniqueness of the recommender system we aimed to build required us to gather and extract datasets different from the conventional ones. A conventional recommender system would use a user ratings dataset to build a collaborative filtering recommender, and user or item profile datasets to build a content-based recommender. Our task, however, makes no use of user ratings: we only have the profile of the person the job is to be recommended to and the jobs to be recommended. This requirement led us to gather data from several platforms, as highlighted below.
To collect the data, we first determined what type of data the problem statement requires. Since the goal is finding employment information, we needed employment data from job listing pages and lists of graduates. Table 1.0 shows the pages from which we collected the data.
Table 1.0. Source of data collection
We selected LinkedIn as the job-finding website to get the list of available jobs in Nigeria to create the database. We scraped the job title information using BeautifulSoup and requests which are libraries in Python.
The extraction from LinkedIn proceeded as follows:
1. Filter for jobs open to Nigerians on LinkedIn.
2. Collect the unique URLs of the jobs.
3. Loop through these unique job URLs and collect the entire content of each job page.
4. Filter the content for important information.
5. Convert relative dates ("3 days ago", "x weeks ago") to DD/MM/YYYY format.
6. Collate the data into a DataFrame.
7. Output the collated data to a CSV file.
NB: these extraction steps have been automated into a Python script with logging for future use. In addition, the graduate profile dataset was developed by converting members' CVs to text files.
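The later steps of this loop can be sketched in plain Python. The snippet below covers steps 5 to 7 (date conversion and collation); steps 1 to 4 fetch and parse each page with requests and BeautifulSoup as described above, so here we start from already-parsed rows. The column names and the "month ≈ 30 days" approximation are illustrative assumptions, not the exact script.

```python
# Sketch of steps 5-7: convert relative dates, then collate rows to CSV.
# Steps 1-4 (fetching/parsing each job page with requests + BeautifulSoup)
# are assumed to have produced the `rows` dictionaries below.
import csv
import re
from datetime import date, timedelta

def relative_to_date(text, today=None):
    """Convert '3 days ago' / '2 weeks ago' style strings to DD/MM/YYYY."""
    today = today or date.today()
    match = re.search(r"(\d+)\s+(day|week|month)s?\s+ago", text)
    if not match:                       # unrecognised phrase: assume today
        return today.strftime("%d/%m/%Y")
    n, unit = int(match.group(1)), match.group(2)
    days = {"day": 1, "week": 7, "month": 30}[unit]   # month approximated
    return (today - timedelta(days=n * days)).strftime("%d/%m/%Y")

def collate_to_csv(rows, path):
    """Write the scraped job rows to a CSV file (step 7)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

rows = [{"Job title": "Data Analyst",
         "Date posted": relative_to_date("3 days ago")}]
collate_to_csv(rows, "jobs.csv")
```

In the actual script the rows come from the scraping loop and are collated into a pandas DataFrame before export; plain dictionaries are used here to keep the sketch self-contained.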
We scraped the data based on the following categories, as shown in Figure 1.0.
The data collected is unstructured, so the next stage is data preprocessing.
Data preprocessing plays an important role in the workflow: the data must be transformed into a form the computer can understand and work with.
Data preprocessing involves manipulating or dropping data before it is used, to ensure or enhance performance. It is unarguably one of the most vital tasks in building a machine learning model: the dataset should be cleaned and wrangled enough to rid it of irrelevancies and redundancies, reducing noise and improving the model's output once all the necessary machine learning techniques have been applied.
The derived output from the previous data gathering process is unstructured and needs a lot of cleaning and wrangling, performed in a step-by-step process. The raw data collected was filtered, sorted, processed, analyzed, stored, and then presented in a readable format by checking for errors, duplication, miscalculations, or missing data.
The first raw dataset gathered was a list of jobs available in Nigeria, scraped from LinkedIn, with a preview in Fig 2.0 below.
Scanning through this dataset, you can notice a lot of irrelevance and noise that needs deletion. We started by dropping unnecessary columns ('Date posted' and 'LinkedIn link'), then inspected the rest of the dataset: checking for null values, renaming columns, and replacing misspelled words, sentences, and characters.
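A condensed sketch of this cleaning pass is shown below. The column names mirror those mentioned in the text; the toy rows and the misspelling are invented for illustration.

```python
# Illustrative cleaning pass over the scraped job listings (toy data).
import pandas as pd

jobs = pd.DataFrame({
    "Job title": ["Data Analyst", "Acountant"],          # deliberate typo
    "Location": ["Lagos, Nigeria", "Abuja, Nigeria"],
    "Date posted": ["05/03/2021", "06/03/2021"],
    "LinkedIn link": ["https://...", "https://..."],
})

# Drop the columns the recommender does not need.
jobs = jobs.drop(columns=["Date posted", "LinkedIn link"])

# Check for null values, then replace misspelled words.
assert jobs.isnull().sum().sum() == 0
jobs["Job title"] = jobs["Job title"].replace({"Acountant": "Accountant"})

# Keep only the state name in the 'Location' column.
jobs["Location"] = jobs["Location"].str.split(",").str[0].str.strip()
```

The real dataset of course needed many more replacements, but each one follows the same `drop`/`isnull`/`replace` pattern.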
The 'Location' column needed a little work; it was cleaned down to the specific states in which the available jobs are. A new column, 'Degree Type', was needed to hold the particular types of degree an applicant can hold for a specific job. To create it, we listed the possible degree types in Nigeria and extracted them from the 'Job Description' column, which holds all the necessary information about each job. The 'findall' method came in handy and got the job done, as displayed in Fig 1.1 below.
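The extraction with `re.findall` can be sketched as follows; the degree list here is a small illustrative subset of the Nigerian degree types the team compiled, and the job descriptions are invented examples.

```python
# Extract degree types from the 'Job Description' column with re.findall.
# The degree list is an illustrative subset, not the team's full list.
import re
import pandas as pd

degree_types = ["B.Sc", "BSc", "HND", "OND", "M.Sc", "MSc", "MBA", "PhD"]
pattern = r"\b(?:" + "|".join(re.escape(d) for d in degree_types) + r")\b"

jobs = pd.DataFrame({
    "Job Description": [
        "Applicants must hold a BSc or HND in Accounting.",
        "An MBA is an advantage; a B.Sc in Economics is required.",
    ],
})

# findall returns every degree mentioned in each description.
jobs["Degree Type"] = jobs["Job Description"].apply(
    lambda text: ", ".join(re.findall(pattern, text))
)
```

Joining the matches with commas keeps the new column a flat string, which is convenient for the combined corpus built later.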
The next step added a new column, 'Skills', specific to each job type, because the recommender system matches available jobs with the skills an applicant must possess to qualify for them. After team members researched the skills required for each career, we manually set up a new dataset, previewed below in Fig 4.0.
The manually developed dataset was merged with the existing dataset using the merge method, which automatically placed it in a 'Skills' column; at that point our project was halfway complete. A preview of the merged dataset is shown below in Fig 5.0.
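A minimal sketch of that merge, assuming the two datasets share a 'Job title' key (the actual key column is an assumption):

```python
# Merge the manually-built skills dataset into the job listings on a
# shared 'Job title' key (toy data; the key column is an assumption).
import pandas as pd

jobs = pd.DataFrame({
    "Job title": ["Data Analyst", "Accountant"],
    "Location": ["Lagos", "Abuja"],
})
skills = pd.DataFrame({
    "Job title": ["Data Analyst", "Accountant"],
    "Skills": ["SQL, Excel, Python", "Bookkeeping, Excel"],
})

# A left merge keeps every job listing even if no skills row matches.
merged = jobs.merge(skills, on="Job title", how="left")
```

Using `how="left"` preserves listings with no skills entry, which can then be spotted as nulls in the new column.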
Having completed the cleaning and wrangling process, we had to convert the dataset into values the computer understands and can work with: a Natural Language Processing problem. We accomplished this with the Term Frequency – Inverse Document Frequency (TF-IDF) vectorizer, which builds on a bag-of-words representation and captures which words in our dataset are more and less relevant.
Term Frequency (TF) measures how often a word w occurs in a document d. It is the ratio of the word's occurrences in the document to the total number of words in the document:

TF(w, d) = (number of occurrences of w in d) / (total number of words in d)

Inverse Document Frequency (IDF) measures the importance of a word, giving more weight to words that are rare across the corpus D. With N documents in the corpus, of which n_w contain the word w, it is defined as:

IDF(w, D) = log(N / n_w)

TF-IDF is the product of the two, TFIDF(w, d, D) = TF(w, d) × IDF(w, D), so it assigns the highest weight to words that are frequent in a document but rare in the corpus.
The corpus, in this sense, was the whole dataset combined into a single column named 'combined_corpus'. Common words (such as 'the', 'is', 'are') had to be removed to improve our model's performance, using the built-in nltk stopwords list. A few irrelevant words specific to our dataset were appended to the list manually.
After fitting and transforming with the TF-IDF vectorizer, each word in combined_corpus was represented by a number between 0 and 1. The array shown in Figure 6.0 below was obtained, giving us the input needed to build the Nearest Neighbors and Cosine Similarity algorithms.
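The vectorization step can be sketched with scikit-learn. The two-document corpus is a toy stand-in, and sklearn's built-in English stopword list is used here instead of nltk's so the snippet runs without an extra download.

```python
# Minimal sketch of the TF-IDF vectorization step (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the combined_corpus column described above.
combined_corpus = [
    "data analyst sql excel python lagos",
    "accountant bookkeeping excel abuja",
]

# The article used the nltk stopword list; sklearn's built-in English
# list is substituted here to keep the snippet self-contained.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(combined_corpus)
# Each row is one document; each entry is a TF-IDF weight between 0 and 1.
```

The resulting sparse matrix is exactly the kind of array Figure 6.0 previews, with one row per document and one column per vocabulary word.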
We created a word cloud visual of the locations of the job listings, which gave us an intuition of which states in Nigeria had more job postings and openings, as shown in the figure. As expected, Lagos State has the highest count, signifying that it had the most job openings and offers in the country.
In Fig 8, we also looked at skills. Among the soft skills, management and communication ranked as top priorities; among the hard skills, Microsoft Office and analytics tools were of the greatest importance.
In Figure 9, we focused on the type of degree necessary for job qualification. We found that tertiary education was the dominant level of educational qualification required for jobs.
The Machine Learning Model
The aims of this step were to:
- Build a recommender system equipped with the capacity to match graduates with job listings based on their tertiary education and skills.
- Suggest relevant skills that are in demand based on the word cloud visualization from the job listings skills dataset.
There are three major techniques used for building recommendation systems.
| Technique | Description |
| --- | --- |
| Content-based filtering | Makes recommendations based on similarity between item content and a user's profile; for example, recommending a movie whose description matches the user's stated preferences. |
| Collaborative filtering | Uses similarities between known users' preferences to make recommendations for new users. |
| Hybrid model | Combines content-based and collaborative filtering techniques. |
Table 2.0. Methods of building a recommender system
Our approach uses content-based filtering, as we only had access to public job posting datasets from different sites. After feature encoding with TfidfVectorizer, we built two algorithms: K-Nearest Neighbors and cosine similarity.
K-Nearest Neighbors Algorithm
Nearest neighbors is a simple distance-based technique that relates data points by how close they are to one another. As shown in Fig 10.0, points with the shortest distances between them group together; for our purposes, given a query vector, the algorithm returns the k stored vectors with the smallest distances to it.
We used nearest neighbors to test the efficiency of our model, with a CV (Curriculum Vitae) dataset as the test set. The algorithm was programmed to loop through each vectorized job posting and return the Euclidean distance between each job vector and the vectorized test CV, along with the index of each job posting, to track which job returned the shortest distance.
The formula for the Euclidean distance between two vectors a and b is:

d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )
It is important to note that the nearest neighbors' k value was set to 5, signifying that we only needed the top 5 jobs most similar to a user with respect to the skills and contents of their curriculum vitae.
Fig 11.0 shows a screenshot of the function that uses nearest neighbors to return the shortest distances between a test user and the vectorized job postings in the job listings dataset.
The distances returned are always greater than or equal to zero. The smaller the distance between two vectors, the more similar they are and the more likely the job is to be recommended; the larger the distance, the less similar the vectors and the less likely a recommendation.
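The matching step can be sketched with scikit-learn's NearestNeighbors. The job texts and CV below are toy examples, and k is set to 3 rather than the article's 5 only because the toy corpus is small.

```python
# Fit NearestNeighbors on vectorized job postings and query with a CV.
# Toy texts; the real pipeline uses the scraped job listings dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

job_corpus = [
    "data analyst sql excel dashboards",
    "accountant bookkeeping audit excel",
    "software engineer python git apis",
    "nurse patient care clinical records",
    "data scientist python statistics machine learning",
]
vectorizer = TfidfVectorizer()
job_vectors = vectorizer.fit_transform(job_corpus)

# k was 5 in the article; k=3 here because the toy corpus is tiny.
model = NearestNeighbors(n_neighbors=3, metric="euclidean")
model.fit(job_vectors)

# Vectorize the test CV with the SAME fitted vectorizer, then query.
cv_vector = vectorizer.transform(["python machine learning statistics data"])
distances, indices = model.kneighbors(cv_vector)
```

`indices` holds the positions of the k closest job postings and `distances` their Euclidean distances, mirroring the index tracking described above.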
Cosine similarity measures the cosine of the angle between two texts or documents, comparing them on a normalized scale: it is found by taking the cosine of the angle between their two vectors. The angle theta between a job vector and a CV vector decides their closeness, and for non-negative TF-IDF vectors the cosine ranges from 0 to 1. A value close to 1 means the documents are highly similar, while a value close to 0 means they are barely similar at all. A job is recommended when its score is close to 1; otherwise there is little similarity between the job and the CV. The algorithm therefore recommends the best jobs to the user based on cosine similarity.
To test the efficiency of our model, we used a CV (Curriculum Vitae) dataset as the test set, transformed with the same TfidfVectorizer used for feature encoding. We then used cosine similarity to compute the angular closeness between the training data and the test data, as shown in Fig 12.0.
The values returned by this algorithm range from 0 to 1: values closer to 0 signify low similarity between the two vectors, while values closer to 1 signify strong similarity.
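A minimal sketch of that scoring step, again with toy texts standing in for the real job listings and CV:

```python
# Score jobs against a CV with cosine similarity and rank them.
# Toy texts; values near 1 mean strong similarity, near 0 weak.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_corpus = [
    "data analyst sql excel dashboards",
    "accountant bookkeeping audit excel",
    "software engineer python git apis",
]
vectorizer = TfidfVectorizer()
job_vectors = vectorizer.fit_transform(job_corpus)

# Vectorize the test CV with the same fitted vectorizer.
cv_vector = vectorizer.transform(["python software apis git"])
scores = cosine_similarity(cv_vector, job_vectors)[0]

# Rank job indices from most to least similar.
ranking = scores.argsort()[::-1]
```

Taking the top entries of `ranking` gives the recommended jobs, exactly as the k shortest distances did for the nearest neighbors model.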
We were able to create a recommender system using nearest neighbors and cosine similarity, but we had challenges collecting enough relevant data for optimal model performance. For the best results, we suggest training on more data.