The effects of a changing climate are already visible: droughts, reduced harvests, destruction of critical infrastructure, and displacement of communities. Situations like these are forcing countries to rethink the sustainability of their economic models (as reflected in the Paris Agreement) and to put a value on nature that goes far beyond money.
Sustainability means providing for present needs without compromising the essentials of future generations. Moving towards a circular economy (reuse and recycle) therefore adds brand value and competitive advantages for corporations: new business opportunities, talent attraction, tax incentives and subsidies, and more. Conversely, a do-nothing approach can mean a big loss in the future for any type of business.
Progress in machine learning (ML) has largely been driven by improving models on standard benchmark datasets, even though data preparation is estimated to account for 80% of data scientists' time. The conventional wisdom has been to aggregate all available data and develop a model good enough to deal with the noise. But using large volumes of low-quality data demands more infrastructure and energy, which is not always feasible for small companies and also carries an environmental impact.
Recently, more emphasis has been put on a data-centric AI approach, or "Good data". In fact, Andrew Ng, a leading ML technologist, has launched a campaign in which he states that improving the quality of existing data can be as effective as doubling the training set with new samples.
Brain Pool Tech, the project organiser, is a technology integration company developing expert and machine learning solutions in the areas of resilience and sustainability. They joined Omdena's incubator for impact startups.
The delivered project consisted of predicting turf health for a golf course, using drone-derived multispectral and thermal data (from April to October) and cross-referencing the results with sensor data (already part of the course infrastructure). The purpose of this article is therefore:
- To explain the importance of assessing golf courses’ turf health and its impact on water preservation.
- To elaborate on why a data-centric AI approach, or "Good data", applied to a basic model is better than chasing state-of-the-art models with low-quality data.
- To present some Exploratory Data Analysis (EDA) methods and describe how they were used to tackle this challenge.
What is golf course maintenance and why is it important?
Golf course maintenance comprises the activities required to keep course resources in good condition. This includes mowing and chemically treating the turf, inspecting irrigation systems, checking water quality and quantity, and more.
Keeping the facility running smoothly while prioritising sustainability is not an easy task, as resources like water are costly. Freshwater accounts for only about 3% of the water on Earth, and most of it is locked away in ice, glaciers, and groundwater; usable water is already a scarce resource in many parts of the world. Furthermore, the amount of this precious liquid required to irrigate a course depends on the turf type, climate, rainfall, and soil.
Golf course water irrigation systems transport water, in a controlled fashion, from source to a desired area whether that is a green, tee or fairway. It is mainly composed of pumps, pipes, valves, control cables, sprinklers and computers.
Pumps pull the water from the source and push it along a pipe network. The source can be groundwater, potable water, surface water (rivers, lakes and streams) or effluent. Meanwhile, the control cables switch taps and valves on and off to disperse water via sprinklers. A golf course can have between 500 and 5,000 sprinklers installed throughout an 18-hole facility, each individually controlled so that it can be accurately timed.
What is a data-centric AI approach or “Good data”?
A data-centric AI approach, or "Good data", means systematically improving datasets so that high-quality data is available in all phases of the ML project life cycle, in order to increase performance.
“Good data” is:
• Defined consistently
A large amount of badly labelled data leads to lower accuracy than fewer but accurately labelled samples. To keep labelling consistent, clear instructions have to be provided, and it is also important to involve domain experts who can spot subtle discrepancies.
• Sized appropriately
A huge number of samples is not essential: good performance can be attained with a small, high-quality dataset.
• Covers important cases
The data has to clearly illustrate the concepts that the ML model needs to learn. Here subject matter experts can contribute with their knowledge. It is also important to check if the data is compatible with the structure that the algorithms assume from it.
EDA is a methodology for analysing datasets to summarise their main characteristics (e.g. cluster analysis, box plots). It thus provides a foundation for further pertinent data collection and for suggesting hypotheses to test rather than to confirm.
Some techniques for collecting and tracking changes in data:
- Data augmentation: increasing the number of relevant data points, e.g. by interpolation or extrapolation, or by generating synthetic data.
- Feature engineering: adding features, derived from the input data, that do not exist in its raw form.
- Data versioning: to track changes (additions and deletions) on the dataset. This ensures the reproducibility and reliability of models.
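As a concrete illustration of feature engineering in this project's setting, a vegetation index can be derived from raw spectral bands. The sketch below computes NDVI from hypothetical near-infrared and red reflectance arrays; the band values are made up for the example.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalised Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(float)
    red = red.astype(float)
    denom = nir + red
    denom[denom == 0] = np.nan  # avoid division by zero on empty pixels
    return (nir - red) / denom

# Toy reflectance values for a 2x2 pixel patch (hypothetical numbers)
nir_band = np.array([[0.8, 0.6], [0.4, 0.2]])
red_band = np.array([[0.1, 0.2], [0.3, 0.2]])
print(ndvi(nir_band, red_band))
```

The derived index, rather than the raw bands, is what the thresholding and clustering steps later in this article operate on.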
• Has timely feedback from production data
During the production stage of the ML project life cycle, the model will encounter data that differs from the training set. An iterative process for evaluating model quality therefore helps the team react in time to data distribution drift and concept drift.
How was EDA used in the project?
A. What is a Quartile? How was it used to spot unhealthy zones?
In descriptive statistics, a quartile is a type of quantile (a cut point) which divides the observations within a dataset into four parts of nearly equal size. The main quartiles are:
- First quartile (Q1): the middle value between the minimum and the median (Q2) of the dataset. 25% of the data lies below this point.
- Second quartile (Q2): the median of the dataset. 50% of the data lies below it.
- Third quartile (Q3): the middle value between the median (Q2) and the maximum of the dataset. 75% of the data lies below it.
These quartiles, along with the minimum and maximum numbers of the dataset, can be clearly visualised using a Box plot (graphical technique used in EDA). All this information is useful to describe the centre, the dispersion, and the presence of outliers within the dataset.
Finally, quartiles can be used to estimate the Interquartile Range (IQR), which is a measure of statistical dispersion often used to find outliers.
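The quartile and IQR computations above can be sketched in a few lines of numpy; the sample values below are made up for illustration, with one deliberately extreme point.

```python
import numpy as np

# Toy measurements with one obvious outlier (hypothetical values)
values = np.array([3.1, 3.4, 3.5, 3.7, 3.9, 4.0, 4.2, 9.8])

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional outlier fences: 1.5 * IQR beyond Q1 and Q3
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(q1, q2, q3, iqr, outliers)  # 9.8 falls outside the upper fence
```

These are exactly the quantities a box plot visualises: the box spans Q1 to Q3, the line inside it is Q2, and points beyond the fences are drawn individually as outliers.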
Regions of Interest (ROIs) were identified by applying this method at the pixel level of the drone images. Quartiles were computed on the NDVI (Normalised Difference Vegetation Index), DEM (Digital Elevation Model), and thermal images. All results were appraised against the information provided by the golf course superintendent.
• Unhealthy: all NDVI values below Q3 were identified as being in poor health.
• Water-stressed: all thermal values above Q3 were recognised as regions lacking water.
Figure 6 shows the identified water-stressed areas, which correspond to the dark red regions on the thermal image.
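The per-image thresholding rules above reduce to a one-line percentile comparison. The sketch below applies them to synthetic stand-in rasters (random arrays, not the project's actual drone data) to produce boolean "unhealthy" and "water-stressed" masks.

```python
import numpy as np

def q3_mask(img: np.ndarray, above: bool) -> np.ndarray:
    """Boolean mask of pixels above (or below) the image's third quartile."""
    q3 = np.percentile(img, 75)
    return img > q3 if above else img < q3

rng = np.random.default_rng(0)
ndvi_img = rng.uniform(0.0, 1.0, size=(100, 100))       # stand-in NDVI raster
thermal_img = rng.uniform(20.0, 40.0, size=(100, 100))  # stand-in thermal raster

unhealthy = q3_mask(ndvi_img, above=False)          # NDVI below Q3
water_stressed = q3_mask(thermal_img, above=True)   # thermal above Q3
print(unhealthy.mean(), water_stressed.mean())      # fractions of flagged pixels
```

On real imagery the resulting masks can be overlaid on the RGB orthomosaic to highlight candidate areas for inspection.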
• Waterlogged: to find flooded regions, the three types of images were used.
- NDVI: all values below Q3.
- Thermal: all values below or equal to the mean.
- DEM: manual work to identify topographically susceptible regions (e.g. areas at the bottom of a slope are prone to waterlogging).
Then, by combining all three criteria (to gain higher confidence), the commonly affected regions were identified as waterlogged. Figure 7 shows the identified waterlogged areas on Fairway 2, which correspond to the noticeable brown regions displayed on the RGB image.
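Combining the three criteria amounts to a logical AND of boolean masks. A minimal sketch, using synthetic arrays in place of the real NDVI, thermal, and DEM-derived data:

```python
import numpy as np

rng = np.random.default_rng(1)
ndvi = rng.uniform(0.0, 1.0, size=(50, 50))      # stand-in NDVI raster
thermal = rng.uniform(20.0, 40.0, size=(50, 50)) # stand-in thermal raster
# Stand-in for the manually derived DEM susceptibility mask
dem_susceptible = rng.random((50, 50)) < 0.3

ndvi_low = ndvi < np.percentile(ndvi, 75)   # NDVI below Q3
thermal_cool = thermal <= thermal.mean()    # thermal at or below the mean
waterlogged = ndvi_low & thermal_cool & dem_susceptible
print(waterlogged.sum(), "candidate waterlogged pixels")
```

Because every criterion must hold simultaneously, the combined mask is necessarily a subset of each individual mask, which is what makes the intersection a higher-confidence signal.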
B. What is cluster analysis? How was it used to spot unhealthy zones?
Cluster analysis is a main task of EDA. It is used in many fields, including machine learning, to investigate the structure of data and discover its natural groupings. In other words, it finds points that are similar to each other, which is very useful for feature engineering. There are various algorithms, each with a different notion of what a cluster is (see Figure 8). Typical cluster models include:
- Centroid-based: the k-means algorithm, in which each observation belongs to the cluster with the nearest centroid (mean).
- Density-based: the Density-based spatial clustering of applications with noise (DBSCAN) algorithm, which defines clusters by density.
Both algorithms, k-means and DBSCAN, were tested on NDVI and thermal data as well as slope (calculated from the DEM), but the slope feature yielded unsatisfactory results. Slope was used to account for the different natures of flat regions as opposed to steeper ones, e.g. proneness to waterlogging.
All results were inspected against the information provided by the golf course superintendent.
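A minimal sketch of the two clustering approaches, using scikit-learn on synthetic per-pixel features (random stand-ins for the flattened NDVI and thermal values, standardised before clustering); the cluster counts and DBSCAN parameters here are illustrative choices, not the project's tuned values.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in per-pixel features: NDVI and thermal values, flattened to 1-D
ndvi = rng.uniform(0.0, 1.0, 500)
thermal = rng.uniform(20.0, 40.0, 500)

# Standardise so both features contribute on the same scale
X = StandardScaler().fit_transform(np.column_stack([ndvi, thermal]))

# Centroid-based: each pixel joins the cluster with the nearest mean
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Density-based: clusters are dense regions; label -1 marks noise points
dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

print(np.bincount(kmeans_labels))  # k-means cluster sizes
n_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(n_dbscan, "DBSCAN clusters")
```

Note the design difference: k-means requires the number of clusters up front and assigns every pixel, while DBSCAN infers the number of clusters from density and can leave sparse pixels unassigned as noise.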
Figure 9 showcases the balanced cluster sizes and spatial distribution from the 17 June imagery. These results become practical tools for quickly highlighting areas that need closer inspection.
Finally, the spatial distribution of the classes also provides insights into emerging patterns.
Climate change poses serious threats to the availability of essential-to-life resources, like water. Therefore, sustainability is an asset for any type of corporation as it provides fiscal advantages and new business opportunities.
A "Good data", or data-centric, approach focuses on the importance of data quality rather than quantity. For this purpose, EDA is useful for better understanding datasets and determining which data to aggregate and/or delete.
In this project, quartiles and cluster analysis (techniques frequently used in EDA) were applied to assess the turf health of a golf course, specifically to locate waterlogged and water-stressed areas. Both proved effective for preparing a baseline, which can later be used to benchmark more complex algorithms and to specify what type of data needs to be collected.