Our Motivation
Lead has been known to be harmful to humans, even in small doses, for over 50 years. Exposure to even low levels of lead can cause damage to the central and peripheral nervous systems, learning disabilities, impaired blood cell function, stunted growth, cardiovascular effects, and many other problems. Children are at particularly acute risk of lead poisoning. Because lead accumulates in the body, consistent exposure over time is especially harmful. Steps were taken to limit the use of lead in construction materials, particularly water pipes and paint, in the 1970s, before its use was significantly limited by the EPA in 1984. However, children are still experiencing lead poisoning at alarming rates. While the Flint water crisis has brought renewed attention to this issue, investigations have identified many communities across the country with rates of lead poisoning in children exceeding Flint's rate of 5%. The persistence of this major public health problem, together with the creation of new relevant datasets, creates an opportunity to apply new thinking and techniques to solve it.
Water utilities typically perform quality testing of water supplies at the treatment plant, the upstream end of a system. By the time water has been delivered to a residence, the distribution system may have added contaminants, as is typically the case with lead. Utilities, especially those in older communities, often don't know the exact details of their distribution systems, including pipe material, pipe size, or even pipe locations. Early systems were built without meticulous records, and construction in the field often deviated from plan. Excavating to discover what is actually in the ground is prohibitively expensive and intrusive. Prioritizing where to spend municipal dollars is a common approach, and new datasets give us a chance to redefine the prioritization method, using machine learning to predict the areas where the community is at highest risk.
Our Mission
Lead Alert provides data-driven guidance to identify areas that are at high risk of lead contamination in water supply systems throughout the state of California.
Background
While it is possible for lead to be in the water supply source, lead typically enters the water system from pipes and soldered joints, either in service lines or within the home. This is the result of a corrosion process in which chemicals in the water react with pipe materials and leach lead into the water.
Orthophosphate is commonly added to the water supply at treatment plants and distribution locations to build a protective layer on pipe walls, sealing the lead into the pipe. Where the orthophosphate layer breaks down, dissolved oxygen attaches to the pipe wall and oxidizes the lead in it; the oxygen then bonds with hydrogen, leaving the lead in its oxidized state. The oxidized lead dissolves into the water stream, contaminating the water on its path of delivery. Other chemicals, such as chloride, accelerate corrosion.
From this chemistry, we were able to identify several risk factors: both the chemicals in the water and the properties of the pipes. Statewide California water supply testing data gave us levels of these chemicals (including lead, orthophosphates, and chloride) in the water at various distribution sampling locations, tracked per water district.
We can also identify locations that are likely to have pipes with more lead. Lead pipes are more common in communities that were built a long time ago and have not been renovated since the EPA’s ban of lead materials in construction. We are using census data from the American Community Survey (ACS) to represent many of these factors, including age of construction, property value, and demographics, all of which have been shown in previous studies to be predictive of lead risk.
| | Water | Pipes |
|---|---|---|
| Chemical Risk Factors | Lead; Corrosive Chemicals; Anti-Corrosion Additives; Low pH (high acidity) | Lead Pipes; Lead Solder on Pipe Joints |
| Indicators | Presence of Lead; Presence of Chlorides; Absence of Orthophosphates | Older Construction (pre-1984); Low Likelihood of Improvements |
| Data Layers | Division of Drinking Water | American Community Survey |
Water Quality Data
Within the California State Water Resources Control Board, the Division of Drinking Water is responsible for maintaining the data related to Assembly Bill 746, which requires lead testing at all public schools. The map at left shows the testing results as of February 2018. More information can be found on the state's website.
The severe consequences of lead poisoning, not the frequency of poisoning, are what make it a public health hazard. Overall, only a small percentage of test results show high levels of lead. Of the 8,688 readings in the California database, only 415 (4.7%) had levels above 5 micrograms/liter, and 72 records (under 1%) were above the EPA's action level of 15 micrograms/liter. For machine learning prediction, this presents an imbalanced dataset. Additionally, over 8,000 of the 8,688 readings were listed as "< 5 micrograms/liter", an inexact value that required treating the results as classes rather than exact measurements.
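As a minimal sketch of how such censored readings can be binned into classes, the snippet below maps raw result strings to labels; the file and column names are hypothetical, not the project's actual schema.

```python
import pandas as pd

# Illustrative sketch: bin lead readings, most of which are censored
# as "< 5 micrograms/liter", into classes for modeling.
# File and column names are hypothetical.
df = pd.read_csv("lead_samples.csv", dtype={"result": str})

def to_class(result: str) -> str:
    """Map a raw reading to a contamination class."""
    if result.strip().startswith("<"):
        return "below_5"             # censored: only known to be under 5 ug/L
    value = float(result)
    if value >= 15:
        return "above_action_level"  # EPA action level of 15 ug/L
    elif value >= 5:
        return "above_5"
    return "below_5"

df["lead_class"] = df["result"].apply(to_class)
print(df["lead_class"].value_counts())
```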
The Division of Drinking Water also maintains the water quality testing database used for chemical indicators. Water supply sampling locations were linked to water districts and ultimately to the census tracts served. Indicators explored included levels of lead, chlorides, and orthophosphates.
Housing and Demographic Data
Lead Alert utilizes several common tables from the American Community Survey. All data comes from the 2016 5-year estimates, the most recent ACS data available. The tables used cover age of construction, property value, and demographics.
The map allows you to see some of this data that went into the Lead Alert model.
Aggregation and Transformation
The data came in many formats, and the primary challenge was to aggregate information across a common geographical unit. The census tract was the predominant unit across the sets and represented the underlying infrastructure and construction information. Using a series of geographical comparisons, each source was transformed to census tract units.
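As an illustration of one such geographical comparison, the sketch below assigns point sampling locations to the census tracts containing them using GeoPandas. It assumes a recent GeoPandas release; the file and column names are hypothetical.

```python
import geopandas as gpd

# Illustrative sketch: assign each sampling location to the census
# tract that contains it. File paths and column names are hypothetical.
tracts = gpd.read_file("ca_census_tracts.shp")     # tract polygons with GEOID
samples = gpd.read_file("sampling_locations.shp")  # point locations

# Reproject to a common coordinate system before comparing geometries.
samples = samples.to_crs(tracts.crs)

# Point-in-polygon join: each sample row gains the GEOID of its tract.
samples_with_tract = gpd.sjoin(
    samples, tracts[["GEOID", "geometry"]], how="left", predicate="within"
)
```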
The lead sampling data required association with a geographical location and a water source before it could be combined with the chemical and ACS sources. The school name and water source name were the only identifiers filled in. A series of fuzzy comparisons was used to align schools with a publicly available water source shapefile and a schools list complete with locations and water source identification numbers, as sketched below. Unmatched schools were manually searched or geolocated using the Google Geolocation Service.
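A minimal sketch of that fuzzy alignment, using only the Python standard library; the sample names and similarity cutoff are illustrative.

```python
import difflib

# Illustrative sketch: align school names from the testing database
# with the state schools list using fuzzy string matching.
# The example values below are hypothetical.
tested_schools = ["LINCOLN ELEM", "WASHINGTON HIGH SCHOOL"]
state_schools = ["Lincoln Elementary School", "Washington High School",
                 "Jefferson Middle School"]

def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(name.title(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for school in tested_schools:
    print(school, "->", best_match(school, state_schools))
```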
The combined lead sampling, water supply testing, and ACS data required a series of joins to capture max, mean, min, percentage, etc. for each census tract. The final transformation resulted in the training data set that contained water supply testing and ACS features with a lead contamination result for each census tract in California.
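For example, an aggregation step like the following (with hypothetical column names) collapses sample-level records into one row per tract:

```python
import pandas as pd

# Illustrative sketch: aggregate joined sample-level records up to one
# row per census tract. File and column names are hypothetical.
merged = pd.read_csv("merged_samples.csv")

per_tract = merged.groupby("tract_geoid").agg(
    lead_max=("lead_ugl", "max"),
    lead_mean=("lead_ugl", "mean"),
    lead_min=("lead_ugl", "min"),
    chloride_mean=("chloride_mgl", "mean"),
    pct_above_5=("lead_ugl", lambda s: (s > 5).mean()),  # share of readings over 5 ug/L
)
```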
Domain background and machine learning techniques used to build the model
In 2017, the state of California passed Assembly Bill 746, requiring all public schools in the state to test their water fixtures for lead by July 2019. A previous law had offered financial assistance for testing, but not all schools took advantage of it. The test results from the previous program and early returns from the mandatory program are available in a public database, providing some coverage of the state. This newly available dataset offers recorded lead levels at the downstream end of water supply systems throughout the state. Lead Alert uses these test results as labels for a machine learning model, allowing us to predict where contamination may be entering the water from the conveyance system itself.
Previous studies have shown that characteristics related to housing and construction, demographics, and property values are indicators of lead contamination for individual buildings or properties. These studies were mostly focused on a single city: Flint, Michigan. Lead Alert tries to expand the footprint of prediction to a much larger area, starting with the state of California and potentially extending to the United States. To do this, Lead Alert uses data from the American Community Survey as feature data to predict the risk of lead contamination at the census tract level, a standard geometry defined by the United States Census Bureau to roughly represent a neighborhood. Census tracts have large amounts of aggregated data readily available.
Lead Alert also attempts to use water quality testing data as features in prediction. This data includes upstream water quality tests for chemicals that are key elements in the lead contamination process. Key elements include lead, orthophosphates, and chlorides.
To work with the data, Lead Alert employed data leveling and ensemble learning techniques, described in detail in the Preprocessing and XGBoost sections below.
Refer to the GitHub repository for the complete collection of modeling code and artifacts. The technical report captures further detail on the effort.
Feature Selection
Two sets of features were used for the modeling: the full feature set and a feature subset. The full feature set contained 757 features, including the chemical threshold percentages but not the lead result field. The feature subset contained a total of 44 features, again including the chemical threshold percentages but no lead result data. The selected fields were those observed to yield higher performance when comparing ensemble models; they include data related to age of construction, population age, and the year residents moved into their current residence.
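This report does not prescribe a specific selection mechanism; as one hypothetical approach, features could be ranked by importance from a fitted ensemble model. The sketch below assumes `X_train` and `y_train` already exist as pandas objects, and the cap of 44 simply mirrors the subset size described above.

```python
import pandas as pd
from xgboost import XGBClassifier

# Hypothetical sketch: rank the full feature set by importance from a
# fitted ensemble model and keep the top fields. X_train and y_train
# are assumed to exist; 44 mirrors the subset size described above.
model = XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

importances = pd.Series(model.feature_importances_, index=X_train.columns)
subset_cols = importances.nlargest(44).index
X_train_subset = X_train[subset_cols]
```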
Preprocessing
The resulting data set was randomly split into training and test data with the test set being a third of the total size. The training data was then split again into training and validation with validation being a third of the total training size. The validation split further reduces the training size, although the intention is to compare validation and test results for model stability.
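A sketch of that two-stage split with scikit-learn; the random seed and the stratification on the imbalanced labels are illustrative assumptions, not documented choices.

```python
from sklearn.model_selection import train_test_split

# Sketch of the two-stage split described above: a third of the data
# held out for testing, then a third of the remaining training data
# held out for validation. X and y are the assembled features/labels;
# random_state and stratify are illustrative assumptions.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=1/3, random_state=42, stratify=y_trainval)
```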
Given the imbalanced nature of the data, preprocessing options were considered to help highlight the characteristics of the minority class that would otherwise be treated as noise. Data leveling techniques like the synthetic minority over-sampling technique (SMOTE) create synthetic instances of the minority class (in this case, high lead test results) based on the features of actual instances of the minority class. This avoids the overfitting that would result from strictly duplicating minority samples. The relatively small size of the aggregated lead contamination data lent itself to oversampling of the minority class. Random oversampling and SMOTE were employed on the training data using the Imbalanced-Learn library, along with no preprocessing for comparison.
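A minimal sketch of the two Imbalanced-Learn options compared; the random seeds are illustrative, and resampling is applied to the training split only so the validation and test sets stay untouched.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random oversampling: duplicate minority-class rows until balanced.
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# SMOTE: synthesize new minority-class rows by interpolating between
# real minority neighbors instead of duplicating them.
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
```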
XGBoost
Boosting-based ensemble modeling methods tend to perform well on imbalanced datasets, as they create many weak learners that in aggregate create a strong learner capable of making accurate predictions. XGBoost (Extreme Gradient Boosting) is an efficient implementation of a boosting method whereby the loss function of previous learners is used when building additional learners, increasing both speed and accuracy.
The XGBoost classifier was tuned using 10-fold cross-validation scored by accuracy. Once the best parameters were found, an XGBoost classifier was trained on the training data with those parameters. Ten-fold cross-validation and early stopping on the error metric were employed to help prevent overfitting. The boosting round count from the optimized XGBoost classifier, along with the best parameters, was captured to define the final model used in evaluation.
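A sketch of that tuning flow; the parameter grid is illustrative (the actual grid searched is not listed here), and the fit-time early-stopping API shown matches xgboost releases of the project era (newer releases take `early_stopping_rounds` in the constructor).

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Hypothetical parameter grid; the actual grid searched is not listed
# in this report.
param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}

# 10-fold cross-validation scored by accuracy, as described above.
search = GridSearchCV(XGBClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

# Refit with the best parameters, using early stopping on the
# validation split to cap the number of boosting rounds.
# (Fit-time early stopping matches older xgboost releases; newer
# releases take early_stopping_rounds in the constructor.)
model = XGBClassifier(**search.best_params_)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)], eval_metric="error",
          early_stopping_rounds=10, verbose=False)

best_rounds = model.best_iteration  # captured for the final model
```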
Evaluation
With the final model ready for evaluation, we predicted using the validation and test sets. The prediction outcome was a set of probabilities. We evaluated several probability thresholds but have chosen a 0.75 prediction probability cutoff for indicating lead contamination. A 0.50 cutoff is more common, but we wanted to remain conservative given the nature of the problem space. We wanted locations that we predicted as potentially at risk to be very likely at risk, not just more likely than not. The true positive and false positive rates may be used to calculate the optimal threshold for the training data, although this did not prove beneficial in testing.
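Applying that cutoff is a one-liner on the predicted probabilities; this sketch assumes a fitted classifier named `model` and the test split from above.

```python
# Flag tracts using the conservative 0.75 cutoff rather than the
# default 0.5 that predict() would apply.
proba = model.predict_proba(X_test)[:, 1]  # probability of the high-lead class
at_risk = proba >= 0.75                    # conservative contamination-risk flag
```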
Ultimately, the chosen feature data was not predictive of the selected label data. The factors that show some predictive power at the building level did not have the same relationship to lead contamination once aggregated to the census tract level. The two figures below show the distributions of the percentage of homes built in each tract in different time periods, first for all tracts and then for only those with lead readings above 5 µg/L. While the tracts with readings above 5 µg/L show some variation, the distributions are largely similar. This could be one reason why the model had difficulty making predictions.
The upstream water supply testing data did not prove predictive of lead contamination either. Lead contamination is most heavily influenced by the materials within the building itself.
To pursue similar predictive modeling in the future, it will be critical to use building property data as feature data. This is typically contained within county assessor data. We did not pursue this data because our project timeline did not allow the time necessary to acquire data from every county in California and merge it into a common schema. Additionally, not all county assessor offices in California have their data online, and some counties charge fees for the data. If California formalized a statewide structure and portal for assessor data, far more analysis could occur.
If a future study were to use building-level feature data and the school testing database as label data, we would recommend following many of our steps around data leveling and modeling, since the data will again be imbalanced. If feature data is at the building level, spatial clustering methods should also be evaluated. Not every building is a school, and schools have different properties than other building types, but this is still the most complete data available to train a model.
Architecture

We utilize a series of applications and Amazon Web Services (AWS) throughout the iterative phases of this effort: data acquisition, exploratory data analysis (EDA), conditioning and aggregation, modeling and testing, and information serving. As the creators, we control the data pipeline from the raw source to the interface for the users: water system managers.
Data is acquired manually from the providers and explored using a combination of Python, Jupyter, and Postgres applications connected to an Amazon RDS Postgres database instance. The Postgres database provides a way to manage and track the individual sources throughout conditioning and aggregation using table versions. Modeling is conducted using Python libraries within Jupyter. The modeling and testing results are stored in Postgres in preparation for information serving.
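A hypothetical sketch of that last step, writing results into the RDS Postgres instance; the connection string, table name, version tag, and the `predictions_df` DataFrame are all illustrative, not the project's actual schema.

```python
from sqlalchemy import create_engine

# Hypothetical sketch: push model results into the RDS Postgres
# instance. Connection string, table name, and version tag are
# illustrative; predictions_df is an assumed pandas DataFrame.
engine = create_engine("postgresql://user:password@host:5432/leadalert")

results = predictions_df.assign(model_version="v1")  # tag for table versioning
results.to_sql("model_results", engine, if_exists="append", index=False)
```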
CARTO, a service for cloud-based geospatial tools, provides the means to create geospatial views of the data versions and modeling results for the creators. Final geospatial views are created using the desired set of modeling and testing results stored in the Postgres database. The views are then linked from the Lead Alert website for the water system managers to consume.
The Lead Alert website is served to water system managers using AWS S3 and Route 53. The HTML and JavaScript code, stored in S3, utilizes the w3.css framework. Route 53 manages the domain name service for the registered leadalert.io domain. As more data becomes available, updates will be distributed through the Postgres database, CARTO, and the web interface.
Resources to learn more about preventing lead contamination
Home Water Testing
If you have reason to believe your water may be affected by lead contamination, you should have your water tested. The EPA provides information about home water testing, including links to local labs where water testing is performed. Please see the EPA’s Home Water Testing Fact Sheet to learn more.
In-Home Water Filters
If you do identify lead in your home's water supply, there are steps you can take to protect yourself. While you should definitely try to identify the cause of contamination and address it through permanent solutions, in-home filters can remove or reduce lead in water. The Environmental Working Group has a resource to help you select a water filter based on the contaminants to be filtered out and the type of installation. It provides links to purchase from Amazon, with proceeds from filters purchased through these links going to the Environmental Working Group as an Amazon affiliate.
LEAD CONTAMINATION REFERENCES
PREVIOUS STUDIES
Thank you to all who support this effort