New scalable computing technique will make analyzing Big Data easier 

Advances in data collection have driven an exponential increase in the availability and complexity of datasets, particularly spatiotemporal data. Finding the computing power to analyze such Big Data, however, remains a challenge for researchers across many fields. Through a collaborative research project funded by the National Science Foundation, George Mason University statistics professor Lily Wang hopes to change that.

Professor Lily Wang, Department of Statistics, College of Engineering and Computing. Photo by Creative Services

Wang and Huixia Judy Wang, chair of the Department of Statistics at the George Washington University, are developing scalable, distributed computing methods that reduce the computational burden on any single machine by spreading the analysis across a network of computers.

“In the past, we knew there were insights hidden in the data, but due to computing limitations, we couldn’t access them,” said Lily Wang. “Now, with scalable quantile learning techniques, we can gain a deeper understanding of the entire data distribution and extract insights into variability, outliers, and tail behavior, which are critical for more informed decision-making.” 

Spatial and temporal data are increasingly used in research areas such as climate science and health care, Lily Wang noted.

“This data richness presents a lot of opportunities for getting deep insights into dynamic patterns over time and space; but it also brings many, many challenges,” said Wang. Large datasets often exhibit heterogeneous and dynamic patterns, requiring new approaches to capture meaningful relationships. 

This project uses two large datasets: the National Environmental Public Health Tracking Network database from the Centers for Disease Control and Prevention and the outdoor air quality data repository from the Environmental Protection Agency. 

“Both datasets have been challenging to analyze in the past due to their size and complexity,” explained Wang. “But through scalable and distributed learning techniques, we’re now able to handle large-scale heterogeneous data across the entire United States.” 

One of the project's major innovations is the use of distributed computing to divide the data into smaller, manageable regions. Each region is analyzed separately, and the results are efficiently aggregated to form a comprehensive understanding of the entire dataset.  

“You can think of it like dividing the U.S. into small regions, analyzing each one separately, and then combining the results to create a comprehensive national analysis,” Wang said. “This method allows us to analyze millions of data points simultaneously without the need for supercomputers.” 
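To make the divide-and-aggregate idea concrete, here is a minimal, illustrative Python sketch; it is not the project's actual estimator or code, and all names and the simulated data are hypothetical. Regional batches of readings are analyzed independently in parallel worker processes, and the per-region summaries are then combined into a single national figure.

```python
# Illustrative sketch of a divide/analyze/aggregate workflow (hypothetical data
# and names; not the researchers' published method). Each region is summarized
# independently, then the summaries are combined nationally.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def analyze_region(values: np.ndarray, q: float = 0.9) -> tuple[float, int]:
    """Summarize one region: an upper-tail quantile of its readings plus its sample size."""
    return float(np.quantile(values, q)), values.size


def aggregate(summaries: list[tuple[float, int]]) -> float:
    """Combine per-region quantile estimates, weighting by region sample size.

    A weighted average is only a simple illustration of aggregation, not an
    exact national quantile.
    """
    quantiles, sizes = zip(*summaries)
    return float(np.average(quantiles, weights=sizes))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for air-quality readings split into 50 regions.
    regions = [rng.lognormal(mean=2.0, sigma=0.5, size=100_000) for _ in range(50)]

    # Each region is processed by a separate worker, so no single machine has
    # to hold or analyze the full dataset at once.
    with ProcessPoolExecutor() as pool:
        summaries = list(pool.map(analyze_region, regions))

    print(f"Aggregated 90th-percentile estimate: {aggregate(summaries):.2f}")
```

Because each worker only ever sees its own region's records, the memory and compute demands on any one machine stay modest, which is the essence of the distributed approach Wang describes.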

Beyond its technical goals, the project also emphasizes training the next generation of data scientists. Graduate students at George Mason and George Washington universities will gain hands-on experience working with real-world data and helping to develop new computational methods.

The project began on September 1, 2024, and is expected to last three years. It has already garnered attention, including recognition from the office of Congressman Gerry Connolly (D-VA). 

The potential applications of this research are far-reaching, from improving air quality predictions to understanding public health trends and beyond. Wang explained, “This work empowers researchers and policymakers to leverage vast amounts of data to address rising societal issues more effectively.”