Satellites orbiting Earth and beyond are capturing unprecedented volumes of information – from climate measurements to planetary observations – but much of this data remains unexamined. These untapped troves, often referred to as “dark data,” represent raw potential that, if analyzed, could yield scientific breakthroughs, guide policy and even enhance national security.
Yet decades of satellite observations have left vast swaths of this data dormant due to sheer volume, complexity and funding limitations.
“The term dark data basically refers to all the information that we stored and just didn’t analyze or develop [scientific] return from,” said Dr. Chris Mattmann, chief data and artificial intelligence officer at the University of California, Los Angeles.
“In the civilian space program, say with NASA, we put up these missions, and in some cases 95% of the data isn’t fully processed,” said Mattmann, former division manager of the Artificial Intelligence, Analytics and Innovative Development Organization at NASA’s Jet Propulsion Laboratory.
Early satellite missions relied on relatively manageable data volumes and straightforward processing workflows built around specific instruments such as spectrometers, lidars and radars, Mattmann said. Over time, both the scale and complexity of spaceborne data expanded significantly, outpacing those traditional approaches to analysis, he said.
NASA’s Orbiting Carbon Observatory, for example, collected 150 terabytes in the first three months alone – more than 30,000 DVDs’ worth, Mattmann said. “Most scientists were used to having the data on their desktops. Now, you need Spark, MapReduce and robust software pipelines to make sense of it. The skillset required has shifted dramatically,” he said.
Technical and Institutional Barriers
Dark data has emerged from a combination of technological growth and institutional constraints, said Mattmann. Satellite instruments now generate enormous volumes of heterogeneous data, and processing pipelines have grown increasingly complex, Mattmann said. “Scientists had to become hybrid software engineers just to process the modern data,” he said. “They moved from IDL or MATLAB locally to big data frameworks that require advanced software engineering.”
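To illustrate the shift Mattmann describes, a minimal PySpark sketch of that kind of pipeline might look like the following; the archive path, record fields and thresholds are hypothetical placeholders rather than any specific mission’s actual data layout.

```python
# Minimal sketch: screening many satellite granules with PySpark instead of
# opening files one at a time in IDL or MATLAB on a desktop.
# The input path and the fields in each record are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("granule-screening").getOrCreate()

# Suppose each granule has already been summarized into Parquet records with
# a few fields extracted from its metadata.
granules = spark.read.parquet("s3://example-archive/granule-index/")

# Keep only granules likely to be worth full processing:
# low cloud fraction and a calibration check that passed.
usable = (
    granules
    .filter(F.col("cloud_fraction") < 0.2)
    .filter(F.col("calibration_ok"))
)

# Summarize how much of the archive survives the screen, per instrument.
usable.groupBy("instrument").count().show()
```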
“Scientists had to become hybrid software engineers just to process the modern data.” – Dr. Chris Mattmann, UCLA
Funding structures also limit the analysis of existing datasets. Unlike the DoD, which might award a company $30 million for a space traffic management system, NASA’s awards tend to be small, Mattmann said. “NASA defines its core mission around collecting and disseminating data, but analysis is often left to smaller, competitive awards,” he said.
In fact, just days after the historic Artemis II mission launch on April 1, the White House proposed a $5.6 billion cut to NASA’s 2027 budget.
As a result, efforts to go back and analyze dark data tend to be driven by individual investigators, Mattmann said.
“But that leaves a lot, so that’s why you have dark data,” he said.
Prioritizing which legacy datasets are worth salvaging is another challenge. NASA’s decisions are often guided by decadal surveys conducted by the National Academies of Sciences, Engineering, and Medicine, which establish research priorities for astronomy, Earth science and other disciplines, said Mattmann.
Data Selection, Processing, and the Role of Software
As data volumes grow, the risk of valuable information slipping into “dark” territory only increases. That makes it essential for missions to prioritize the most meaningful data from the start, ensuring limited downlink and processing resources are focused where they provide the greatest value. That need has led to a range of approaches for screening, ranking and scheduling data on-orbit.
Satellites that collect Earth observation (EO), synthetic aperture radar (SAR) and radio frequency (RF) data face the same fundamental limitation as other edge sensors: They can gather far more information than they are able to transmit back to Earth, said Chris Gregory, vice president of product management at Kratos. “Downlink bandwidth is one of the most precious resources for these companies, and this is only available during short time windows while the satellite is in range of a ground station,” Gregory said. Because of that, onboard triage has become essential. “Even exquisite satellites with access to high-speed relay networks in space need to be mindful, but for most smallsats access to a modest-sized RF pipe a few times per orbit is all they get,” he said.
One approach is to use onboard software to run basic quality checks so that bandwidth isn’t spent transmitting cloud-covered imagery or RF recordings without any detectable signals, said Gregory. Another option is to use more advanced scheduling tools that can update priority decisions right up until a satellite begins its downlink, relying on coordinated software both in orbit and on the ground, Gregory said.
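A minimal sketch of that kind of onboard triage might look like the following; the capture fields, thresholds and priority scheme are illustrative assumptions, not Kratos’ actual flight software.

```python
# Minimal sketch of onboard triage before downlink: run cheap quality checks
# and re-rank queued captures so limited bandwidth goes to the best data.
# Field names, thresholds and the priority scheme are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Capture:
    capture_id: str
    kind: str              # "EO", "SAR" or "RF"
    cloud_fraction: float  # for imagery; 0.0 for non-optical captures
    peak_snr_db: float     # for RF recordings; signal strength estimate
    size_mb: float

def passes_quality_check(c: Capture) -> bool:
    """Reject captures unlikely to be useful on the ground."""
    if c.kind == "EO" and c.cloud_fraction > 0.8:
        return False    # mostly cloud-covered imagery
    if c.kind == "RF" and c.peak_snr_db < 3.0:
        return False    # no detectable signal in the recording
    return True

def plan_downlink(queue: list[Capture], budget_mb: float) -> list[Capture]:
    """Pick the highest-value captures that fit in the downlink window."""
    good = [c for c in queue if passes_quality_check(c)]
    # Simple priority: clearer imagery and stronger signals first.
    good.sort(key=lambda c: (c.cloud_fraction, -c.peak_snr_db))
    plan, used = [], 0.0
    for c in good:
        if used + c.size_mb <= budget_mb:
            plan.append(c)
            used += c.size_mb
    return plan
```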
Machine-learning-based quality control is still largely untapped but remains a promising area for improving data screening in space and reducing the amount of low-value data that needs to be processed on the ground, Gregory said. A flexible software environment also allows missions to integrate these capabilities in different ways, whether through built-in scheduling features or by uploading their own processing functions, he said.
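As a hedged illustration of that idea, an ML-style quality score could gate what enters the downlink queue along these lines; the scoring function below is a crude stand-in for a real onboard model, and the threshold is arbitrary.

```python
# Minimal sketch of ML-based quality control on orbit: a lightweight model
# scores each image tile, and only tiles above a usefulness threshold are
# queued for downlink. The scoring function and threshold are hypothetical.
import numpy as np

def usefulness_score(tile: np.ndarray) -> float:
    """Placeholder for a small onboard model (e.g. a tiny CNN) returning the
    probability a tile contains usable, cloud-free scene content.
    Here, a crude brightness/variance heuristic stands in for the model."""
    brightness = float(tile.mean())
    texture = float(tile.std())
    # Very bright, very flat tiles look like cloud; textured tiles look usable.
    return min(1.0, texture / (brightness + 1e-6))

def screen_tiles(tiles: list[np.ndarray], threshold: float = 0.15) -> list[int]:
    """Return indices of tiles worth spending downlink bandwidth on."""
    return [i for i, t in enumerate(tiles) if usefulness_score(t) >= threshold]

# Example: eight random 64x64 tiles standing in for freshly captured imagery.
tiles = [np.random.rand(64, 64) for _ in range(8)]
keep = screen_tiles(tiles)
print(f"{len(keep)} of {len(tiles)} tiles queued for downlink")
```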
Running quality control in space allows operators to optimize their downlink bandwidth by not wasting any of it on poor quality data, said Gregory. With ground-based quality control, some downlink would be wasted but operators could still reject bad data, ultimately preserving time and computing resources during processing, he said. This would in turn reduce the average latency between the time the data arrives on the ground and when it reaches end customers, he said.
“Legacy systems are purpose-built, hardware-based [systems] that have no way to easily change what data they produce or its format,” said Stuart Daughtridge, vice president of advanced technology at Kratos, echoing Gregory. “Often that can significantly limit what can be done with the data and how it can be used.”
AI, Academia and Open Source as Solutions
As for data that’s already gone dark, AI and machine learning could aid in its recovery and processing; however, domain expertise remains critical because of the scientific and historical nature of these datasets, according to Mattmann. “AI doesn’t understand science data,” he said. “Drop an HDF file on OpenAI ChatGPT and ask it to do something with it and then be surprised that it doesn’t know what to do.”
You need domain expertise to interpret the metadata, sensor artifacts and historical context, he said.
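To make the HDF point concrete, a small sketch using h5py can list a granule’s groups, datasets and attributes; the filename below is hypothetical, and knowing what those attributes actually mean is exactly where the domain expertise Mattmann describes comes in.

```python
# Minimal sketch: opening a science granule in HDF5 and dumping its structure.
# The filename is hypothetical; knowing what the datasets, units, fill values
# and quality flags actually mean requires instrument-specific expertise.
import h5py

def describe(name, obj):
    """Print each group/dataset along with its attributes."""
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    print(f"{kind}: /{name}")
    for key, value in obj.attrs.items():
        print(f"    attr {key} = {value!r}")

with h5py.File("example_granule.h5", "r") as f:
    # Walk the whole file hierarchy; a real granule may contain hundreds of
    # datasets whose meanings are documented only in mission-specific guides.
    f.visititems(describe)
```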
With this in mind, academia and the national labs are underutilized, Mattmann said. Higher education trains the future AI workforce and builds the skills needed to analyze data in-house rather than outsourcing everything to big tech, he said. National labs, both DOE and NASA, were created to solve hard problems, yet today those capabilities are often passed over in favor of large, global tech firms with big terrestrial data centers, he said.
Open-source software also plays a critical role in rapidly building the talent needed to compete with big tech by giving people hands-on access to real tools and data, instead of forcing them to spend years learning complex concepts before they can contribute, said Mattmann.
“Open source actually trains them because it allows and provides a framework for people to kick the tires on your technology and your data in a way that again upskills the workforce and helps us compete against our foreign adversaries,” Mattmann said.
Emerging business models are another pathway for unlocking the value of dark data, Mattmann said. While launch gets a lot of attention in the space world, analysis and data are underrated sources of long-term value, he said.
Instead of spending $200 million to fly and operate a new remote sensing instrument, an organization could spend $30 million to mine the existing archive of dark data, Mattmann suggested.
Mining historical datasets can support geospatial products, educational tools or insights for defense and intelligence, Mattmann said.
Looking Ahead: Preserving the Value of Orbital Data
The next generation of satellite data faces the risk of becoming dark if governance and standards lag behind commercial innovation, Mattmann warned. Companies like Starlink have world-class, commercial data management—especially around archiving and avoiding dark data—but those capabilities aren’t translating to broader civilian systems, said Mattmann.
“The problem is I see a huge skill gap of technical sophistication between those verticalized companies, and I see no effort to standardize the modern learnings,” he said.
Mitigating these risks necessitates renewed investment in national labs, academia and open-source initiatives, coupled with cross-sector collaboration, Mattmann said. “We need American entrepreneurialism and innovation, but we kind of need to leverage our national capacity at the universities too,” he said.
Nations that take deliberate steps to align AI innovation with robust standards are better positioned to strengthen their geopolitical standing, Mattmann said.
“Because the current climate is what it is, the West needs to really understand how to take advantage and exercise its capabilities,” he said.
Explore More:
Beyond Connectivity: The Promise of Orbital Data Centers
AI for EO, Neural Network Supervisors and Overcoming the Clouds
Earth Observation Market Trends to More Growth and Value From Data