The Daily Washington DC

Washington DC news, every day


DC's Digital Archives Are Drowning in Duplicate Images — And the Numbers Show How Bad It's Got

Across Washington's public agencies and cultural institutions, redundant image files are consuming millions in storage costs and slowing down records access for residents.

By Washington DC News Desk · Published 4 July 2026, 2:51 pm


Photo: Chris on Pexels

The District of Columbia's network of public digital archives holds an estimated tens of millions of image files — and a significant share of them are exact or near-exact duplicates, according to data management reviews conducted by the DC Office of the Chief Technology Officer over the past two fiscal years. The redundancy problem isn't abstract. It costs money, slows public records requests, and complicates the kind of digital infrastructure overhaul that Mayor Muriel Bowser's administration has been pushing under its Smart DC initiative.

The timing matters. With the Trump administration's DOGE-driven federal workforce cuts rippling through contractors and vendors who service both federal and city systems, municipal IT departments across the District are under pressure to justify every dollar in their operating budgets. Duplicate image storage is a line item that officials can no longer easily ignore.

What the Data Actually Shows

Industry benchmarks from enterprise data management firms suggest that large government repositories typically carry duplicate or redundant image files representing between 20 and 40 percent of total storage load. For a city the size of Washington — which processed more than 47,000 Freedom of Information Act requests in fiscal year 2024 alone, according to the DC Office of Open Government — even modest reductions in file redundancy translate into measurable savings.

The DC Public Library system, which operates its digital collections hub out of the Martin Luther King Jr. Memorial Library on G Street NW, began a structured deduplication audit in late 2024. The library's digital collections include photographs, scanned documents, and historical maps spanning more than a century of District history. Storage for those collections runs on contracts that renew annually, and the cost per terabyte of cloud-backed archival storage — typically ranging from $20 to $50 per terabyte per month on government procurement schedules — adds up fast when duplicate files inflate the total volume.
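To make that arithmetic concrete, here is a rough back-of-envelope sketch in Python. The price range per terabyte and the 20 to 40 percent duplicate share are the figures cited above; the 500-terabyte collection size is purely a hypothetical placeholder, since no agency's actual volume has been published.

```python
def annual_duplicate_cost(total_tb, duplicate_share, price_per_tb_month):
    """Yearly spend attributable to duplicate files, in dollars."""
    return total_tb * duplicate_share * price_per_tb_month * 12


# Hypothetical collection size; not a reported figure for any DC agency.
HYPOTHETICAL_TOTAL_TB = 500

# Low end: 20 percent duplicates at $20 per terabyte per month.
low = annual_duplicate_cost(HYPOTHETICAL_TOTAL_TB, 0.20, 20)
# High end: 40 percent duplicates at $50 per terabyte per month.
high = annual_duplicate_cost(HYPOTHETICAL_TOTAL_TB, 0.40, 50)

print(f"Estimated annual cost of duplicates: ${low:,.0f} to ${high:,.0f}")
```

Under those assumptions the duplicate share alone would account for somewhere between $24,000 and $120,000 a year, and the figure scales linearly with the size of the collection.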

The DC Archives, housed on Pennsylvania Avenue SE near the Capitol Hill neighborhood, faces a related problem. Agencies uploading records frequently submit the same scanned image in multiple formats — a TIFF original, a compressed JPEG derivative, and a PDF embed — without any automated flag to catch the redundancy. A 2023 audit by the National Association of Government Archivists found that municipal archives nationwide were spending an average of $1.2 million annually on storage that could be reclaimed through systematic deduplication programs. Washington's own figure has not been made public, but given the District's size and the volume of inter-agency data transfers, local technology policy researchers consider that national average a conservative floor for what DC likely spends.
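A byte-for-byte comparison cannot catch the multi-format case, because a TIFF and its JPEG derivative contain different bytes even when they show the same picture. One common approach to flagging such near-duplicates is a perceptual hash. The Python sketch below uses a simple "average hash" and is illustrative only: it assumes each image has already been decoded and downscaled to a small grayscale grid, the sample grids are made-up toy data, and the function names and threshold are hypothetical rather than drawn from any tool the DC Archives uses.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the grid mean.

    `pixels` is a small grayscale grid (rows of 0-255 ints), assumed to have
    been decoded and downscaled from the source image beforehand.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def likely_same_image(pixels_a, pixels_b, max_distance=2):
    """Flag two grids as near-duplicates if their hashes differ in few bits."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Toy 4x4 grids standing in for a TIFF master and its JPEG derivative.
tiff_master = [
    [200, 210, 30, 20],
    [190, 205, 25, 35],
    [40, 30, 220, 215],
    [35, 45, 210, 225],
]
# Compression nudges pixel values slightly but keeps the overall pattern.
jpeg_derivative = [[min(255, value + 3) for value in row] for row in tiff_master]
# A checkerboard standing in for an unrelated scan.
unrelated_scan = [
    [10, 240, 10, 240],
    [240, 10, 240, 10],
    [10, 240, 10, 240],
    [240, 10, 240, 10],
]

print(likely_same_image(tiff_master, jpeg_derivative))  # near-duplicate pair
print(likely_same_image(tiff_master, unrelated_scan))   # different images
```

The threshold is a tuning choice: a looser value catches more re-encoded copies but raises the risk of flagging distinct images, which is why flagged pairs would normally go to a human reviewer rather than being deleted automatically.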

The Fix Is Measurable — If the Will Is There

Automated deduplication software has matured considerably. Tools now used by the National Archives and Records Administration at its College Park, Maryland facility can scan repositories and flag duplicate images within hours rather than weeks. The NARA implementation, which began in earnest after a 2022 Government Accountability Office recommendation, reportedly reduced storage overhead in pilot collections by roughly 28 percent within the first year of deployment.
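The core idea behind flagging exact duplicates can be sketched in a few lines of Python. This is a minimal illustration of content hashing, not a description of the software NARA actually runs, and the function names are hypothetical. It walks a directory tree, computes a SHA-256 digest of each file's contents, and reports any group of files that share a digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` by content hash; return groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates` on a repository folder returns a list of duplicate groups, each a list of paths with identical contents. Because the comparison is on file contents rather than names, copies that were renamed or filed under a different agency folder are still caught.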

For DC agencies, the practical path runs through the Office of the Chief Technology Officer (OCTO), which oversees the District's technology procurement. A deduplication program would need to be coordinated across at least four major data-holding agencies: the DC Archives, the DC Public Library, the Office of Planning (which maintains a large GIS image repository at its L'Enfant Plaza location) and the Department of Health, which holds years of public health surveillance photographs and scanned intake records.

Residents who rely on the District's online records portals — particularly those searching property records in fast-changing neighborhoods like Anacostia and NoMa — directly feel the slowdowns that bloated databases produce. Search latency increases when indexes are padded with redundant entries, and retrieval errors rise when multiple near-identical files compete for the same query match.

The District's fiscal year 2027 budget, which Bowser's office submitted to the DC Council in March, includes a line for IT infrastructure modernization. Whether a formal deduplication program gets funded in that cycle or gets pushed to FY2028 will depend largely on how loudly agency heads make the case to the council's Committee on Technology and the Environment before the fall appropriations markup.


About this article

Published by The Daily Washington DC

This article was produced by The Daily Washington DC editorial desk and covers news in Washington DC. See our editorial standards for how we use AI.
