Washington's public institutions are sitting on millions of duplicate digital images, a sprawling redundancy problem that is quietly draining storage budgets and slowing down archivists at agencies from the DC Public Library system on G Street NW to the Smithsonian's digitization offices on the National Mall. The scale of the problem has become measurable enough that technology managers inside several District institutions are now pushing for systematic deduplication programs — the process of identifying and replacing redundant image files with single authoritative copies.
The timing matters. The Trump administration's ongoing federal workforce restructuring, driven in part by DOGE-related efficiency reviews, has forced shared service agreements between federal and DC municipal technology offices into limbo. That uncertainty has delayed at least two planned data-consolidation projects that city officials had expected to launch before the end of fiscal year 2026, which closes September 30.
What the Numbers Actually Look Like
Across large municipal archives, industry benchmarks suggest that duplicate or near-duplicate image files can account for between 20 and 40 percent of total digital storage consumption, a range cited in data management literature from organizations including the Storage Networking Industry Association. For an institution holding tens of millions of scanned records, that translates to real costs. Commercial cloud storage pricing in 2026 runs roughly $0.023 per gigabyte per month on standard tiers from major providers, meaning a repository sitting on even 50 terabytes of redundant image data could be spending roughly $13,800 a year on storage it functionally does not need.
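The storage arithmetic can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, the rate and volume are the figures quoted above, and it assumes decimal terabytes (1,000 gigabytes each) with no retrieval, request, or egress charges.

```python
# Illustrative only: annual_storage_cost is a hypothetical helper,
# not part of any provider's billing tools.
def annual_storage_cost(terabytes, usd_per_gb_month=0.023, gb_per_tb=1000):
    """Estimate yearly cost of keeping `terabytes` of data on a standard tier.

    Assumes a flat per-gigabyte monthly rate and decimal terabytes.
    """
    gigabytes = terabytes * gb_per_tb
    return gigabytes * usd_per_gb_month * 12


cost = annual_storage_cost(50)
print(f"${cost:,.0f} per year")
```

At those figures the estimate comes to about $13,800 a year. Counting a terabyte as 1,024 gigabytes instead raises it to a little over $14,100.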
The DC Public Library's People's Archive, which holds photographic and documentary collections covering the District's history from the Civil War forward, completed an internal file audit in early 2025. The archive, housed at the Martin Luther King Jr. Memorial Library on G Street NW, found duplicate entries across multiple acquisition batches, a common byproduct of phased digitization drives in which the same physical photograph gets scanned more than once by different contractors or volunteers. The library has not published the specific duplicate count from that audit, and a library spokesperson did not respond to a request for comment by press time.
The Anacostia Community Museum, the Smithsonian's neighborhood-rooted institution on Fort Place SE, has been expanding its digital holdings as part of a broader push to document the rapid gentrification reshaping that corner of Ward 8. Archivists there have flagged that community-submitted photo donations frequently arrive with duplicates already embedded — multiple family members submitting the same scanned image independently. The museum's collection management system requires manual review to catch those overlaps, a labor-intensive step that consumes staff hours the institution can ill afford after federal hiring freezes reduced Smithsonian support staff across several facilities.
The Deduplication Push — and What's Slowing It Down
Automated deduplication tools have existed for years, but adoption inside government-adjacent archives has lagged. Procurement rules, interoperability requirements with legacy catalog systems, and the need to maintain chain-of-custody documentation for historical records all complicate what would be a straightforward IT fix in a private company. The DC Office of the Chief Technology Officer, headquartered at One Judiciary Square on D Street NW, has been developing updated data governance guidelines that would cover image file management across District agencies, but no timeline for that rollout has been confirmed publicly.
Federal funding uncertainty compounds the delay. The National Endowment for the Humanities, which has historically provided grant support for exactly this kind of archival infrastructure work, has seen its programming disrupted by budget negotiations on Capitol Hill through the spring of 2026. Several DC-based institutions that had pending digitization grant applications have been waiting longer than the standard 90-day review period for responses.
For institutions that cannot wait for a top-down solution, technology managers recommend conducting a hash-based audit — comparing file fingerprints rather than file names — as a low-cost first step. Open-source tools can run those comparisons against existing collections without requiring new procurement approvals. The harder question is governance: who has authority to delete a file from a public archive, even a redundant one? In the District, that answer is not yet settled, and until it is, the duplicate images — and the bills for storing them — will keep accumulating.
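A hash-based audit of the kind described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not any institution's actual tooling: the function names are hypothetical, and SHA-256 is an assumed choice of fingerprint. It only reports groups of byte-identical files and deletes nothing, which leaves the governance question untouched.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under `root` by content digest.

    Returns a dict mapping each digest to a list of two or more paths
    whose contents are byte-for-byte identical. File names are ignored.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates` on a collection directory returns the groups for a human reviewer to inspect. One limit is worth noting: a cryptographic hash matches only exact copies. Two separate scans of the same physical photograph will differ at the byte level and will not be flagged, so catching those near-duplicates would require a different technique, such as perceptual hashing.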