Government agencies and cultural institutions in Washington, DC, moved this week to address a growing crisis in their digital collections: tens of thousands of duplicate images that clog storage systems, slow public access portals, and drive up data costs at a time when federal funding is already under pressure from ongoing budget restructuring.
The push comes as the DC Office of the Chief Technology Officer, based at 200 I Street SE, flagged the issue in an internal review circulated to city departments in late June. The review identified redundant image files as a significant contributor to inefficiencies across the District's shared cloud infrastructure — a problem that has quietly compounded for years but landed with new urgency as DOGE-linked federal efficiency mandates trickle down to local government contracts.
What Happened This Week
The Anacostia Community Museum, part of the Smithsonian Institution network on Fort Place SE, confirmed it is midway through a deduplication audit of its digitized photograph collections, a project that began in earnest on June 30. Staff there are working through an estimated 40,000 scanned images, a portion of which were duplicated during a 2022 migration to a new content management system. The museum did not provide a completion date but said the audit is expected to reduce active storage load by roughly 30 percent once finished.
Across town, the DC Public Library system — headquartered at the Martin Luther King Jr. Memorial Library on G Street NW — began deploying automated deduplication software across its Digital Collections branch on July 2. Library administrators confirmed the rollout in a brief public notice posted to their website. The tool, licensed through a procurement finalized in May, scans for pixel-level matches and near-duplicates introduced during mass scanning drives conducted between 2018 and 2024.
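The library has not described how its licensed tool works internally. As a rough illustration only, the two kinds of match mentioned above could be sketched as follows. Everything here is an assumption: images are represented as small grids of 8-bit grayscale values already resized to the same dimensions, exact matches are found with a SHA-256 digest of the pixel data, and near-duplicates are found with a simple average hash compared by Hamming distance. The names pixel_digest, average_hash, and is_near_duplicate are hypothetical.

```python
import hashlib
from typing import List

# Hypothetical stand-in for a decoded image: rows of 8-bit grayscale pixels (0-255).
Image = List[List[int]]


def pixel_digest(img: Image) -> str:
    """SHA-256 over the dimensions and raw pixel values.

    Two images share a digest only if they are pixel-for-pixel identical.
    """
    h = hashlib.sha256()
    h.update(f"{len(img)}x{len(img[0])}:".encode())
    for row in img:
        h.update(bytes(row))
    return h.hexdigest()


def average_hash(img: Image) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Image, b: Image, max_distance: int = 2) -> bool:
    """Treat two same-sized grids as near-duplicates if their hashes nearly agree.

    max_distance is an illustrative threshold, not a recommended value.
    """
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return False
    return hamming(average_hash(a), average_hash(b)) <= max_distance
```

In this sketch, a rescan that is uniformly a little brighter than the original fails the exact digest check but still passes the near-duplicate check, because the average hash depends only on which pixels sit above the image mean.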
For organizations managing large image archives, duplicate files are rarely trivial. Storage costs on commercial cloud platforms typically run between $20 and $25 per terabyte per month, and a single unchecked digitization campaign can generate duplicate rates exceeding 15 percent of total file volume, according to published benchmarks from the Digital Preservation Coalition. For a mid-sized municipal archive holding several hundred terabytes, that translates into roughly a thousand dollars or more in unnecessary monthly overhead.
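The arithmetic behind that estimate can be worked through in a few lines. The archive sizes below (300 and 500 terabytes) are assumed values chosen to stand in for "several hundred terabytes"; the 15 percent duplicate rate and the $20 to $25 price range come from the figures above. The function name is hypothetical.

```python
def monthly_duplicate_cost(archive_tb: float, duplicate_rate: float, price_per_tb: float) -> float:
    """Monthly cost of storing duplicate data, in dollars."""
    return archive_tb * duplicate_rate * price_per_tb


# Assumed archive sizes; rate and prices taken from the figures cited in the article.
for archive_tb in (300, 500):
    low = monthly_duplicate_cost(archive_tb, 0.15, 20)
    high = monthly_duplicate_cost(archive_tb, 0.15, 25)
    print(f"{archive_tb} TB archive: about ${low:,.0f} to ${high:,.0f} per month on duplicates")
```

At 300 terabytes, a 15 percent duplicate rate means 45 terabytes of redundant data, or roughly $900 to $1,125 a month. At 500 terabytes the range is roughly $1,500 to $1,875.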
Why the Timing Matters
The urgency is partly financial and partly bureaucratic. The federal General Services Administration has set an October 1 deadline for certain federally affiliated DC institutions to demonstrate compliance with updated data governance standards tied to cloud storage contracts. Institutions that cannot certify clean, non-redundant digital inventories by that date risk renegotiated contract terms or reduced storage allocations.
Mayor Muriel Bowser's office has not issued a formal citywide directive on the matter, but technology staff at the John A. Wilson Building on Pennsylvania Avenue NW have been coordinating with individual agency IT leads since mid-June. The coordination is informal at this stage — no dedicated program office has been stood up, and no city funding line has been attached to the effort.
Smaller cultural organizations in neighborhoods like NoMa and Shaw, which received DC Humanities Council digitization grants between 2020 and 2023, face a steeper climb. Several of those grant-funded projects produced image archives without deduplication protocols built into the original scope of work, leaving recipient organizations to clean up the backlog with limited technical staff and no additional grant support currently available.
For residents or researchers who use the DC Public Library's digital portal or the Smithsonian's online collections, the practical effect of this week's work should eventually be faster load times and more reliable search results. Archivists recommend that anyone submitting digital donations to a DC institution this summer ask explicitly whether the receiving organization has a deduplication step in its intake workflow — a simple question that can prevent the same problem from recurring.
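What a deduplication step at intake might look like is not specified by any of the institutions involved. One minimal sketch, assuming a byte-for-byte check against files already accepted, is below. The class name IntakeIndex and its register method are hypothetical, and a real workflow would keep the index in a database instead of in memory.

```python
import hashlib
from typing import Dict, Optional


class IntakeIndex:
    """Tracks SHA-256 digests of accepted files so repeat submissions can be flagged."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # digest -> name of the first file with that content

    def register(self, name: str, data: bytes) -> Optional[str]:
        """Record a submitted file.

        Returns the name of an earlier identical file if one exists,
        otherwise stores this file's digest and returns None.
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = name
        return None
```

A check like this catches only exact copies. A file that has been re-saved, cropped, or rescanned would pass through unless a near-duplicate comparison is also run.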