Washington, DC's archival institutions are quietly confronting a problem that has compounded for years and can no longer be deferred: tens of thousands of duplicate digital images are clogging storage systems at a moment when every dollar spent on data infrastructure is under scrutiny. The question now is not whether to act, but how fast and at whose expense.
The pressure is acute in 2026. Federal workforce restructuring under the Trump administration, combined with efficiency mandates tied to DOGE cost reviews, has forced agencies and their downstream partners, including DC-based cultural nonprofits and municipal archives, to justify every line of technology spending. Duplicate image files can account for 20 to 40 percent of a typical institutional digital collection, according to industry benchmarks published by the Digital Preservation Coalition. They represent both a wasted storage cost and a genuine preservation risk: when duplicates proliferate without metadata controls, staff lose track of which version is authoritative.
Who Holds the Files — and Who Decides What Stays
The DC Public Library system, which operates its flagship Martin Luther King Jr. Memorial Library on G Street NW, has been expanding its digital collections since the building's 2020 renovation. The library's Special Collections division holds photographic records stretching back to the late 19th century, many of which were digitized in multiple passes over the past decade — often by different vendors using different resolution standards — leaving a thicket of overlapping files. The decision about which version to designate canonical, and which to delete, is not merely technical: it determines what historians and journalists can access through the library's online portal.
Several blocks away, the DC Office of Planning maintains its own Geographic Information System image repository, used for zoning reviews in rapidly changing neighborhoods including Anacostia and NoMa. Planners there have noted internally that redundant copies of aerial survey images from successive years complicate version control when case officers pull records for development hearings. With NoMa alone seeing dozens of active permit applications at any given time along the New York Avenue corridor, a misfiled or duplicate baseline image can delay a decision by weeks.
The Smithsonian Institution, which operates 19 museums and galleries across the National Mall and beyond, faces the largest-scale version of this challenge. The Smithsonian's digital asset management program has been working since at least 2022 to implement deduplication protocols across its collections management systems, a process that involves not just deleting redundant files but verifying checksums, updating catalog records, and auditing rights clearances for each retained image. That work does not happen cheaply: industry estimates for a mid-size institutional deduplication project run between $150,000 and $500,000 depending on collection size and staff capacity.
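The checksum step can be illustrated with a short sketch. It is not the Smithsonian's actual tooling, which the institution has not detailed here; it is a minimal example, assuming a local folder of image files, that groups files by SHA-256 digest so that byte-identical copies surface together. The function names sha256_of and find_duplicate_groups are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Map each digest to the files sharing it, keeping only digests seen more than once.

    Hypothetical helper: 'root' is assumed to be a directory of digitized images.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

A checksum match only catches exact, byte-for-byte copies. Scans of the same photograph made at different resolutions or by different vendors will produce different digests, so a collection like the one described at DCPL would also need perceptual comparison or catalog-level matching, followed by the human review of catalog records and rights that the paragraph above describes.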
The Decision Timeline Is Tightening
Three choices sit on the table for DC institutions navigating this in the second half of 2026. First, institutions can pursue in-house deduplication, using vendor platforms or open-source tools such as Apache Tika (a content-detection and metadata-extraction toolkit whose output can feed a duplicate-matching workflow), and absorb the staff time cost. Second, they can contract the work to a digital preservation firm, several of which operate out of offices in the broader DMV region, at higher upfront cost but faster throughput. Third, and most politically fraught given the current federal climate, they can apply for grant funding through the Institute of Museum and Library Services, whose budget has itself been subject to congressional pressure this year.
Mayor Muriel Bowser's office has signaled that the city's FY2027 budget will prioritize basic services over technology infrastructure, which narrows the municipal funding window for DCPL and the Office of Planning. Institutions that wait for a more favorable budget cycle risk a different kind of cost: commercial cloud storage pricing, which has risen roughly 8 percent year-over-year for high-availability tiers, makes carrying duplicate data increasingly expensive by the month.
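The arithmetic behind that carrying cost is simple to sketch. The example below is illustrative only: the 8 percent annual increase and the 20 percent duplicate share (the low end of the benchmark range) come from the figures cited in this article, while the $100,000 annual storage bill is an invented placeholder, not any institution's actual spending. The function name duplicate_carrying_cost is hypothetical.

```python
def duplicate_carrying_cost(annual_storage_cost, duplicate_share, annual_increase, years):
    """Estimate, year by year, what the duplicate portion of a collection costs to store.

    Assumes the collection size stays constant and prices compound once a year.
    """
    costs = []
    for year in range(years):
        yearly_total = annual_storage_cost * (1 + annual_increase) ** year
        costs.append(round(yearly_total * duplicate_share, 2))
    return costs


# Hypothetical inputs: a $100,000 storage bill, 20% duplicates, 8% annual price growth.
projection = duplicate_carrying_cost(100_000, 0.20, 0.08, 3)
for year, cost in enumerate(projection, start=1):
    print(f"Year {year}: ${cost:,.2f} spent storing duplicates")
```

Under those assumptions the duplicate share costs $20,000 in the first year, $21,600 in the second and $23,328 in the third, or roughly $65,000 over three years before any growth in the collection itself.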
For archivists and records managers watching this play out, the immediate practical step is a collection audit — ideally completed before October 1, the start of the federal fiscal year — that quantifies the scope of duplication and produces a written policy on version hierarchy. Without that document, no funding application, internal or external, will get far. The files are not going anywhere on their own.