Thousands of duplicate digital images are clogging the collections of Washington DC's major public archives, costing taxpayers money and slowing down researchers who rely on those records. The problem has moved from a back-office headache into a policy conversation — and the people responsible for managing those collections are pushing for coordinated action before backlogs grow worse.
The timing matters for two reasons: staffing and money. The Trump administration's restructuring of the federal workforce, carried out in part through the Department of Government Efficiency, has shrunk staffing at federal records agencies. Fewer archivists processing larger digital holdings means that manual deduplication, the time-consuming work of identifying and removing copied image files, is falling further behind. At the same time, DC's own municipal institutions are navigating federal funding uncertainty that limits what software tools they can purchase.
What Officials Are Saying on the Ground
At the DC Public Library system, which operates its flagship Martin Luther King Jr. Memorial Library on G Street NW, administrators have acknowledged internally that their digitization push — accelerated during the pandemic years — generated significant image redundancy. Library officials have said they are evaluating automated deduplication software but have not yet committed to a specific procurement timeline or vendor. The library's Special Collections division, which holds photographic records of neighborhoods including Anacostia and Congress Heights, is among the units most affected.
The National Archives and Records Administration, headquartered at the Pennsylvania Avenue building between 7th and 9th Streets NW, oversees billions of digital files across its holdings. A 2024 Government Accountability Office report — publicly available on GAO's website — found that federal agencies collectively stored an estimated 30 to 40 percent more digital data than they actively needed, a figure that archivists cite when arguing the problem is systemic, not isolated. NARA has piloted AI-assisted image comparison tools in limited contexts, though no agencywide rollout date has been announced.
At the Smithsonian Institution, whose digitization office operates out of its National Museum of Natural History on the Mall, staff have described the duplicate image problem as an inherited consequence of migrating older analog collections onto multiple successive digital platforms over the past two decades. Each platform migration left ghost copies. The Smithsonian's Open Access program, launched in February 2020, released 4.7 million digital records to the public — and experts in digital preservation say that release itself surfaced the scale of duplication for the first time.
What Experts Recommend — and What It Will Cost
Digital preservation specialists at the Washington-based Council on Library and Information Resources have argued publicly that institutions need to adopt hash-based deduplication as a baseline standard rather than an optional upgrade. The technique computes a fingerprint from each image file's contents, so exact copies produce the same fingerprint and can be flagged even when their file names differ. Commercial tools capable of handling large archival collections typically run between $15,000 and $80,000 annually depending on collection size, a cost that strains budgets already squeezed by DOGE-era federal grant reductions.
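For readers curious what hash-based deduplication looks like in practice, the following is a minimal Python sketch, not a description of any tool these institutions use. It assumes SHA-256 as the fingerprint (the article does not name an algorithm), and the function names `file_digest`, `group_duplicates` and `find_duplicate_files` are hypothetical, chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading in chunks to bound memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_duplicates(digests_by_name):
    """Given {name: digest}, return sorted groups of names that share a digest."""
    groups = defaultdict(list)
    for name, digest in digests_by_name.items():
        groups[digest].append(name)
    # Keep only digests seen more than once: those are the exact copies.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def find_duplicate_files(root):
    """Walk a directory tree and report groups of byte-identical files."""
    digests = {str(p): file_digest(p) for p in Path(root).rglob("*") if p.is_file()}
    return group_duplicates(digests)
```

Calling `find_duplicate_files` on a folder of scans would return one list per set of identical files, whatever they are named. A sketch like this only catches byte-for-byte copies; an image that has been resized or re-saved in another format produces a different fingerprint and would need a separate comparison method.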
Local historians who work regularly at the Washingtoniana Division of the MLK Library say the practical consequence is real: searching for historical photographs of neighborhoods like NoMa or Shaw sometimes surfaces the same image four or five times under different file names, slowing research and occasionally causing confusion about what constitutes an original record versus a copy.
Mayor Muriel Bowser's office has not issued a specific policy directive on municipal archive deduplication, though the city's Office of the Chief Technology Officer has broader data management guidelines that archivists say could be applied to this problem with modest political will and a dedicated budget line.
For researchers and members of the public who use these collections, the practical advice from digital archivists is straightforward: cross-reference file metadata, check acquisition dates on any image retrieved from municipal or federal portals, and flag apparent duplicates through the formal feedback mechanisms each institution maintains. The MLK Library accepts collection feedback through its Ask-a-Librarian portal. Every duplicate a user reports is one fewer that staff have to find themselves, and right now staff capacity is the scarcest resource of all.