Document Preservation: Digital Perpetuity

Digital preservation is the active management of digital objects to ensure they remain accessible, authentic, and readable across successive generations of hardware and software. Unlike a simple backup, which only guards against loss, preservation also addresses **format obsolescence** and **physical data decay**.

1. The PDF/A-3 Standard

PDF/A (ISO 19005) is the ISO-standardized subset of PDF designed for long-term archiving. **PDF/A-3** (ISO 19005-3, published in 2012) allows files of *any* format to be embedded as attachments within the PDF/A document.

1.1 Key Preservation Features

* **Self-Containment:** All fonts, color profiles, and metadata are embedded in the file.

* **Device Independence:** The file carries everything needed to reproduce its visual appearance consistently, independent of the rendering software or hardware.

* **No External References:** The file cannot depend on external content (e.g., linked fonts or images) that might disappear. JavaScript and other executable content are prohibited altogether.

* **Hybrid Archiving (A-3):** You can store the "human-readable" PDF alongside the "machine-readable" source (e.g., an XML or Excel file) in a single archival unit.

2. Bit-Rot Prevention: The Checksum Mandate

"Bit-rot" is the silent corruption of stored bits caused by media degradation, cosmic rays, electromagnetic interference, or hardware failure.

2.1 Cryptographic Checksums (Hashing)

Every preserved artifact must be accompanied by a cryptographic hash (e.g., SHA-256) recorded at ingest. The hash acts as a digital fingerprint: changing even a single bit of the file produces a different value.
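
As a minimal sketch of the idea, the following Python computes a SHA-256 fingerprint with the standard-library `hashlib` module and stores it next to the artifact. The function names and the `.sha256` sidecar convention are assumptions made for illustration, not part of any standard described here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Stream the file so large archival objects never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path: Path) -> Path:
    """Write '<digest>  <filename>' beside the file (hypothetical sidecar layout)."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(f"{sha256_of_file(path)}  {path.name}\n", encoding="utf-8")
    return sidecar
```

Calling `write_sidecar(Path("report.pdf"))` would produce `report.pdf.sha256` in the same directory. In practice the known-good hash should also be kept somewhere independent of the file itself, so that both cannot be damaged together.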

| Algorithm | Strength | Purpose |
| :--- | :--- | :--- |
| **MD5** | Broken | Legacy use only; vulnerable to collision attacks. |
| **SHA-256** | High | Current industry standard for integrity verification. |
| **BLAKE3** | High | Very fast, parallelizable hashing for large-scale archival scrubbing. |

2.2 Proactive Scrubbing

A resilient system does not wait for a user to report a corrupt file. It implements **Data Scrubbing**:

1. **Read:** Periodically read all archived data.

2. **Verify:** Re-calculate the hash and compare it to the original "known good" hash.

3. **Repair:** If a mismatch is found, restore the file from an independent redundant copy that itself passes verification (see the 3-2-1-1 rule in section 3.3).
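
The read, verify, and repair steps above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not a production scrubber: the `manifest` dictionary (relative file name to known-good SHA-256 digest), the `primary` and `replica` directories, and the status strings are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(primary: Path, replica: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Check every manifest entry and return {name: status}.

    Status is "ok", "repaired", or "unrecoverable" (hypothetical labels).
    """
    report = {}
    for name, expected in manifest.items():
        target = primary / name
        # Read + Verify: recompute the hash and compare with the known-good value.
        if target.is_file() and sha256_of_file(target) == expected:
            report[name] = "ok"
            continue
        # Repair: trust the replica only if it matches the known-good hash itself.
        source = replica / name
        if source.is_file() and sha256_of_file(source) == expected:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            report[name] = "repaired"
        else:
            report[name] = "unrecoverable"
    return report
```

Verifying the replica before copying matters: restoring from a copy that has also decayed would silently replace one corrupt file with another. A scheduler such as cron could run `scrub` periodically and alert on any `"unrecoverable"` entry.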

3. Practitioner Insights

3.1 Avoid Proprietary Blobs

Never archive data in proprietary binary formats (e.g., old `.doc` or `.psd`). Always normalize to open, documented standards like PDF/A, TIFF, or plain text (UTF-8).

3.2 Embedded vs. External Metadata

While external databases are fast for searching, essential metadata (Author, Date, Provenance) should be embedded *inside* the preservation file (e.g., XMP metadata in PDF/A) to ensure the file remains self-describing if separated from the database.

3.3 The 3-2-1-1 Rule

* **3** copies of data.

* **2** different media types (e.g., SSD and LTO Tape).

* **1** copy offsite.

* **1** copy **air-gapped** (completely offline to protect against ransomware).
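
A storage inventory can be checked against this rule mechanically. The sketch below assumes a hypothetical `Copy` record with `location`, `media`, `offsite`, and `air_gapped` fields; how those facts are gathered in a real deployment is outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of the archive (hypothetical inventory record)."""
    location: str      # free-text label, e.g. "datacenter-a"
    media: str         # media type, e.g. "ssd" or "lto-tape"
    offsite: bool      # stored away from the primary site
    air_gapped: bool   # completely offline


def check_3211(copies: list[Copy]) -> dict[str, bool]:
    """Evaluate each clause of the 3-2-1-1 rule for the given copies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
        "one_air_gapped": any(c.air_gapped for c in copies),
    }


inventory = [
    Copy("datacenter-a", "ssd", offsite=False, air_gapped=False),
    Copy("datacenter-a", "lto-tape", offsite=False, air_gapped=False),
    Copy("vault-b", "lto-tape", offsite=True, air_gapped=True),
]
result = check_3211(inventory)
```

With this example inventory every clause evaluates to `True`; removing the vault copy would fail the copy-count, offsite, and air-gap clauses at once, which shows how much a single well-placed copy can carry.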