A production-grade system for cleaning, standardizing, and validating GEDCOM 5.5.1 datasets with full audit trails and a zero-data-loss guarantee.
Genealogy data is high-value but messy, fragmented, and expensive to clean. This project delivers an automated, integrity-first pipeline that transforms raw GEDCOM files into standardized, import-ready assets while preserving every relationship. It combines a reusable Python library, a CLI, and a full processing workflow designed for repeatable, professional-grade outcomes.
The genealogy market continues to grow as genetic testing services expand and millions of consumers build digital family trees across platforms like Ancestry, FamilySearch, and MyHeritage. GEDCOM, the universal standard for genealogical data exchange, has been in use since 1984, and decades of files created by different software, in different languages, with different naming and date conventions have produced a massive corpus of data that is technically interoperable but practically inconsistent. Professional genealogists and archival organizations spend thousands of hours manually reviewing records, resolving duplicates, and standardizing formats before data can be migrated between platforms or published in authoritative databases. Our platform automates this work with surgical precision while maintaining the zero-data-loss guarantee that genealogy professionals require.
The platform architecture is deliberately dual-mode: the gedfix Python library can be imported as a dependency into larger systems including web applications, SaaS platforms, and batch processing pipelines, while the CLI provides a standalone tool for professional genealogists who need immediate access to the full processing capability without writing code. This dual distribution strategy maximizes addressable market by serving both enterprise integration buyers and individual professional users from a single codebase with shared validation logic and processing algorithms.
Genealogical data spans centuries of naming conventions, calendar systems, and geographic boundaries. Our normalization engine handles the full complexity of real-world family history data while maintaining complete traceability for every transformation applied.
Historical records contain dates in dozens of formats: Julian and Gregorian calendars with dates that fall in the transition period between 1582 and 1752 depending on country of origin, partial dates with only a year or month specified, approximate dates marked as "about," "circa," "before," or "after," date ranges with "between X and Y" constructions, and dual-dated entries from the Old Style to New Style calendar transition where both dates appear on the original record. Our date normalization engine parses all recognized GEDCOM date prefixes and modifiers, converts them to a canonical ISO-style representation, and preserves the original value in an AutoFix note attached to the record so that researchers can always verify the transformation and revert if the automated interpretation was incorrect.
The parser implements the complete GEDCOM 5.5.1 date grammar including the DATE_VALUE, DATE_PERIOD, DATE_RANGE, and DATE_APPROXIMATED productions. Ambiguous month-day orderings in numeric dates are resolved using configurable locale rules, defaulting to the convention matching the file's declared SOUR system. Dates that cannot be parsed with high confidence are flagged for human review rather than silently guessed, ensuring that the automated pipeline never introduces incorrect date interpretations into the dataset.
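As a rough sketch of this approach, the snippet below normalizes a single date value with python-dateutil, keeps the original string for the AutoFix note, and flags anything it cannot parse with confidence; the modifier table and the NormalizedDate shape are assumptions for illustration, not the production grammar.

```python
import re
from dataclasses import dataclass
from typing import Optional

from dateutil import parser as dateparser  # python-dateutil

# Modifier table and NormalizedDate shape are illustrative assumptions.
GEDCOM_MODIFIERS = {
    "ABT": "about", "CAL": "calculated", "EST": "estimated",
    "BEF": "before", "AFT": "after",
}

@dataclass
class NormalizedDate:
    original: str              # preserved verbatim for the AutoFix note
    iso: Optional[str]         # canonical ISO-style value, if confidently parsed
    qualifier: Optional[str]   # e.g. "about", "before"
    needs_review: bool = False

def normalize_date(raw: str, dayfirst: bool = True) -> NormalizedDate:
    value = raw.strip().upper()
    qualifier = None
    for prefix, label in GEDCOM_MODIFIERS.items():
        if value.startswith(prefix + " "):
            qualifier, value = label, value[len(prefix) + 1:]
            break
    # Ranges and periods (BET ... AND, FROM ... TO) go to a dedicated handler.
    if re.match(r"^(BET|FROM)\b", value):
        return NormalizedDate(raw, None, qualifier, needs_review=True)
    # Year-only partial dates are kept as a bare year rather than guessed.
    if re.fullmatch(r"\d{3,4}", value):
        return NormalizedDate(raw, value.zfill(4), qualifier)
    try:
        parsed = dateparser.parse(value, dayfirst=dayfirst)
        return NormalizedDate(raw, parsed.date().isoformat(), qualifier)
    except (ValueError, OverflowError):
        # Never guess: unparseable values are flagged for human review.
        return NormalizedDate(raw, None, qualifier, needs_review=True)
```

Range, period, and dual-date productions would be handled by dedicated routines in the full parser.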
Names in genealogical records vary wildly due to spelling evolution over centuries, transliteration differences between alphabets, patronymic versus surname systems, and cultural naming patterns where individuals are known by multiple names across their lifetime. The platform uses fuzzy matching algorithms powered by rapidfuzz to identify potential duplicates across variant spellings, with separate scoring for given names, surnames, and maiden names. The matching pipeline applies Jaro-Winkler similarity for spelling variants that share a common stem, Levenshtein distance for typographical errors, and Soundex grouping for names that sound identical but are spelled differently. Match candidates that exceed the configurable similarity threshold are proposed as potential merges with a confidence score that indicates the strength of the evidence.
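A minimal sketch of a composite name scorer in this style is shown below, using rapidfuzz for the Jaro-Winkler and Levenshtein terms and a simplified Soundex; the weights and the Soundex implementation are illustrative assumptions, not the shipped configuration.

```python
from rapidfuzz.distance import JaroWinkler, Levenshtein

# Weights and the simplified Soundex below are illustrative assumptions.
_SOUNDEX_CODES = {c: str(d) for d, letters in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code, prev = letters[0], _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]

def name_similarity(a: str, b: str) -> float:
    jw = JaroWinkler.normalized_similarity(a, b)    # transpositions, shared prefixes
    lev = Levenshtein.normalized_similarity(a, b)   # typographical errors
    sdx = 1.0 if soundex(a) == soundex(b) else 0.0  # sound-alike grouping
    return 0.5 * jw + 0.3 * lev + 0.2 * sdx

print(name_similarity("Katharina Muller", "Katherine Mueller"))
```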
Place names receive equivalent treatment: historical locality names are normalized to modern geographic identifiers while the original place descriptions are preserved for archival accuracy. The place standardization engine recognizes hierarchical place structures — village, parish, county, state, country — and resolves historical administrative divisions that no longer exist to their modern equivalents. A record citing a village in a Prussian province that was absorbed into Poland after 1945 is mapped to its current Polish administrative location while retaining the original German place name in the notes field, maintaining both historical accuracy and modern searchability.
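A toy version of that historical-to-modern mapping might look like the following; the gazetteer entry and field names are hypothetical placeholders, not shipped reference data.

```python
from dataclasses import dataclass

# Hypothetical gazetteer entry; the mapping table is illustrative, not shipped data.
HISTORICAL_PLACES = {
    # historical form -> modern GEDCOM-style place hierarchy
    "Stolp, Pommern, Preussen": "Słupsk, Pomorskie, Poland",
}

@dataclass
class PlaceResolution:
    original: str   # place string exactly as written in the record
    modern: str     # current administrative hierarchy used for searchability
    note: str       # preserved historical context for the notes field

def resolve_place(raw: str) -> PlaceResolution:
    key = raw.strip()
    modern = HISTORICAL_PLACES.get(key, key)
    note = f"Original place name preserved: {key}" if modern != key else ""
    return PlaceResolution(key, modern, note)
```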
The scanner reads the complete GEDCOM file and produces a prioritized issue report categorized by severity: critical issues that indicate data corruption or structural violations, warnings for inconsistencies that may affect data quality, and informational notes for stylistic variations that differ from the target standard. Each issue includes the affected record identifier, line number, current value, suggested correction, and confidence level for the proposed fix.
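The issue records described above could be modeled roughly as follows; the field names and JSON shape mirror the prose but are assumptions, not the exact report schema.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"   # data corruption or structural violations
    WARNING = "warning"     # inconsistencies that may affect data quality
    INFO = "info"           # stylistic deviations from the target standard

@dataclass
class Issue:
    record_id: str      # affected GEDCOM record, e.g. "@I123@"
    line: int           # line number in the source file
    current: str        # value as found
    suggestion: str     # proposed correction
    confidence: float   # 0.0 - 1.0 confidence in the proposed fix
    severity: Severity

issue = Issue("@I123@", 842, "ABT 1850-13-02", "ABT 2 FEB 1850", 0.78, Severity.WARNING)
print(json.dumps({**asdict(issue), "severity": issue.severity.value}, indent=2))
```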
The standardization engine applies date normalization, name formatting, place resolution, and encoding corrections while recording every transformation in an AutoFix note system. Each note captures the original value, the applied transformation rule, the resulting value, and a timestamp. This audit trail satisfies archival provenance requirements and enables selective rollback of individual changes without affecting the rest of the processed dataset.
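A minimal sketch of such an audit entry, assuming hypothetical field names and a simple NOTE rendering, might look like this.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AutoFixNote:
    original: str    # value before the change
    rule: str        # transformation rule that was applied
    result: str      # value after the change
    timestamp: str   # UTC timestamp of the processing run

    @classmethod
    def record(cls, original: str, rule: str, result: str) -> "AutoFixNote":
        return cls(original, rule, result, datetime.now(timezone.utc).isoformat())

    def as_gedcom_note(self) -> str:
        # Rendered as a NOTE line so the audit trail travels inside the file;
        # the level number depends on where the note is attached.
        return (f"2 NOTE AutoFix [{self.rule}] "
                f"'{self.original}' -> '{self.result}' at {self.timestamp}")

note = AutoFixNote.record("ABT 1850-13-02", "date-normalization", "ABT 2 FEB 1850")
print(note.as_gedcom_note())
```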
The deduplication engine analyzes the full relationship graph before proposing any merge, ensuring parent-child links, marriage records, and sibling groups remain intact. Merge candidates are scored by confidence level using a composite metric that weighs name similarity, date proximity, place overlap, and relationship context. Low-confidence matches are flagged for human review with a side-by-side comparison report rather than processed automatically.
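The composite confidence metric might be weighted along these lines; the weights and the review threshold are illustrative assumptions.

```python
# Weights and the review threshold are illustrative assumptions.
WEIGHTS = {"name": 0.40, "date": 0.25, "place": 0.20, "relationship": 0.15}
REVIEW_THRESHOLD = 0.85

def merge_confidence(name: float, date: float, place: float, relationship: float) -> float:
    """Composite score over per-dimension similarities in the range 0.0 - 1.0."""
    return (WEIGHTS["name"] * name + WEIGHTS["date"] * date +
            WEIGHTS["place"] * place + WEIGHTS["relationship"] * relationship)

def route(score: float) -> str:
    # Low-confidence candidates go to the side-by-side review report.
    return "auto-merge" if score >= REVIEW_THRESHOLD else "human-review"

print(route(merge_confidence(0.97, 0.90, 0.80, 1.00)))
```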
Every processing run creates a timestamped backup of the original dataset before any modifications occur. The system maintains a complete change log recording every transformation with before-and-after values, enabling selective rollback of individual changes or full restoration to any prior state. This architecture gives researchers the confidence to run aggressive cleaning operations knowing that no information can be permanently lost.
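A minimal version of the pre-run backup step, assuming a simple timestamped file layout, could look like this.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup(source: Path, backup_dir: Path) -> Path:
    """Copy the original file into a timestamped backup before any processing."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"{source.stem}.{stamp}{source.suffix}"
    shutil.copy2(source, target)   # preserves file metadata along with contents
    return target
```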
Post-processing verification confirms that no relationships were severed, no records were orphaned, and all GEDCOM structural rules remain satisfied after cleaning. The verifier compares relationship counts, individual counts, and family group checksums between the original and processed files, flagging any discrepancies for investigation. This step serves as the final quality gate before the processed file is approved for export.
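The structural comparison could be expressed roughly as below, using a simplified stand-in for the parsed model; the attribute names are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Simplified stand-in for the parsed model; attribute names are assumptions.
@dataclass
class ParsedGedcom:
    individuals: set = field(default_factory=set)   # e.g. {"@I1@", "@I2@"}
    families: set = field(default_factory=set)      # e.g. {"@F1@"}
    edges: set = field(default_factory=set)         # (family, role, individual) triples

def verify(original: ParsedGedcom, processed: ParsedGedcom) -> list:
    problems = []
    if len(original.individuals) != len(processed.individuals):
        problems.append("individual count changed")
    if len(original.families) != len(processed.families):
        problems.append("family count changed")
    if original.edges != processed.edges:
        problems.append("relationship links differ")   # severed or orphaned link
    return problems   # an empty list means the final quality gate passed
```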
The export module generates GEDCOM files optimized for import into specific target platforms including Ancestry, FamilySearch, Gramps, and RootsMagic. Each export profile adjusts encoding, tag usage, and structural conventions to match the target platform's import parser requirements. Alongside the GEDCOM output, the system generates an actionable review report documenting all changes made, issues remaining, and recommendations for manual follow-up.
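Export profiles might be represented along these lines; the per-platform settings shown are placeholders, not the shipped profile definitions.

```python
# Hypothetical export profiles; per-platform settings are assumptions for illustration.
EXPORT_PROFILES = {
    "ancestry":     {"encoding": "utf-8", "newline": "\r\n"},
    "familysearch": {"encoding": "utf-8", "newline": "\r\n"},
    "gramps":       {"encoding": "utf-8", "newline": "\n"},
    "rootsmagic":   {"encoding": "utf-8", "newline": "\r\n"},
}

def export_gedcom(lines: list, platform: str) -> bytes:
    """Render pre-cleaned GEDCOM lines using the target platform's profile."""
    profile = EXPORT_PROFILES[platform]
    text = profile["newline"].join(lines) + profile["newline"]
    return text.encode(profile["encoding"])
```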
Built as both a reusable Python library and a command-line interface using Click, the platform integrates into existing genealogy workflows or operates as a standalone processing engine with full control over every pipeline stage.
The CLI is built on Click with subcommands for each processing stage: scan, normalize, deduplicate, verify, export, and rollback. Each command accepts the input GEDCOM file path, an optional configuration file for tuning behavior, and output directory specification. The scan command produces a structured JSON report that can be piped into downstream analysis tools or rendered as a human-readable summary in the terminal. Processing commands support a dry-run mode that previews all proposed changes without writing to disk, allowing operators to review the transformation plan before committing. The CLI also implements a full end-to-end pipeline command that chains all stages in sequence with automatic backup creation, progress reporting, and verification checking, providing a single-command experience for the common case of complete file processing.
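A skeletal version of this command layout might look like the following; the option names and defaults are assumptions rather than the shipped flags.

```python
import click

@click.group()
def cli():
    """gedfix: scan, normalize, deduplicate, verify, export, rollback."""

@cli.command()
@click.argument("gedcom_file", type=click.Path(exists=True))
@click.option("--config", type=click.Path(exists=True), help="Optional tuning configuration file.")
@click.option("--output-dir", type=click.Path(), default="reports", help="Where to write reports.")
@click.option("--json", "as_json", is_flag=True, help="Emit the structured JSON report.")
def scan(gedcom_file, config, output_dir, as_json):
    """Produce a prioritized issue report for GEDCOM_FILE."""
    ...

@cli.command()
@click.argument("gedcom_file", type=click.Path(exists=True))
@click.option("--dry-run", is_flag=True, help="Preview proposed changes without writing to disk.")
def normalize(gedcom_file, dry_run):
    """Apply date, name, place, and encoding normalization."""
    ...

if __name__ == "__main__":
    cli()
```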
The gedfix library exposes a clean Python API with typed dataclass models for GEDCOM individuals, families, and events. The GedcomFile class provides the entry point for loading and manipulating GEDCOM data programmatically. Each processing module — scanner, normalizer, deduplicator, verifier — is instantiated independently and operates on the in-memory model, enabling selective application of processing stages and custom pipeline construction. The library uses ged4py for low-level GEDCOM parsing and serialization, python-dateutil for robust date handling, and rapidfuzz for efficient fuzzy string matching with configurable scoring thresholds. All processing functions are designed to be composable: a downstream system can import only the date normalizer, or the name matcher, or the full pipeline, depending on its specific requirements.
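Typical programmatic use, assuming method and module names beyond the GedcomFile entry point described above, might read along these lines.

```python
# Method and module names other than GedcomFile are assumptions for illustration.
from gedfix import GedcomFile
from gedfix.scanner import Scanner
from gedfix.normalizer import Normalizer

gedcom = GedcomFile.load("family_tree.ged")       # in-memory typed model

issues = Scanner().scan(gedcom)                   # prioritized issue report
cleaned = Normalizer().apply(gedcom)              # AutoFix notes record every change

cleaned.save("family_tree.cleaned.ged")
```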
Explicit verification and rollback tooling ensures zero data loss during processing. Every transformation is reversible and documented with before-and-after values. The verification stage provides mathematical guarantees that relationship counts, individual counts, and family group structures are preserved through the cleaning process. This rigor exceeds what any competing tool offers and is essential for organizations whose data has legal, medical, or archival significance.
Documented change trails provide complete transparency for every modification, meeting archival standards for provenance tracking that professional genealogists, national archives, and forensic genealogy firms require. The AutoFix note system creates a permanent record within the GEDCOM file itself, so the processing history travels with the data even when the file is transferred between systems or organizations.
The packaged CLI and library can be integrated into larger systems, SaaS platforms, and automated data migration workflows through standard Python packaging with pyproject.toml distribution. Genealogy SaaS platforms can embed the processing engine to offer automated data cleaning as a premium feature. Archive management systems can integrate the scanner to validate incoming GEDCOM submissions before acceptance into their collections.
Unlike simple record-matching tools that compare individuals in isolation, our deduplication engine understands the family relationship graph and validates that every proposed merge preserves the structural integrity of the family tree. This graph-aware approach prevents the catastrophic errors — orphaned children, circular ancestry, severed marriages — that plague tools which treat genealogy records as flat database rows rather than nodes in a connected graph.
Quantifiable improvements in genealogy data quality and processing efficiency across real-world datasets containing tens of thousands of individual records and complex multi-generational family structures.
Issues auto-resolved with documented changes
Records processed with zero relationship breakage
Faster than manual review and correction
Broken relationships across all processing runs
Professional genealogists typically spend days or weeks manually reviewing and correcting GEDCOM files containing tens of thousands of records. Our automated pipeline reduces that work to minutes while maintaining higher consistency than manual review. The structured issue reports eliminate the need for line-by-line file inspection, focusing human attention on the five percent of issues that require judgment rather than the ninety-five percent that follow deterministic correction rules.
The zero-broken-relationships guarantee means organizations can run the pipeline with confidence that family connections will never be severed or corrupted during processing. Post-processing verification provides mathematical confirmation that the output file contains the same number of individuals, families, and relationship links as the input, with all structural invariants preserved. This guarantee is essential for archives, heritage organizations, and forensic genealogy firms where data integrity has legal and evidentiary significance.
Core library with CLI, scanner, normalizer, and deduplicator
Processing workflows, integrity checks, and rollback tools
Processing workspaces, exports, reports, and methodology
Learn how we can build custom data processing and integrity systems for your organization.