A production-grade system for cleaning, standardizing, and validating GEDCOM 5.5.1 datasets with audit trails and zero-data-loss guarantees - transforming raw genealogy data into import-ready assets while preserving every relationship.
Genealogy data is high-value but messy, fragmented, and expensive to clean. This project delivers an automated, integrity-first pipeline that does that cleaning: it standardizes raw GEDCOM files into import-ready form, validates the result, and keeps every family connection intact along the way.
The $4B+ genealogy market demands professional-grade data processing. By combining a reusable Python library with a CLI and a full processing workflow, we enable organizations to perform repeatable, auditable data migrations at scale. As genetic testing services expand and millions of consumers build digital family trees, the demand for reliable data-normalization infrastructure is growing faster than existing solutions can serve.
GEDCOM, the universal standard for genealogical data exchange, has been in use since 1984. Decades of files created by different software, in different languages, with different conventions have produced a massive corpus of data that is technically interoperable but practically inconsistent. Our platform addresses this gap with surgical precision, turning fragmented records into research-grade datasets that genealogists and archivists can trust.
Genealogical data spans centuries of naming conventions, calendar systems, and geographic boundaries. Our normalization engine handles the full complexity of real-world family history data.
Historical records contain dates in dozens of formats: Julian and Gregorian calendars, partial dates with only a year or month, approximate dates marked as "about" or "circa," dual-dated entries from the calendar transition period, and culturally specific conventions like Hebrew or French Republican calendars. Our date normalization engine parses all recognized formats, converts them to a canonical representation, and preserves the original value in an AutoFix note so that researchers can always verify the transformation.
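To make this concrete, here is a minimal sketch of this style of parsing in Python. The function name, the subset of qualifiers and calendars handled, and the AutoFix note format below are illustrative assumptions, not the engine's actual API.

```python
import re

# Subset of GEDCOM 5.5.1 date qualifiers and month tokens, for illustration.
QUALIFIERS = {"ABT": "about", "EST": "estimated", "CAL": "calculated"}
MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

def normalize_date(raw: str) -> dict:
    """Parse one GEDCOM date value into a canonical dashed form.

    Partial dates stay partial (a year-only input yields "1850"), and the
    original text is preserved in an AutoFix-style note so the
    transformation can always be verified.
    """
    tokens = raw.strip().upper().split()
    qualifier = QUALIFIERS.get(tokens[0]) if tokens else None
    if qualifier:
        tokens = tokens[1:]

    day = month = year = None
    for tok in tokens:
        if tok in MONTHS:
            month = MONTHS[tok]
        elif re.fullmatch(r"\d{1,2}", tok):
            day = int(tok)
        elif re.fullmatch(r"\d{3,4}(/\d{2})?", tok):
            year = int(tok.split("/")[0])  # dual-dated "1699/00" keeps 1699

    parts = [f"{year:04d}" if year else None,
             f"{month:02d}" if month else None,
             f"{day:02d}" if day else None]
    canonical = "-".join(p for p in parts if p) or None

    return {"canonical": canonical,
            "qualifier": qualifier,
            "autofix_note": f"AutoFix: normalized from '{raw}'"}

print(normalize_date("ABT 12 MAR 1699/00"))
# {'canonical': '1699-03-12', 'qualifier': 'about', 'autofix_note': ...}
```

The key design point the sketch preserves is that the original string travels with the canonical value, so no normalization is ever destructive.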
Names in genealogical records vary wildly due to spelling changes, transliteration differences, and cultural naming patterns. The platform uses fuzzy matching algorithms powered by rapidfuzz to identify potential duplicates across variant spellings, maiden versus married names, and abbreviated forms. Place names receive similar treatment, normalizing historical place names to modern geographic identifiers while preserving original locality descriptions for archival accuracy.
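The rapidfuzz primitive behind this kind of matching can be shown in a few lines. The sample names, the `token_sort_ratio` scorer choice, and the 80-point cutoff below are illustrative assumptions; the platform's actual thresholds and scorers are not specified here.

```python
from rapidfuzz import fuzz, process

# Sample individual names as they might appear after parsing a GEDCOM file.
names = ["Katarzyna Nowak", "Catherine Novak",
         "Nowak, Katarzyna", "Jan Kowalski"]

def duplicate_candidates(query: str, pool: list[str], cutoff: float = 80.0):
    """Return (name, score, index) tuples for likely variants of `query`.

    token_sort_ratio ignores word order, so "Nowak, Katarzyna" and
    "Katarzyna Nowak" score as near-identical strings.
    """
    return process.extract(query, pool,
                           scorer=fuzz.token_sort_ratio,
                           score_cutoff=cutoff,
                           limit=None)

for name, score, _ in duplicate_candidates("Katarzyna Nowak", names):
    print(f"{score:5.1f}  {name}")
```

Transliterated variants like "Catherine Novak" score lower than reordered ones, which is why scored candidates, rather than binary matches, feed the review workflow.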
Comprehensive issue detection with prioritized reporting. Identifies date inconsistencies, name variations, duplicate records, and relationship anomalies across the entire dataset.
Automated cleaning with AutoFix notes and audit trails. Date normalization, name standardization, and place formatting with full traceability of every transformation applied.
Integrity verification with rollback capabilities. Export to major genealogy platforms with confidence in data quality and complete processing documentation.
Built as both a reusable Python library and a command-line interface, the platform integrates into existing genealogy workflows or operates as a standalone processing engine.
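The sketch below shows the generic library-plus-CLI pattern such a tool follows: one importable function wrapped by an argparse entry point that pyproject.toml can expose as a console script. Every name and flag here is hypothetical, not gedfix's actual interface.

```python
import argparse
from pathlib import Path

def clean_file(src: Path, dest: Path, *, dry_run: bool = False) -> int:
    """Library entry point: clean one GEDCOM file and report issue count.

    Hypothetical signature; the parsing and normalization passes that
    would run here are omitted from this sketch.
    """
    fixed = 0
    text = src.read_text(encoding="utf-8")
    # ... normalization passes would run here, incrementing `fixed` ...
    if not dry_run:
        dest.write_text(text, encoding="utf-8")
    return fixed

def main() -> None:
    """CLI wrapper over the same function, so one code path serves both
    standalone use and embedding in larger systems."""
    ap = argparse.ArgumentParser(prog="gedfix",
                                 description="Clean a GEDCOM file.")
    ap.add_argument("src", type=Path)
    ap.add_argument("dest", type=Path)
    ap.add_argument("--dry-run", action="store_true",
                    help="report issues without writing output")
    args = ap.parse_args()
    count = clean_file(args.src, args.dest, dry_run=args.dry_run)
    print(f"fixed {count} issue(s)")

if __name__ == "__main__":
    main()
```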
Merging duplicate records in genealogy data is uniquely challenging because every individual exists within a web of family relationships. Naively merging two records that represent the same person can orphan children, create circular ancestry loops, or sever spousal connections. Our deduplication engine analyzes the full relationship graph before proposing any merge, ensuring that parent-child links, marriage records, and sibling groups remain intact throughout the process. Merge candidates are scored by confidence level, and low-confidence matches are flagged for human review rather than processed automatically.
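The core safety property, never letting a merge create a circular ancestry, reduces to a reachability check on the relationship graph. Below is a minimal sketch of that guard; the real engine also scores candidates and checks spousal and sibling links, which this toy example omits.

```python
from collections import defaultdict

# parent_of maps a person ID to the IDs of their children (toy graph).
parent_of: dict[str, set[str]] = defaultdict(set)
parent_of["I1"] = {"I2"}   # I1 is a parent of I2
parent_of["I2"] = {"I3"}   # I2 is a parent of I3

def is_ancestor(graph: dict[str, set[str]], a: str, b: str) -> bool:
    """True if `a` appears anywhere in `b`'s line of descent."""
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def safe_to_merge(graph, a: str, b: str) -> bool:
    """Refuse merges that would make a person their own ancestor."""
    return not (is_ancestor(graph, a, b) or is_ancestor(graph, b, a))

print(safe_to_merge(parent_of, "I1", "I3"))  # False: merging creates a loop
print(safe_to_merge(parent_of, "I1", "I4"))  # True: unrelated individuals
```

Running the guard before any merge is what keeps parent-child links intact even when two records look like strong textual matches.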
Every processing run creates a timestamped backup of the original dataset before any modifications occur. The system maintains a complete change log that records every transformation with before-and-after values, enabling selective rollback of individual changes or full restoration to any prior state. This architecture gives researchers the confidence to run aggressive cleaning operations knowing that no information can be permanently lost.
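A before-and-after change log is the data structure that makes this reversibility work. The JSON-lines format and function names below are assumptions for illustration; they show how recording both values lets a rollback replay the log newest-first.

```python
import json
import time
from pathlib import Path

def log_change(log: Path, record_id: str, field: str,
               before: str, after: str) -> None:
    """Append one transformation to an append-only JSON-lines change log.

    Storing both values is what makes the entry reversible: writing
    `before` back to `record_id.field` undoes it exactly.
    """
    entry = {"ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
             "record": record_id, "field": field,
             "before": before, "after": after}
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

def rollback(log: Path, restore) -> None:
    """Undo every logged change, newest first, via a caller-supplied setter."""
    lines = log.read_text(encoding="utf-8").splitlines()
    for entry in reversed([json.loads(line) for line in lines]):
        restore(entry["record"], entry["field"], entry["before"])

# Example: record a date fix, then roll everything back.
log = Path("changes.jsonl")
log_change(log, "@I42@", "BIRT.DATE", "ABT 1850", "1850")
rollback(log, lambda rec, field, value:
         print(f"restore {rec} {field} -> {value!r}"))
```

In practice the `restore` callback would write into the parsed GEDCOM model rather than print, and selective rollback would filter the log by record before replaying.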
Explicit verification and rollback tooling ensures zero data loss during processing. Every transformation is reversible and documented.
Documented change trails provide complete transparency for every modification, meeting archival standards for provenance tracking.
Packaged CLI and library can be integrated into larger systems, SaaS platforms, and automated data migration workflows.
Quantifiable improvements in genealogy data quality and processing efficiency across real-world datasets.
Professional genealogists typically spend days or weeks manually reviewing and correcting GEDCOM files containing tens of thousands of records. Our automated pipeline reduces that work to minutes while maintaining higher consistency than manual review. The zero-broken-relationships guarantee means organizations can run the pipeline with confidence that family connections will never be severed or corrupted during processing.
Archives and services with large data migration needs and quality requirements. Heritage organizations, national archives, and genealogy societies managing collections of hundreds of thousands of records benefit from automated quality assurance at scale.
Legacy family tree product migrations and platform updates requiring data integrity. When genealogy platforms sunset or users switch services, our pipeline ensures lossless data transfer between systems with full validation reporting.
Premium integrity-verified processing for professional genealogists and academic researchers. Forensic genealogy firms and DNA-based identification services require documented chain-of-custody data handling that our audit trail provides.
gedfix/ package with pyproject.toml packaging
Complete scripts in the scripts/ directory
PROJECT_COMPLETE.md documentation
DATA_INTEGRITY_TOOLS.md methodology
Learn how we can build custom data processing and integrity systems for your organization.