GEDCOM Data Integrity Platform

A production-grade system for cleaning, standardizing, and validating GEDCOM 5.5.1 datasets, with audit trails and a zero-data-loss guarantee: raw genealogy data becomes an import-ready asset while every relationship is preserved.

0% Data Loss · 100% Auditable · CLI + Library · MIT Licensed

Investor Summary

Genealogy data is high-value but messy, fragmented, and expensive to clean. This project delivers an automated, integrity-first pipeline that transforms raw GEDCOM files into standardized, import-ready assets while preserving every relationship.

The $4B+ genealogy market demands professional-grade data processing. By combining a reusable Python library with a CLI and full processing workflow, we enable organizations to perform repeatable, auditable data migrations at scale. As genetic testing services expand and millions of consumers build digital family trees, the demand for reliable data normalization infrastructure is growing faster than existing solutions can serve.

GEDCOM, the universal standard for genealogical data exchange, has been in use since 1984. Decades of files created by different software, in different languages, with different conventions have produced a massive corpus of data that is technically interoperable but practically inconsistent. Our platform addresses this gap with surgical precision, turning fragmented records into research-grade datasets that genealogists and archivists can trust.

Product Capabilities

  • ✓ GEDCOM scanning, issue detection, and prioritized reporting
  • ✓ Date normalization, name and place standardization with AutoFix notes
  • ✓ Relationship-preserving deduplication and safe merges
  • ✓ End-to-end processing workflows with backups and verification
  • ✓ Exported outputs for major genealogy platforms
  • ✓ Actionable review reports and rollback capabilities
  • ✓ DNA match integration and cross-reference validation

Deep Dive: Intelligent Data Normalization

Genealogical data spans centuries of naming conventions, calendar systems, and geographic boundaries. Our normalization engine handles the full complexity of real-world family history data.

Date Parsing and Standardization

Historical records contain dates in dozens of formats: Julian and Gregorian calendars, partial dates with only a year or month, approximate dates marked as "about" or "circa," dual-dated entries from the calendar transition period, and culturally specific conventions like Hebrew or French Republican calendars. Our date normalization engine parses all recognized formats, converts them to a canonical representation, and preserves the original value in an AutoFix note so that researchers can always verify the transformation.
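The parse-and-preserve idea can be sketched in a few lines. This is an illustrative simplification, not the platform's actual parser: it handles only single-date qualifier forms (range forms like `BET … AND` are omitted for brevity), and all names are hypothetical.

```python
# Hypothetical sketch of GEDCOM date normalization: values such as
# "ABT 1850" or "BEF MAR 1901" are parsed into a canonical record while
# the original string is preserved for the AutoFix audit note.

GEDCOM_MONTHS = {m: i + 1 for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"])}

QUALIFIERS = {"ABT", "CAL", "EST", "BEF", "AFT"}  # about, calculated, ...

def normalize_date(raw: str) -> dict:
    """Return a canonical date record plus the untouched original value."""
    tokens = raw.strip().upper().split()
    qualifier = None
    if tokens and tokens[0] in QUALIFIERS:
        qualifier = tokens.pop(0)
    day = month = year = None
    for tok in tokens:
        if tok in GEDCOM_MONTHS:
            month = GEDCOM_MONTHS[tok]
        elif tok.isdigit() and len(tok) == 4:
            year = int(tok)
        elif tok.isdigit():
            day = int(tok)
    return {
        "original": raw,          # preserved for the AutoFix note
        "qualifier": qualifier,   # e.g. ABT ("about"), BEF ("before")
        "year": year, "month": month, "day": day,
    }
```

The key design point is that `original` travels with the canonical value, so a researcher can always audit what the transformation did.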

Name and Place Resolution

Names in genealogical records vary wildly due to spelling changes, transliteration differences, and cultural naming patterns. The platform uses fuzzy matching algorithms powered by rapidfuzz to identify potential duplicates across variant spellings, maiden versus married names, and abbreviated forms. Place names receive similar treatment, normalizing historical place names to modern geographic identifiers while preserving original locality descriptions for archival accuracy.
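The token-order-insensitive comparison behind variant-spelling detection can be illustrated as follows. The platform itself uses rapidfuzz for this; the sketch below uses the standard library's `difflib` only so the snippet is self-contained, and the function name is hypothetical.

```python
from difflib import SequenceMatcher

# Illustrative sketch only (the platform uses rapidfuzz for this step).
# Tokens are lowercased and sorted first, so "Smith, John" and
# "John Smith" compare as identical despite the different ordering.

def name_similarity(a: str, b: str) -> float:
    """Score two personal names on a 0.0-1.0 scale, ignoring token order."""
    norm = lambda s: " ".join(sorted(s.lower().replace(",", " ").split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()
```

In practice, pairs scoring above a chosen threshold would be surfaced as duplicate candidates for review rather than merged outright.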


Processing Pipeline


Scanning and Detection

Comprehensive issue detection with prioritized reporting. Identifies date inconsistencies, name variations, duplicate records, and relationship anomalies across the entire dataset.

Standardization

Automated cleaning with AutoFix notes and audit trails. Date normalization, name standardization, and place formatting with full traceability of every transformation applied.

Verification and Export

Integrity verification with rollback capabilities. Export to major genealogy platforms with confidence in data quality and complete processing documentation.

Implementation Details

Built as both a reusable Python library and a command-line interface, the platform integrates into existing genealogy workflows or operates as a standalone processing engine.
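The dual library/CLI shape described above can be sketched like this. The real project builds its CLI with click; argparse is used here only to keep the example dependency-free, and every function and flag name is illustrative rather than the project's actual API.

```python
import argparse

def process_file(path: str, fix: bool = False) -> dict:
    """Library entry point: scan (and optionally fix) one GEDCOM file.
    A real implementation would parse and normalize the file here."""
    return {"path": path, "mode": "fix" if fix else "scan"}

def main(argv=None):
    """CLI entry point: thin wrapper over the same library function."""
    parser = argparse.ArgumentParser(prog="gedfix")
    parser.add_argument("gedcom_file")
    parser.add_argument("--fix", action="store_true",
                        help="apply AutoFix transformations")
    args = parser.parse_args(argv)
    return process_file(args.gedcom_file, fix=args.fix)
```

Keeping the CLI as a thin wrapper over the library function is what makes the same code embeddable in larger systems and automated workflows.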


Relationship-Preserving Deduplication

Merging duplicate records in genealogy data is uniquely challenging because every individual exists within a web of family relationships. Naively merging two records that represent the same person can orphan children, create circular ancestry loops, or sever spousal connections. Our deduplication engine analyzes the full relationship graph before proposing any merge, ensuring that parent-child links, marriage records, and sibling groups remain intact throughout the process. Merge candidates are scored by confidence level, and low-confidence matches are flagged for human review rather than processed automatically.
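One of the graph checks described above can be sketched briefly: before any merge, verify that neither record appears in the other's ancestry, since merging an ancestor into a descendant would create a circular loop. The names and data structures here are hypothetical.

```python
# Illustrative merge-safety check over a parent map {child_id: [parent_id, ...]}.

def ancestors(person_id, parents):
    """Collect all ancestor ids reachable through the parent map."""
    seen, stack = set(), list(parents.get(person_id, []))
    while stack:
        pid = stack.pop()
        if pid not in seen:
            seen.add(pid)
            stack.extend(parents.get(pid, []))
    return seen

def merge_is_safe(a, b, parents):
    """A merge is unsafe if either record lies in the other's ancestry."""
    return a not in ancestors(b, parents) and b not in ancestors(a, parents)
```

A full implementation would run similar checks for spousal and sibling links before a merge is proposed, and attach a confidence score to each surviving candidate.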

Backup and Rollback Architecture

Every processing run creates a timestamped backup of the original dataset before any modifications occur. The system maintains a complete change log that records every transformation with before-and-after values, enabling selective rollback of individual changes or full restoration to any prior state. This architecture gives researchers the confidence to run aggressive cleaning operations knowing that no information can be permanently lost.
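The change-log mechanism can be sketched as a list of before/after entries replayed in reverse. The structure below is hypothetical, chosen only to show why reverse-order replay restores any prior state.

```python
# Minimal sketch of a reversible change log: each transformation records
# its before/after values, so changes can be undone newest-first.

class ChangeLog:
    def __init__(self):
        self.entries = []

    def record(self, record_id, field, before, after):
        self.entries.append(
            {"id": record_id, "field": field,
             "before": before, "after": after})

    def rollback(self, dataset):
        """Undo all logged changes, newest first."""
        for entry in reversed(self.entries):
            dataset[entry["id"]][entry["field"]] = entry["before"]
        self.entries.clear()
```

Replaying newest-first matters when one field is transformed more than once: the final `before` restored is the value the field held at the start of the run.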

Technology Stack

Python 3.11 · ged4py · click (CLI framework) · rapidfuzz · python-dateutil · GEDCOM 5.5.1

Differentiation and Moat

Integrity-First Design

Explicit verification and rollback tooling ensures zero data loss during processing. Every transformation is reversible and documented.

Auditable Processing

Documented change trails provide complete transparency for every modification, meeting archival standards for provenance tracking.

Embeddable Architecture

Packaged CLI and library can be integrated into larger systems, SaaS platforms, and automated data migration workflows.

Results and Impact

Quantifiable improvements in genealogy data quality and processing efficiency across real-world datasets.

  • 95% Issues Auto-Resolved
  • 50K+ Records Processed
  • 10x Faster Than Manual
  • 0 Broken Relationships

Professional genealogists typically spend days or weeks manually reviewing and correcting GEDCOM files containing tens of thousands of records. Our automated pipeline reduces that work to minutes while maintaining higher consistency than manual review. The zero-broken-relationships guarantee means organizations can run the pipeline with confidence that family connections will never be severed or corrupted during processing.


Commercial Use Cases

Genealogy Services

Archives and services with large data migration needs and quality requirements. Heritage organizations, national archives, and genealogy societies managing collections of hundreds of thousands of records benefit from automated quality assurance at scale.

Platform Migrations

Legacy family tree product migrations and platform updates requiring data integrity. When genealogy platforms sunset or users switch services, our pipeline ensures lossless data transfer between systems with full validation reporting.

Professional Researchers

Premium integrity-verified processing for professional genealogists and academic researchers. Forensic genealogy firms and DNA-based identification services require documented chain-of-custody data handling that our audit trail provides.

Evidence of Execution

Python Library

gedfix/ with pyproject.toml packaging

Processing Workflows

Complete scripts in scripts/ directory

Quality Metrics

PROJECT_COMPLETE.md documentation

Safety Tooling

DATA_INTEGRITY_TOOLS.md methodology

Interested in This Solution?

Learn how we can build custom data processing and integrity systems for your organization.

Schedule a Demo · View All Projects