Working paper, prepared for peer review. Version 1.0, 2026-06-02.
Abstract
Commercial business-to-business (B2B) "sales intelligence" datasets (ZoomInfo, Apollo, Cognism, Dun & Bradstreet, People Data Labs and others) are built largely from web crawling, email-signature mining, and opt-in contributory networks, achieving broad person-level coverage at the cost of well-documented data-protection exposure. We ask a complementary, under-studied question: how far can a legally clean, commercially redistributable global B2B dataset be built from free and open-data / open-source (OSS) sources alone, attribute by attribute and jurisdiction by jurisdiction? We construct a standardized, adversarially-verified audit instrument that scores each jurisdiction against a fixed eight-attribute rubric (company name, national identifier, address, industry code, website, phone, email, representative name), and apply it across coverage that is jurisdiction-complete by construction (enumerated from ISO 3166-1, not UN membership), with deep audits of ~90 economies and a universal identity backstop (GLEIF/LEI) spanning all 249 jurisdictions. We find a sharp, consistent structure: company identity and industry classification are broadly free and redistributable; contact channels (phone, email, website) are almost universally absent from free open data, present in only four jurisdictions (Norway, Brazil, Mexico, and partially Denmark/Romania); and representative names, where free, are personal data gated by GDPR-class regimes or, in two cases, by sanctions. We show that the binding constraint is licence, not availability: many free-to-view registries explicitly forbid redistribution.
We benchmark these findings against the academic data-fusion and record-linkage literature and against the disclosed sourcing methods of proprietary vendors, and we measure our own curated "Gold" dataset, finding it to be the inverse of the commercial profile: broad global company identity (10.6M entities) but narrow, US-biased, single-source person-level contactability (covering 1.9% of companies). We conclude with a per-attribute resale-legality matrix and argue that a legally-grounded global B2B contact dataset is achievable only by pairing an open-registry identity spine with a downstream enrichment layer, not by registry harvesting alone.
Keywords: open government data, business registries, B2B data, firmographics, entity resolution, data fusion, data quality, GDPR, legitimate interest, data redistribution licensing.
1. Introduction
1.1 Motivation
A modern B2B go-to-market motion depends on a dataset that maps, for each target company, (a) its identity and firmographics (name, registration number, industry, size, location) and (b) the contactable individuals within it (name, title, work email, phone). The commercial market for such data is large and mature, dominated by vendors whose coverage is impressive but whose sourcing methods (opt-in "community" contact-syncing from users' inboxes, email-signature mining, large-scale web scraping, and ML inference of email addresses) have repeatedly attracted regulatory and litigation attention (Section 5).
An alternative supply exists in plain sight: the official company registers that almost every jurisdiction maintains, increasingly published as open government data, plus the Global Legal Entity Identifier system. The central question of this paper is whether these open sources can underpin a legally clean, commercially redistributable global B2B dataset, and, critically, for which attributes and which jurisdictions that is true. The answer turns out to be highly structured and, to our knowledge, has not previously been mapped at this granularity with adversarial verification of every claim.
1.2 Contributions
- A standardized, adversarially-verified audit instrument that scores any jurisdiction on a fixed 8-attribute rubric and emits a directly comparable structured matrix (Section 2).
- Jurisdiction-complete coverage by construction: enumeration from ISO 3166-1 (249 codes) rather than UN-193, closing the systematic blind spot around SARs (Hong Kong, Macau), Taiwan, and the offshore corporate-registry centres (Cayman, BVI, Bermuda, Jersey, Guernsey, Isle of Man, Gibraltar), with a universal identity backstop covering every jurisdiction (Sections 2.3, 3.5).
- A global, per-attribute findings map for ~90 deep-audited economies (Section 3), with the key result that the open-data frontier is sharply attribute-dependent: identity is free, contact is not.
- Benchmarks against (a) the disclosed sourcing methods and coverage of eight proprietary vendors (Section 5) and (b) the academic record-linkage, truth-discovery and data-quality literature (Section 6).
- An empirical measurement of a real curated "Gold" dataset (10.6M companies, 1.3M people) demonstrating the inverse-profile phenomenon and the survivorship trap in naïve fill-rate reporting (Section 7).
- A per-attribute resale-legality matrix distinguishing company data, personal data, and sanctions-blocked data (Section 8), and a build/buy/skip recommendation (Section 9).
1.3 Scope and definitions
We treat a jurisdiction as a distinct company-registry boundary (ISO 3166-1 alpha-2). For each attribute we record: present/absent/paid; the best free source and its access mode (REST API / bulk download / web-only scrape); authentication and cost; the licence governing commercial reuse and redistribution; and the data-protection regime. We deliberately separate availability (can the value be obtained for free?) from redistributability (may a commercial product store and resell it?), because the two diverge sharply in practice.
2. Method
2.1 The eight-attribute rubric
Every jurisdiction is scored on: company_name, national_id (registration/tax number), address, industry_code, website, phone, email, representative_name (director/officer). The first four are firmographic identity; the last four are the contactability + accountability layer that determines whether a record is actionable for outreach. Each attribute receives a structured record: {present, bestSource, sourceType, access, auth, freeOrPaid, license, redistributionVerdict, fillRate, confidence, citations}.
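As an illustration only, the per-attribute record could be typed as below. The field names follow the record listed above; the value types, the optional defaults and the `missing_attributes` helper are our assumptions for the sketch, not a description of the production schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# The fixed eight-attribute rubric from Section 2.1.
RUBRIC = (
    "company_name", "national_id", "address", "industry_code",
    "website", "phone", "email", "representative_name",
)

@dataclass
class AttributeRecord:
    # Value types below are assumed for illustration.
    present: Literal["present", "absent", "paid"]
    bestSource: Optional[str] = None
    sourceType: Optional[str] = None
    access: Optional[str] = None   # e.g. "REST API", "bulk download", "web-only scrape"
    auth: Optional[str] = None
    freeOrPaid: Optional[str] = None
    license: Optional[str] = None
    redistributionVerdict: Optional[str] = None
    fillRate: Optional[float] = None
    confidence: Optional[float] = None
    citations: list[str] = field(default_factory=list)

def missing_attributes(audit: dict[str, AttributeRecord]) -> list[str]:
    """Return rubric attributes that an audit has not yet scored (hypothetical helper)."""
    return [name for name in RUBRIC if name not in audit]

# A partially filled audit: only one of the eight attributes has been scored.
partial_audit = {
    "company_name": AttributeRecord(
        present="present",
        bestSource="Brønnøysund Enhetsregisteret",
        access="REST API",
        license="NLOD 2.0",
    ),
}
unscored = missing_attributes(partial_audit)
```

Keeping the rubric as a fixed tuple means an incomplete audit is detectable mechanically rather than by inspection.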
2.2 Adversarial-verification harness
Audits are produced by a multi-agent research harness: a scoping stage decomposes the question into search angles; parallel web-search agents retrieve candidate sources; fetch-and-extract agents pull falsifiable claims (each with a direct quote and an attribute tag) from official sources; and a three-vote adversarial verification stage attempts to refute each claim (a claim survives only with a quorum of valid votes and fewer than two refutations). A synthesis stage merges surviving claims into the structured matrix. This design privileges precision over recall: claims that cannot be substantiated against primary sources are dropped rather than reported, and "what was NOT found" is recorded explicitly.
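The survival rule can be sketched as follows. The text fixes three votes and "fewer than two refutations"; the quorum size of two valid votes and the `Vote` categories are assumptions made for this sketch.

```python
from enum import Enum

class Vote(Enum):
    SUPPORT = "support"
    REFUTE = "refute"
    INVALID = "invalid"   # assumed category: the verifier returned nothing usable

QUORUM = 2            # assumption: minimum number of valid votes out of three
MAX_REFUTATIONS = 1   # "fewer than two refutations"

def claim_survives(votes: list[Vote]) -> bool:
    """A claim survives only with a quorum of valid votes and fewer than two refutations."""
    valid = [v for v in votes if v is not Vote.INVALID]
    refutations = sum(1 for v in valid if v is Vote.REFUTE)
    return len(valid) >= QUORUM and refutations <= MAX_REFUTATIONS

unanimous = claim_survives([Vote.SUPPORT, Vote.SUPPORT, Vote.SUPPORT])
one_refutation = claim_survives([Vote.SUPPORT, Vote.SUPPORT, Vote.REFUTE])
two_refutations = claim_survives([Vote.SUPPORT, Vote.REFUTE, Vote.REFUTE])
no_quorum = claim_survives([Vote.SUPPORT, Vote.INVALID, Vote.INVALID])
```

Under these assumptions a claim with too few valid votes is dropped even if nobody refuted it, which is the precision-over-recall behaviour described above.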
2.3 Jurisdiction completeness ("miss none")
Coverage is enumerated from ISO 3166-1 (249 codes) minus an explicit skip-list (Holy See; uninhabited/territorial codes). This closes by construction the common error of working from the 193 UN member states, which omits Hong Kong, Macau, Taiwan, the Crown Dependencies and British Overseas Territories, the Dutch Caribbean, US territories, and Kosovo. A universal identity source (GLEIF/LEI, Section 3.5) provides at minimum company identity for every code, and the audit programme records coverage per ISO-2 so that any jurisdiction lacking a bespoke audit is visible, not silently omitted.
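A minimal sketch of the enumeration logic, using a nine-code illustrative subset rather than the full 249-code list. The audit marks in `BESPOKE` are placeholders, and treating `AQ` as a skip-list entry is an assumed example; only the Holy See is named in the text.

```python
# Illustrative subset only; a real run would load all 249 ISO 3166-1 alpha-2 codes.
ISO_CODES = {"US", "NO", "HK", "MO", "TW", "KY", "JE", "VA", "AQ"}
UN_MEMBERS = {"US", "NO"}          # the members within this subset
SKIP_LIST = {"VA", "AQ"}           # Holy See; AQ is an assumed uninhabited-code example
BESPOKE = {"US": "D", "NO": "D", "HK": "D"}   # placeholder audit marks

def coverage_ledger(iso_codes: set[str], skip_list: set[str],
                    bespoke: dict[str, str]) -> dict[str, str]:
    """Every in-scope code gets a mark; anything without a bespoke audit defaults to G."""
    in_scope = sorted(iso_codes - skip_list)
    return {code: bespoke.get(code, "G") for code in in_scope}

ledger = coverage_ledger(ISO_CODES, SKIP_LIST, BESPOKE)

# Codes that a UN-membership enumeration would have missed entirely.
un_blind_spot = sorted((ISO_CODES - SKIP_LIST) - UN_MEMBERS)
```

Because the ledger is built from the code list rather than from the audits, a jurisdiction without a bespoke audit appears as G instead of disappearing.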
2.4 Measurement limitation discovered in-method (reported for transparency)
The verification harness exhibits a concurrency-correlated failure mode: when many research jobs run simultaneously, the structured-output tool call intermittently fails across a sustained window, and because immediate retries re-enter the same window they cannot always recover, causing genuine claims to be dropped below the survival quorum (manifesting as a degraded or empty report rather than a false-positive one). We instrumented the harness to surface this (a hardFails counter) and established an operating rule of ≤2 concurrent jobs, under which failures fall to zero. This is a property of the measurement apparatus, not of the sources, and it biases toward under-reporting (a country may have a better source than we credited), never toward false claims. Affected runs were re-executed at low concurrency.
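One way to enforce the ≤2 operating rule is a semaphore around each research job, with failures counted rather than retried immediately. This is a sketch, not the harness itself: `run_audits` and the `audit_job` callable are hypothetical names, and only the `hardFails` counter name comes from the text.

```python
import asyncio

MAX_CONCURRENT_JOBS = 2   # operating rule from Section 2.4

async def run_audits(jurisdictions, audit_job, max_concurrent=MAX_CONCURRENT_JOBS):
    """Run one audit job per jurisdiction, never more than max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"hardFails": 0}

    async def guarded(code):
        async with semaphore:
            try:
                return code, await audit_job(code)
            except Exception:
                # Surface the failure instead of silently emitting an empty report.
                stats["hardFails"] += 1
                return code, None

    pairs = await asyncio.gather(*(guarded(code) for code in jurisdictions))
    return dict(pairs), stats
```

Jurisdictions whose result is `None` can then be queued for re-execution later, outside the failure window, rather than retried inside it.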
2.5 Source-reliability scoring (relation to the platform)
Each surviving claim is tagged with a source-reliability grade and credibility, mirroring the production "truth layer" that ingests these sources into a Bronze→Silver→Gold pipeline with per-attribute champion–challenger source scoring (Section 6.4). Primary government registries are graded most reliable; commercial aggregators and scrapers least.
3. Global findings
3.1 The headline structure
Across ~90 deep-audited jurisdictions the open-data frontier is sharply attribute-dependent and remarkably consistent:
| Attribute | Free & redistributable in open data? | Typical status |
|---|---|---|
| Company name | Broadly yes | Free in the large majority of registries; universal via GLEIF |
| National ID (reg/VAT no.) | Broadly yes | Free wherever the registry is open; cross-referenced by GLEIF registeredAs |
| Address | Mostly yes | Free in most open registries (sometimes partial, e.g. postcode-region only) |
| Industry code | Often yes | NACE/SIC/SSIC/CIIU/SCIAN free in EU/Nordics/LATAM open registries; absent in several common-law registries |
| Website | Rarely | Free only in Norway, Mexico; absent almost everywhere else |
| Phone | Rarely | Free only in Norway, Brazil, Mexico, Denmark (partial), Romania-ANAF (partial) |
| Email | Very rarely | Free only in Norway, Brazil, Mexico; PEC-only in Italy (scrape, restricted) |
| Representative name | Sometimes, but personal data | Free in DE/AR/CZ/GR/TW/SX/MO/RU(sanctioned)/BR; paid in UK/AU/HK/MY; GDPR-class gating throughout |
The practical consequence: a company-identity spine is buildable for free almost everywhere; a contactability layer is not. Only a handful of jurisdictions publish the "contactability triple" (phone + email + website) in free open data.
3.2 The four open-contactability jurisdictions
The exceptions are decisive for strategy because they are so few:
- Norway, Brønnøysund Enhetsregisteret: no-auth REST API (data.brreg.no/enhetsregisteret/api) plus nightly JSON/CSV/XLSX bulk, carrying name, org-number, VAT status, address, NACE, website, phone, mobile and email, under the permissive NLOD 2.0 licence (commercial reuse + redistribution, attribution only). The single most complete free B2B source identified.
- Brazil, CNPJ (Receita Federal, via the open minhareceita.org and bulk distributions): carries corporate phone and email plus the quadro de sócios (partner/director names) and CNAE industry, free and no-auth.
- Mexico, INEGI DENUE: a free REST API (free emailed token) returning name, SCIAN industry, address, telephone, email and website under the INEGI "Términos de Libre Uso" (commercial + redistribution). Contact fields are sparsely populated but schema-present; representative names are deliberately excluded (LFPDPPP).
- Denmark (partial), CVR (CC-BY 4.0) carries phone; Romania (partial), the free ANAF VAT-validation API returns a corporate phone alongside CAEN.
Everywhere else, contact channels must be manufactured downstream (Section 9).
3.3 Identity is broadly free, and broadly redistributable in the right places
Openly licensed, commercially-redistributable identity registries include: Singapore ACRA (OGL), Canada Corporations Canada (OGL-Canada), France SIRENE (Licence Ouverte 2.0), Germany via OffeneRegister (CC-BY-4.0, incl. officer names), Denmark/Sweden/Finland (CC-BY; Sweden fee-free since Feb 2025 under the EU High-Value-Datasets regulation), Norway (NLOD), Poland KRS (CC0, unauthenticated) and REGON (CC-BY), Greece GEMI (ODC-BY, incl. director names), Argentina (CC-BY, incl. director names) and Chile (CC-BY), Australia ABR (CC-BY, identity only), Ecuador SRI and (with caveats) Israel. The universal backstop, GLEIF, is CC0 (public domain). Industry codes ride along free in most of these.
3.4 Where the registry is closed, paid, or non-redistributable
A large set of economically significant jurisdictions do not offer a free, redistributable firmographic feed:
- Closed / paid registries: Italy (Registro Imprese, OpenCorporates Open Company Data Index 10/100; only OSS validators and scrape-only PEC email are free), Austria (Firmenbuch, per-extract fees), Indonesia (AHU, PNBP per-profile fee, PAID_ONLY), Hungary (free web but bulk explicitly barred, programmatic access paid), Malaysia (SSM directors paid).
- Free to view but non-redistributable (licence is the binding constraint): Nigeria CAC (Terms of Use explicitly forbid resale/redistribution), South Africa CIPC (2014 T&C re-use restrictions; OCDI 20/100), Jersey (forbids reselling registry data), and the offshore centres generally (Section 3.6). United States: there is no federal company registry; SEC EDGAR (excellent free JSON API) covers only SEC filers, state registries vary widely, and Delaware, the dominant incorporation state, is web-form-only and prohibits data mining.
- Identity-open but contact/owner behind a paywall: Netherlands KVK (free open dataset is deliberately anonymized to SBI + 2-digit postcode; name/address/website require the paid KVK API), Hong Kong (free name+BRN+address; director names paid via ICRIS), Australia (ASIC officer/industry data paid).
3.5 The universal identity backstop (GLEIF/LEI)
GLEIF publishes Legal Entity Identifier reference data for ~3.3M entities across all ISO-3166 jurisdictions, free, no-auth, CC0, via both a searchable REST API and a 3×-daily bulk "Golden Copy." It carries legal name, address, the national registry identifier (registeredAs, which cross-references the per-country registries), legal form and status; it carries no website, phone, email, industry, or director name. GLEIF is therefore the identity floor beneath the whole programme: even where a national registry is closed, paid, or (for territories) overlooked, company identity remains reachable and redistributable. We verified this directly for the offshore centres, e.g. a live query returned 46,122 Cayman Islands entities, confirming that "miss none" for identity is mechanically satisfied.
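A hedged sketch of how the identity fields might be pulled from a GLEIF record. The endpoint, the country filter and the JSON field paths reflect our understanding of the public GLEIF API and should be checked against its current documentation; the sample record is invented and its LEI is a placeholder.

```python
from urllib.parse import urlencode

# Assumed base URL of the public GLEIF LEI-records endpoint.
GLEIF_API = "https://api.gleif.org/api/v1/lei-records"

def country_query_url(iso2: str, page_size: int = 10) -> str:
    """Build a per-jurisdiction query URL (filter name is an assumption to verify)."""
    params = {"filter[entity.legalAddress.country]": iso2, "page[size]": page_size}
    return f"{GLEIF_API}?{urlencode(params)}"

def identity_from_record(record: dict) -> dict:
    """Map one GLEIF record onto the identity attributes of the rubric."""
    entity = record["attributes"]["entity"]
    return {
        "lei": record["attributes"]["lei"],
        "company_name": entity["legalName"]["name"],
        "country": entity["legalAddress"]["country"],
        "national_id": entity.get("registeredAs"),   # join key to the national registry
    }

# Invented sample in the assumed response shape; no network call is made here.
sample_record = {
    "attributes": {
        "lei": "00000000000000000000",
        "entity": {
            "legalName": {"name": "Example Holdings Ltd"},
            "legalAddress": {"country": "KY"},
            "registeredAs": "123456",
        },
    }
}
identity = identity_from_record(sample_record)
cayman_url = country_query_url("KY")
```

The mapping deliberately returns only identity fields, since GLEIF carries no website, phone, email, industry or director name.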
3.6 The UN-193 blind spot and the offshore centres
Working from UN membership would have omitted jurisdictions that matter disproportionately for B2B because of holding-company and SPV density. We audited them explicitly:
- Hong Kong and Taiwan are useful free identity sources (Taiwan's GCIS OData API even carries representative name + industry; both gated, IP-whitelist / custom licence, hence BUILD-GATED).
- The offshore corporate-registry centres (Cayman, BVI, Bermuda, Jersey, Guernsey, Isle of Man, Gibraltar) are uniformly restricted or paid, with no free, programmatic, redistributable feed; beneficial-ownership registers are non-public everywhere except Gibraltar. For these, GLEIF identity is the practical coverage and a bespoke connector is not warranted.
- Macau and Sint Maarten expose director names in free search (constrained by PDPA/local DP); Greenland has no separate register (it folds into Denmark's CVR); Kosovo and Puerto Rico offer free interactive search but no open dataset.
3.7 Sanctions as a distinct gate
Russia's EGRUL is famously open and free including director names, and Belarus similarly, yet both are sanctions-blocked: commercial ingestion/resale by a Western product implicates OFAC/EU restrictive measures, and Russian Decrees 400/729 (2022) deliberately degrade the registry data of sanctioned entities. We therefore treat "open + free" and "usable" as orthogonal for these jurisdictions. Relatedly, the widely-used OpenSanctions aggregation mirrors (incl. a 52M-entity Russian EGRUL dump) are licensed CC-BY-NC and so are not commercially usable: a reminder that even transparency-oriented aggregations are frequently non-commercial.
4. Attribute deep-dives
Company name & national identifier. The "spine" attributes. Free in the large majority of audited registries and universal via GLEIF; the national identifier (CRN/VAT/UEN/CUIT/CNPJ/SIREN/BIN/统一编号…) is the natural join key, and GLEIF's registeredAs field links the LEI to the national identifier, enabling cross-source resolution. Redistributability follows the licence (Section 8), but for company identity it is permissive in the open-registry set (CC0/CC-BY/OGL/NLOD/Licence-Ouverte).
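The join described above can be sketched as a lookup on (country, normalised identifier). The normalisation rule (strip punctuation and spaces, upper-case) and the row shapes are assumptions for illustration; real identifiers may need per-country handling such as check digits or prefixes.

```python
import re

def normalise_id(raw: str) -> str:
    """Assumed normalisation: keep only letters and digits, upper-cased."""
    return re.sub(r"[^0-9A-Za-z]", "", raw).upper()

def join_on_national_id(gleif_rows: list[dict], registry_rows: list[dict]) -> list[dict]:
    """Attach the LEI to each registry row whose national identifier matches registeredAs."""
    index = {
        (row["country"], normalise_id(row["national_id"])): row
        for row in registry_rows
    }
    joined = []
    for g in gleif_rows:
        if not g.get("national_id"):
            continue   # no registeredAs value, nothing to join on
        match = index.get((g["country"], normalise_id(g["national_id"])))
        if match is not None:
            joined.append({**match, "lei": g["lei"]})
    return joined

# Invented rows: the same hypothetical Norwegian entity, formatted differently per source.
gleif_rows = [{"lei": "00000000000000000000", "country": "NO", "national_id": "999 999 999"}]
registry_rows = [{"country": "NO", "national_id": "999999999", "industry_code": "62.010"}]
linked = join_on_national_id(gleif_rows, registry_rows)
```

Keying on the country as well as the identifier avoids accidental matches between registries that reuse the same number ranges.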
Address. Free in most open registries, though sometimes degraded for privacy/competition reasons: the Netherlands' free open dataset gives only a 2-digit postcode region (full address is paid); Australia's ABR gives state + postcode only. Where present it is the registered/official address, not necessarily the trading address.
Industry code. Free wherever the registry is open, under national taxonomies (NACE and its derivatives in the EU/EEA; SSIC Singapore; CIIU Ecuador/Colombia; SCIAN/SIC Mexico; SNI Sweden; CNAE Brazil/Romania; PKD Poland; KBLI Indonesia as a taxonomy only). Notably absent from several common-law identity registries (UK Companies House core profile, Australia ABR free tier, Hong Kong CR, US state registers, Canada federal bulk), which carry legal form but not an activity code.
Website. The rarest firmographic. Free only in Norway (hjemmeside) and Mexico DENUE (Sitio_internet). Absent from essentially every other free registry (it is not a field most registrars collect). This is a primary driver of the need for enrichment, since the company domain is the keystone for deriving work emails.
Phone & email. Free, as corporate contact points, only in Norway, Brazil CNPJ, and Mexico DENUE; Denmark CVR carries phone; Romania's ANAF API returns a corporate phone. Italy exposes only certified PEC email via INI-PEC (scrape-only, GDPR-restricted). Everywhere else these fields are simply not in the registry schema. Where a phone/email is tied to a named individual it becomes personal data (Section 8).
Representative / director name. The highest-value accountability attribute and the most legally fraught, because it is always personal data. Free in: Germany (OffeneRegister, CC-BY, officer names), Argentina (IGJ Autoridades, CC-BY), Czechia (ARES VR), Greece (GEMI OpenData), Taiwan (GCIS), Sint Maarten, Macau, Brazil (QSA), and Russia (EGRUL, sanctions-blocked). Paid in the UK (Companies House officers behind the OGL's personal-data exclusion + LIA requirement), Australia (ASIC), Hong Kong (ICRIS), Malaysia (SSM). Anonymized for privacy in Poland's open KRS API (initials + first PESEL digit). Subject to GDPR-class regimes throughout (Section 8).
OSS tooling observed. Beyond raw HTTP/bulk ingestion, the audits surfaced reusable open-source components: python-stdnum and python-codicefiscale (offline validators for VAT/fiscal codes, incl. Italian Partita IVA / Codice Fiscale), CKAN/Socrata clients for the many data.gov.* portals (Singapore, Israel, Ecuador, Colombia, Thailand, Argentina), and numerous national-registry client wrappers (Norway brreg, Denmark cvrapi.dk clients, Czech ARES). No OSS tool substitutes for a missing field; these tools only accelerate ingestion of fields the source already exposes.
5. Benchmark: how proprietary B2B datasets are sourced (and at what coverage)
The disclosed methods converge on a multi-source model: ML-driven web crawling of company domains/filings/news/job-postings for firmographics, contributory networks that harvest contact data from consenting users' inboxes (email headers, signature blocks, address books), licensed third-party feeds, manual research, and ML inference of email patterns.
| Provider | Primary sourcing method | Disclosed scale | Accuracy claim (vendor-asserted) | GDPR posture |
|---|---|---|---|---|
| ZoomInfo | Contributory network (Community Edition mines email signatures/headers/contact books of consenting users) + daily ML crawl of 28M+ domains + licensed partners + in-house research | ~100M+ contacts | High (vendor-asserted; not independently audited) | Legitimate interest (Art. 6(1)(f)); settled a US right-of-publicity class action (Ramos/Martinez) for $29.55M (CA/IL/IN/NV) without admitting wrongdoing; separate email-scraping litigation (Wysocki) |
| Cognism | Community + public + ML + partnerships; "Diamond Data" = human/phone-verified mobiles | ~440M records | 85% general; 98% on the ~10M phone-verified Diamond subset (~2.3% of the DB) | Art. 6(1)(f) + an Art. 14 notified-database model + DNC screening across 13+ jurisdictions |
| People Data Labs | Aggregator/licensing posture, LinkedIn-heavy | ~2.46B person records (marketed 3B+) | Low actionable fill: work email ~3.8%, mobile ~20.2% against the full base; degraded outside NA/W-Europe | Aggregator; reseller-dependent |
| Dun & Bradstreet | Global firmographic compilation + DUNS numbering + partnerships | Hundreds of millions of companies | Firmographic-oriented; long-established | Compliance-oriented; B2B framing |
| Clearbit (HubSpot Breeze) | Web crawl + logo/domain enrichment | Domain-keyed enrichment | Firmographic/technographic | Vendor terms |
| Lusha / others | Contributory + crowdsourced contact | Tens–hundreds of millions | Vendor-asserted | EU exposure varies |
Two findings matter for our comparison. First, headline "billions of records" mask low actionable fill rates per attribute (PDL's 3.8% work-email figure is illustrative): coverage and usable contactability are very different quantities. Second, the broad person coverage is achieved precisely through the contributory/scraping methods that carry the documented legal exposure, exactly the exposure an open-data strategy seeks to avoid. All accuracy figures here are vendor-asserted and not independently audited, and should be read as such.
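The gap between headline records and actionable records is simple arithmetic on the figures in the table. The sketch below assumes the 3.8% work-email fill applies to the ~2.46B record base, which is how we read "against the full base".

```python
def actionable_records(total_records: float, fill_rate: float) -> float:
    """Records that actually carry the attribute = headline records x per-attribute fill."""
    return total_records * fill_rate

# People Data Labs: ~2.46B person records at ~3.8% work-email fill (assumed base).
pdl_work_emails = actionable_records(2.46e9, 0.038)

# Cognism: the ~10M phone-verified Diamond subset as a share of ~440M records.
cognism_diamond_share = 10e6 / 440e6

print(f"PDL records with a work email: ~{pdl_work_emails / 1e6:.1f}M")
print(f"Cognism Diamond share of database: ~{cognism_diamond_share:.1%}")
```

On these inputs a multi-billion-record headline corresponds to roughly 93.5M records with a work email, and the Diamond subset to about 2.3% of the database, matching the share quoted in the table.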
6. Benchmark: the academic literature
6.1 Entity resolution / record linkage
The arc runs from the Fellegi–Sunter probabilistic record-linkage model, through Magellan (Konda et al., PVLDB 9(12), 2016, reframes entity matching as a whole-pipeline systems problem with a development stage on samples and a production stage at scale, the architectural template for a Silver→Gold materializer), DeepMatcher (Mudgal et al., SIGMOD 2018, deep learning does not beat learning-based matching on structured EM but wins materially on textual and dirty EM), to Ditto (Li et al., PVLDB 14(1), 2020/21, fine-tuning pre-trained transformers as sequence-pair classification, +29–31% F1 over prior SOTA and 96.5% F1 on a real 789K×412K company-matching task). The WDC Products benchmark (Peeters et al., EDBT 2024) shows that all state-of-the-art matchers degrade on unseen entities, directly relevant to a champion–challenger leaderboard, because matcher/source scores measured on seen records do not transfer to the long tail.
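For readers unfamiliar with the starting point of that arc, the Fellegi–Sunter score for a candidate pair is a sum of per-field log-likelihood ratios, assuming fields agree or disagree independently. The sketch below uses invented m- and u-probabilities; it is a textbook illustration, not the matcher used in the platform.

```python
import math

def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood ratio for one field.

    m = P(field agrees | records are a true match)
    u = P(field agrees | records are not a match)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_score(comparison: dict[str, bool], params: dict[str, tuple[float, float]]) -> float:
    """Sum field weights under the conditional-independence assumption."""
    return sum(field_weight(comparison[name], m, u) for name, (m, u) in params.items())

# Hypothetical parameters: a national identifier is far more discriminating than a name.
PARAMS = {
    "national_id": (0.95, 0.001),
    "company_name": (0.90, 0.05),
    "address": (0.80, 0.10),
}

all_agree = match_score({"national_id": True, "company_name": True, "address": True}, PARAMS)
all_disagree = match_score({"national_id": False, "company_name": False, "address": False}, PARAMS)
```

Pairs scoring above an upper threshold are declared matches, those below a lower threshold non-matches, and the band in between is left for clerical review.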
6.2 Truth discovery / data fusion
TruthFinder (Yin, Han & Yu, KDD 2007) jointly estimates source trustworthiness and fact confidence by mutual reinforcement, reaching ~95% accuracy vs ~88–95% for majority voting on a book-author benchmark. Crucially, Dong, Berti-Équille & Srivastava (PVLDB 2009) show that source copying is widespread and that naïve majority voting is actively harmful when sources copy, with copy/quality-aware fusion lifting precision (≈.71→.89). This is the formal justification for our per-attribute champion–challenger design, which discounts correlated/copied sources rather than treating each as an independent vote. The Latent Truth Model (Zhao et al., PVLDB 2012) adds two-sided source quality (sensitivity + specificity) and multi-valued attributes; Knowledge Vault (Dong et al., KDD 2014) is the canonical learned knowledge-fusion formulation producing calibrated per-fact probabilities.
6.3 Data-quality dimensions and their measurement
[Integrates the dedicated data-quality-frameworks research pass, Wang & Strong (1996) 4-category/15-dimension framework; Pipino, Lee & Wang (2002) objective vs subjective assessment and functional forms; Batini et al. (2009, ACM CSUR) methodologies; ISO 8000 / DAMA-DMBOK, with the canonical definitions and measurement formulas for completeness (fill rate per attribute), accuracy, consistency, timeliness/currency (as a function of age vs volatility) and uniqueness, mapped to company/contact data. To be completed on return of the dedicated research run; see Section 7 for the empirical application of these definitions to our Gold dataset.]
6.4 Open vs commercial business-registry quality
[Integrates the OpenCorporates Open Company Data Index (0–100 scoring of machine-readability, bulk availability, open licence and field granularity; e.g. Italy 10/100, South Africa 20/100 as observed in our audits) and the open-government-data-for-corporate-transparency literature. To be completed on return of the dedicated research run.]
6.5 Mapping to a Bronze→Silver→Gold truth layer
The literature maps cleanly onto the production architecture: Bronze = raw per-source records (the connectors of Section 3); entity resolution (6.1) links Bronze records across sources into entities; truth discovery / fusion (6.2) selects the best value per attribute while estimating per-source reliability, instantiated as a champion–challenger leaderboard that, per Dong et al., must discount copying; Silver = the fused canonical entity with calibrated confidence; Gold = the materialized, attributed record. Data-quality dimensions (6.3) become the objective gates (completeness/fill, freshness, uniqueness) that govern promotion between layers.
7. Our curated "Gold" dataset vs the paid datasets: an empirical measurement
We measured a production Gold layer with read-only queries (no sampling, no mocks). The result is the inverse of the commercial profile and contains an instructive measurement trap.
Company layer (gold_company, 10,620,248 rows): broadly global (United States only ~11%; Brazil, France, Indonesia, Netherlands, Australia, Turkey, Germany, Russia, Singapore, India, Canada, Thailand each >100k) but contact-sparse: website 21.4%, domain 22.6%, industry 33.1%, revenue band 23.6%; industry embeddings backfilled to 84.8%. This is the realistic signature of an open-registry-sourced identity layer (cf. Section 3.3).
Person layer (gold_person, 1,300,233 rows): the naïve headline is "100% email, 100% phone, 100% LinkedIn, 100% title", which would be a survivorship artifact, not a quality result, and we explicitly do not report it as quality. The verified facts:
- The person layer covers only 200,982 distinct companies = 1.9% of the 10.62M company population; 98.1% of companies have zero contactable person. This is the true coverage gap.
- It is 76.8% United States (vs ~11% at the company layer, i.e. ~7× more US-concentrated), and 99.98% single-source (one upstream truth-layer/GraphIQ origin).
- email_quality_score is NULL for all 1.3M rows, the quality scorer never ran, so the "99.7% validated" flag is a coarse boolean and email quality is, in fact, unmeasured. We flag this as a quality-instrumentation gap, not a quality result.
Interpretation. Paid datasets achieve broad person coverage (hundreds of millions to billions of records) via contributory/scraping methods, at partial per-attribute fill and with documented legal exposure. Our dataset is the inverse: broad, global, redistributable company identity but narrow, US-biased, single-source, unscored person-level contactability. Neither registry harvesting (which yields identity, not contact, almost everywhere, Section 3) nor the existing single-source person feed produces broad, legal, global contactability on its own.
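The survivorship trap is a denominator choice, which the following sketch makes explicit using the row counts reported above. The variable names are ours; the figures are those of Section 7.

```python
COMPANIES = 10_620_248            # gold_company rows
PERSON_ROWS = 1_300_233           # gold_person rows
COMPANIES_WITH_PERSON = 200_982   # distinct companies with at least one person row

# Naive view: measured inside gold_person, where every row has an email by construction.
naive_email_fill = PERSON_ROWS / PERSON_ROWS

# Honest view: measured against the full company population.
company_coverage = COMPANIES_WITH_PERSON / COMPANIES
companies_without_contact = 1 - company_coverage

print(f"naive fill inside person layer: {naive_email_fill:.0%}")
print(f"companies with a contactable person: {company_coverage:.1%}")
print(f"companies with none: {companies_without_contact:.1%}")
```

Both numbers are computed from the same tables; only the second answers the question a buyer of the dataset is actually asking.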
8. Per-attribute resale-legality matrix
Combining the per-source licences (Section 3) with the data-protection regimes, the right to resell a curated record is attribute-dependent and basis-dependent:
| Attribute class | Examples | Resale status (synthesised) |
|---|---|---|
| Company identity & firmographics (name, reg/VAT no., address, industry, legal form) | GLEIF (CC0); NO/DK/SE/FI, SG, CA, FR, PL, GR, AR, CL, EC, AU-ABR, DE-company | Redistributable where the source licence permits (CC0 / CC-BY / OGL / NLOD / Licence-Ouverte / ODC-BY). Attribution required for CC-BY/OGL/NLOD. Not personal data. |
| Company contact channels (corporate phone/email/website) | NO, BR, MX, DK(phone), RO-ANAF(phone) | Redistributable under the same open licences as corporate data; but a phone/email tied to a named individual (sole trader, named director) is personal data and shifts to the row below. |
| Representative / director name & personal contact | DE, AR, CZ, GR, TW, BR-QSA, UK(paid), HK(paid) | Personal data under a GDPR-class regime. Redistribution for B2B outreach requires a documented legitimate-interest basis + LIA (UK GDPR Art. 6(1)(f); EU GDPR; LGPD; PDPA; PIPL). The source's open licence does not grant a personal-data basis (explicit for UK OGL, which excludes personal data). |
| Explicitly resale-prohibited | Nigeria CAC (ToS forbids resale/redistribution), Jersey (no reselling registry data), South Africa CIPC (2014 re-use restrictions) | Not redistributable, regardless of free availability. Availability ≠ redistributability. |
| Right-to-object / use-restricted personal data | Macau (Act 8/2005, statutory right to object to direct marketing + pre-disclosure) | Redistributable only with the marketing-objection right honoured; constrains outreach use specifically. |
| ShareAlike / non-commercial | Colombia RUES via datos.gov.co (CC BY-SA copyleft, may conflict with proprietary resale); OpenSanctions mirrors (CC-BY-NC, non-commercial only) | Copyleft requires same-licence redistribution (incompatible with a closed product without legal review); NC bars commercial resale entirely. |
| Sanctions-blocked | Russia EGRUL, Belarus EGR | Not usable by a Western product irrespective of openness (OFAC/EU restrictive measures). |
Operating rule. Build the resale product on the company-identity + firmographics row (clean and broad) and the corporate-contact row (clean where it exists); treat representative names and personal contact as a separately-gated layer requiring a per-jurisdiction legitimate-interest assessment; and exclude the prohibited, NC-only, and sanctions-blocked sources from the redistributable corpus (they may still inform internal resolution but not resale).
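The operating rule can be expressed as a small decision function. This is a sketch under stated assumptions: the verdict labels, the licence tokens and the choice to route unknown licences to legal review are ours, and the output is no substitute for the jurisdiction-specific sign-off discussed in Section 10.

```python
# Licence families named in the matrix above; the tokens themselves are our shorthand.
OPEN_LICENCES = {"CC0", "CC-BY", "OGL", "NLOD", "Licence-Ouverte", "ODC-BY"}
EXCLUDED = {"CC-BY-NC", "resale-prohibited", "sanctions-blocked"}
NEEDS_REVIEW = {"CC-BY-SA"}   # copyleft: may conflict with a closed product

def resale_verdict(licence: str, is_personal_data: bool) -> str:
    """Classify one attribute from one source for the redistributable corpus."""
    if licence in EXCLUDED:
        return "exclude"        # may still inform internal resolution, not resale
    if licence in NEEDS_REVIEW or licence not in OPEN_LICENCES:
        return "legal-review"   # assumption: unknown licences are never auto-approved
    if is_personal_data:
        return "gated-LIA"      # an open licence does not grant a personal-data basis
    return "redistribute"

company_identity = resale_verdict("CC0", is_personal_data=False)
director_name = resale_verdict("CC-BY", is_personal_data=True)
nc_mirror = resale_verdict("CC-BY-NC", is_personal_data=False)
copyleft_feed = resale_verdict("CC-BY-SA", is_personal_data=False)
```

Checking exclusion before anything else ensures that a prohibited or sanctions-blocked source cannot be promoted by a later rule.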
9. Recommendations (build / buy / skip)
- Adopt GLEIF as the universal identity spine (CC0, all 249 jurisdictions): it guarantees redistributable company identity everywhere and supplies the registeredAs join key.
- Build free connectors to the ~20–25 truly-open national registries for richer identity + industry (NO, DK, SE, FI, SG, CA, FR, DE, PL, CZ, GR, AR, CL, EC, AU, and the contactability-rich BR/MX). Prioritise the four contactability-rich sources (Norway, Brazil, Mexico, + Denmark/Romania phone): they are the only places contact comes free.
- Do not attempt to source contact from registries globally: it is absent almost everywhere (Section 3.1). Broad, legal, global contactability must be manufactured by a downstream enrichment layer off the identity spine (company domain → work-email pattern; OSINT for verification), governed by per-jurisdiction legitimate-interest assessments.
- Skip / exclude from resale: offshore-centre registries (restricted/paid; GLEIF identity suffices), explicitly resale-prohibited registries (Nigeria, Jersey, SA-CIPC), NC/ShareAlike aggregations (OpenSanctions, Colombia-without-review), and sanctions-blocked jurisdictions (Russia, Belarus).
- Buy commercial contact data only where (a) it is unavoidable for a priority market and (b) the vendor's legal basis is defensible (e.g. Cognism's notified-database + DNC model), and treat all vendor accuracy claims as un-audited.
- Close the instrumentation gap surfaced in Section 7: run the email-quality scorer so contactability is measured, not assumed, and report per-attribute fill against the genuine company-population denominator (not the survivorship-filtered person base).
10. Limitations & threats to validity
- Measurement apparatus (under-reporting bias). The verification harness's concurrency-correlated failure (Section 2.4) and the regional-batch claim-budget dilution (a 25-claim budget shared across 10 countries starves the less-documented ones) bias toward under-crediting sources. We mitigated with low concurrency, smaller batches and higher claim budgets, and re-ran affected jurisdictions; residual risk is that a given country has a better free source than we credited, never that we asserted a false one (adversarial verification + "what was NOT found" recording).
- Long-tail coverage. ~90 economies were deep-audited; the remainder (micro-states, several Sub-Saharan and Central-Asian states) are covered by GLEIF identity plus the regional pattern, and are marked as such per ISO-2 code so that depth of coverage is transparent. This is appropriate for the economically material set but is not a bespoke audit of all 249.
- Vendor claims. All proprietary coverage/accuracy figures (Section 5) are vendor-asserted and not independently audited.
- Single-platform Gold snapshot. Section 7 measures one production platform at one time; the structure (broad identity / narrow US-biased contact) is the generalizable finding, not the exact percentages.
- Licence interpretation is not legal advice. Redistribution verdicts (Section 8) synthesise published licence text and DP regimes; a production resale posture requires jurisdiction-specific legal sign-off, especially for personal data and the ShareAlike/NC edge cases.
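The claim-budget dilution noted in the first limitation can be checked with simple arithmetic. The sketch below assumes an even split of the budget across countries and compares it with the eight-attribute rubric; the even split and the idea of relating claims to attributes are assumptions made here for illustration, since the harness's actual allocation is not specified in this section.

```python
from fractions import Fraction

RUBRIC_ATTRIBUTES = 8  # the fixed eight-attribute rubric


def mean_claims_per_unit(claim_budget: int, units: int) -> Fraction:
    """Average number of claims available per unit under an even split."""
    if units <= 0:
        raise ValueError("units must be positive")
    return Fraction(claim_budget, units)


# A 25-claim budget shared across 10 countries.
per_country = mean_claims_per_unit(25, 10)        # Fraction(5, 2)
per_attribute = per_country / RUBRIC_ATTRIBUTES   # Fraction(5, 16)
```

Even under an even split, a country receives 2.5 claims on average, fewer than one claim per rubric attribute; any uneven consumption by well-documented countries leaves the others with less still.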
11. Appendix A: ISO-2 coverage ledger
Coverage by jurisdiction (D = deep per-jurisdiction audit; C = production connector built; R = regional-batch audit; G = GLEIF-identity backstop only; verdict where established). The denominator is ISO 3166-1 minus the explicit skip-list; every code not marked D/C/R is G (identity guaranteed, no bespoke audit), making any gap visible by construction.
| Region | Jurisdictions (method · verdict) |
|---|---|
| North America | US (D · MIXED), CA (C · BUILD-FREE identity, OGL), MX (D/C · BUILD-FREE contact-rich), territories PR/GL→DK (T3) |
| Western/Northern Europe | NO (R · BUILD-FREE, contact triple), DK (R · BUILD-FREE +phone, covers GL), SE (R · BUILD-FREE), FI (R · BUILD-FREE), IE (R · gated), CH (R · BUILD-GATED), BE (R · gated), AT (R · PAID), PT (R · SKIP firmographics), LU (R · licensing-strong/unconfirmed), DE (C · officers CC-BY), FR (C · Licence Ouverte), NL (D · gated), IT (D · paid/closed), ES (D · gated) |
| Central/Eastern Europe | PL (R · BUILD-FREE CC0), CZ (R · BUILD-FREE), GR (R · BUILD-FREE ODC-BY), HU (R · PAID-automation), RO (R · BUILD-GATED +phone), UA (R · gated, wartime), RU (R · SANCTIONS-BLOCKED), BY (R · SANCTIONS-BLOCKED) |
| LATAM | BR (C · contact-rich), AR (R · BUILD-FREE +directors), CL (R · BUILD-FREE), CO (C · CC-BY-SA gated), EC (C · BUILD-FREE), PE (C · gated), VE (R · SKIP), DO (R · gated), PA/CR (R · unconfirmed→G) |
| APAC | JP (C), KR (C), SG (C · OGL), IN (C · GODL company-only), AU (D · ABR CC-BY identity), ID (D · PAID), TH (R · gated), VN (R · gated), MY (R · directors paid), PK (R · restricted), BD/LK (R · gated), PH (C · stub), TW (D · BUILD-GATED +directors), HK (D · identity free/directors paid) |
| MENA / Israel | IL (C · identity), TR/EG/QA/KW/BH/OM/JO/LB/MA/TN/DZ (R · MENA wave), SA (paid), AE (C · gated) |
| Sub-Saharan Africa | NG (R · search-only, resale-prohibited), ZA (R · restricted), KE/GH/TZ/SN (R · search-only), CI (OHADA), ET/UG/RW (G + regional pattern) |
| Eurasia / C. Asia | KZ (R · BUILD-GATED +directors), GE (R · BUILD-GATED), AZ (R · closed), UZ (R · unconfirmed→G) |
| SARs & offshore | MO (D · free +directors, PDPA), KY/VG/BM/JE/GG/IM/GI (T2 · restricted/paid → G identity), SX (T3 · free +directors), XK (T3 · search-only) |
| Universal | GLEIF/LEI, all 249 ISO-3166 jurisdictions, CC0 identity (the floor beneath every cell above) |
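
The "G by default" rule that makes gaps visible by construction can be sketched as follows. The function name `build_ledger`, the tiny code set and the placeholder skip entry `ZZ` are hypothetical; the real enumeration comes from ISO 3166-1 and the actual skip-list, neither of which is reproduced here.

```python
def build_ledger(all_codes, skip_list, audited):
    """Assign a method marker to every in-scope jurisdiction.

    all_codes: iterable of ISO-2 codes (the full enumeration).
    skip_list: codes explicitly excluded from the denominator.
    audited:   mapping of code -> "D", "C" or "R".
    Any in-scope code without an explicit marker falls back to "G".
    """
    all_codes = set(all_codes)
    unknown = set(audited) - all_codes
    if unknown:
        # An audited code outside the enumeration signals a data error.
        raise ValueError(f"audited codes not in enumeration: {sorted(unknown)}")
    in_scope = all_codes - set(skip_list)
    return {code: audited.get(code, "G") for code in sorted(in_scope)}


# Toy enumeration; "ZZ" is a placeholder standing in for a skip-list entry.
ledger = build_ledger(
    all_codes={"US", "NO", "BR", "ET", "ZZ"},
    skip_list={"ZZ"},
    audited={"US": "D", "NO": "R", "BR": "C"},
)
```

Because the fallback is applied to the enumeration rather than to the audited set, a jurisdiction cannot silently drop out of the ledger: it is either explicitly skipped or carries a marker.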
Appendix B: Artifacts & reproducibility
All per-jurisdiction audit matrices, the proprietary-sourcing and academic research outputs, and the Gold-quality analysis are persisted as structured artifacts under backend/research/global_sourcing/ (audits/<ISO2>.json for deep audits, audits/_<REGION>.json for regional batches, paper_research/), with the audit instrument (b2b-country-source-audit) and the program design (PROGRAM.md, JURISDICTIONS.md) version-controlled. Every claim in Sections 3 and 5–7 traces to a verified, citation-bearing artifact.
References
(Consolidated reference list: academic works cited in Section 6 with DOIs/venues, proprietary-vendor primary sources and the ZoomInfo settlement docket cited in Section 5, and the official registry/licence URLs cited per jurisdiction in Section 3. To be compiled from the per-section citations in the persisted artifacts; the data-quality-frameworks and OpenCorporates references attach with Section 6.3/6.4 on completion of the dedicated research run.)