validator

Sync Wikidata to target catalogs and enrich items when extra data is available.

checks

A set of checks to validate Wikidata against target catalogs.

soweego.validator.checks.bio(catalog, entity, wd_cache=None)

Validate identifiers against available biographical data.

Look for:

  • birth and death dates

  • birth and death places

  • gender

Also generate statements based on additional data found in the target catalog. They can be used to enrich Wikidata items.

How it works:

  1. gather data from the given catalog

  2. gather data from relevant Wikidata items

  3. look for shared data between pairs of Wikidata and catalog items:

      • when the pair does not share any data, the catalog identifier should be marked with a deprecated rank

      • when the catalog item has more data than the Wikidata one, the extra data should be added to the latter

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • wd_cache – (optional) a dict of biographical data gathered from Wikidata in a previous run

Return type

Optional[Tuple[defaultdict, Iterator, Iterator, Iterator, dict]]

Returns

5 objects

  1. dict of identifiers that should be deprecated

  2. generator of statements that should be added

  3. generator of shared statements that should be referenced

  4. generator of statements found in Wikidata but not in the target catalog

  5. dict of biographical data gathered from Wikidata

or None if the target catalog has no biographical data.
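The comparison in step 3 can be sketched with plain set operations. This is a minimal illustration, not soweego's actual implementation: the function name compare_bio, the input shapes, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def compare_bio(wikidata, catalog):
    """Compare biographical statements of linked Wikidata and catalog items.

    Hypothetical input shapes (not soweego's internal ones):
      wikidata: {QID: {"tid": catalog_id, "data": {(PID, value), ...}}}
      catalog:  {catalog_id: {(PID, value), ...}}
    """
    to_deprecate = defaultdict(set)
    to_add, to_reference, wd_only = [], [], []

    for qid, item in wikidata.items():
        tid = item["tid"]
        catalog_data = catalog.get(tid)
        if not catalog_data:
            continue  # the catalog has nothing to compare against

        shared = item["data"] & catalog_data
        if not shared:
            # No overlap at all: the identifier is a deprecation candidate
            to_deprecate[tid].add(qid)
            continue

        # Shared statements can be referenced, extra catalog ones added
        for pid, value in sorted(shared):
            to_reference.append((qid, pid, value, tid))
        for pid, value in sorted(catalog_data - item["data"]):
            to_add.append((qid, pid, value, tid))
        for pid, value in sorted(item["data"] - catalog_data):
            wd_only.append((qid, pid, value, tid))

    return to_deprecate, to_add, to_reference, wd_only


# Made-up sample data: P569 = date of birth, P570 = date of death, P21 = sex or gender
wikidata = {
    "Q1": {"tid": "a1", "data": {("P569", "1950-01-01"), ("P21", "male")}},
    "Q2": {"tid": "b2", "data": {("P569", "1970-05-05")}},
}
catalog = {
    "a1": {("P569", "1950-01-01"), ("P570", "2010-02-02")},
    "b2": {("P569", "1980-12-12")},
}
to_deprecate, to_add, to_reference, wd_only = compare_bio(wikidata, catalog)
```

In this sample, Q1 shares its birth date with catalog item a1, so the catalog's death date becomes an addition candidate, while Q2 shares nothing with b2 and the identifier b2 ends up in the deprecation dict.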

soweego.validator.checks.dead_ids(catalog, entity, wd_cache=None)

Look for dead identifiers in Wikidata. An identifier is dead if it does not exist in the given catalog when this function is executed.

Dead identifiers should be marked with a deprecated rank in Wikidata.

How it works:

  1. gather identifiers of the given catalog from relevant Wikidata items

  2. look them up in the given catalog

  3. if an identifier is not in the given catalog anymore, it should be deprecated

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • wd_cache – (optional) a dict of identifiers gathered from Wikidata in a previous run

Return type

Tuple[DefaultDict, Dict]

Returns

a pair of dicts: the dead identifiers, and the identifiers gathered from Wikidata
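The three steps can be sketched as a membership test against the catalog. The function name find_dead_ids and the dict shapes are assumptions for illustration only; in soweego the lookup runs against the imported catalog, which is replaced here by a plain set.

```python
from collections import defaultdict


def find_dead_ids(wd_ids, catalog_ids):
    """Return catalog identifiers found in Wikidata but absent from the catalog.

    Hypothetical input shapes:
      wd_ids:      {QID: {catalog_id, ...}} gathered from Wikidata (step 1)
      catalog_ids: set of identifiers that currently exist in the catalog
    """
    dead = defaultdict(set)
    for qid, tids in wd_ids.items():
        for tid in tids:
            # Steps 2 and 3: look the identifier up, flag it when missing
            if tid not in catalog_ids:
                dead[tid].add(qid)
    # Hand the Wikidata identifiers back so a later run can reuse them as a cache
    return dead, wd_ids


# Made-up sample data
wd_ids = {"Q1": {"a1", "a9"}, "Q2": {"b2"}}
catalog_ids = {"a1", "b2"}
dead, cache = find_dead_ids(wd_ids, catalog_ids)
```

Returning the gathered identifiers alongside the result mirrors the documented return pair and is what makes the wd_cache parameter useful on a second run.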

soweego.validator.checks.links(catalog, entity, url_blacklist=False, wd_cache=None)

Validate identifiers against available links.

Also generate statements based on additional links found in the target catalog. They can be used to enrich Wikidata items.

How it works:

  1. gather links from the target catalog

  2. gather links from relevant Wikidata items

  3. look for shared links between pairs of Wikidata and catalog items:

      • when the pair does not share any link, the catalog identifier should be marked with a deprecated rank

      • when the catalog item has more links than the Wikidata one, they should be added to the latter

  4. try to extract third-party identifiers from extra links

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • url_blacklist – (optional) whether to apply a blacklist of URL domains. Default: False

  • wd_cache – (optional) a dict of links gathered from Wikidata in a previous run. Default: None

Return type

Optional[Tuple[defaultdict, list, list, list, list, list, dict]]

Returns

7 objects

  1. dict of identifiers that should be deprecated

  2. list of third-party identifiers that should be added

  3. list of URLs that should be added

  4. list of third-party identifiers that should be referenced

  5. list of URLs that should be referenced

  6. list of URLs found in Wikidata but not in the target catalog

  7. dict of links gathered from Wikidata

or None if the target catalog has no links.
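Steps 3 and 4 can be sketched as a set comparison followed by pattern matching on the extra URLs. Everything named here is an assumption for illustration: the function names, the input shapes, and the one-entry ID_PATTERNS table, which is not how soweego resolves identifiers internally.

```python
import re
from collections import defaultdict

# Hypothetical pattern table: Wikidata property -> regex capturing the identifier.
# P214 is the VIAF ID property; a real table would cover many more sites.
ID_PATTERNS = {
    "P214": re.compile(r"^https?://viaf\.org/viaf/(\d+)/?$"),
}


def extract_id(url):
    """Return (PID, identifier) when the URL matches a known pattern, else None."""
    for pid, pattern in ID_PATTERNS.items():
        match = pattern.match(url)
        if match:
            return pid, match.group(1)
    return None


def compare_links(wikidata, catalog):
    """Hypothetical input shapes:
      wikidata: {QID: {"tid": catalog_id, "urls": {url, ...}}}
      catalog:  {catalog_id: {url, ...}}
    """
    to_deprecate = defaultdict(set)
    ids_to_add, urls_to_add = [], []

    for qid, item in wikidata.items():
        tid = item["tid"]
        catalog_urls = catalog.get(tid)
        if not catalog_urls:
            continue  # no catalog links for this identifier
        if not item["urls"] & catalog_urls:
            to_deprecate[tid].add(qid)  # step 3, first case: nothing shared
            continue
        # Step 3, second case, then step 4 on each extra link
        for url in sorted(catalog_urls - item["urls"]):
            found = extract_id(url)
            if found:
                ids_to_add.append((qid, *found, tid))
            else:
                urls_to_add.append((qid, url, tid))

    return to_deprecate, ids_to_add, urls_to_add


# Made-up sample data
wikidata = {
    "Q1": {"tid": "a1", "urls": {"https://example.org/artist"}},
    "Q2": {"tid": "b2", "urls": {"https://example.org/other"}},
}
catalog = {
    "a1": {
        "https://example.org/artist",
        "https://viaf.org/viaf/12345",
        "https://example.com/fanpage",
    },
    "b2": {"https://example.net/unrelated"},
}
to_deprecate, ids_to_add, urls_to_add = compare_links(wikidata, catalog)
```

Splitting extra links into identifiers and plain URLs matches the separate lists in the documented return value: a recognized identifier can be stated with its own property, while an unrecognized URL can only be added as a link.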

enrichment

Enrichment of Wikidata based on data available in target catalogs.

soweego.validator.enrichment.generate_statements(catalog, entity, bucket_size=5000)

Generate statements about works by people.

How it works:

  1. gather works and people identifiers of the given catalog from relevant Wikidata items

  2. leverage catalog relationships between works and people

  3. build Wikidata statements accordingly

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • bucket_size (int) – (optional) how many target IDs should be looked up in the given catalog at a time, for efficiency. Default: 5000

Return type

Iterator[Tuple]

Returns

the statements generator, yielding (work_QID, PID, person_QID, person_catalog_ID) tuples
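The bucketed lookup and the shape of the yielded tuples can be sketched as follows. This is an illustration under stated assumptions, not soweego's code: the names buckets, generate_work_statements, and lookup are hypothetical, and the catalog query is replaced by an in-memory list of (work ID, person ID) pairs.

```python
from itertools import islice


def buckets(ids, size):
    """Split an iterable of identifiers into lists of at most `size` items."""
    iterator = iter(ids)
    while True:
        bucket = list(islice(iterator, size))
        if not bucket:
            return
        yield bucket


def generate_work_statements(work_ids, person_ids, relationships, pid,
                             bucket_size=5000):
    """Yield (work_QID, PID, person_QID, person_catalog_ID) tuples.

    Hypothetical inputs:
      work_ids:      {catalog work ID: work QID}
      person_ids:    {catalog person ID: person QID}
      relationships: callable taking a list of catalog work IDs and returning
                     (work ID, person ID) pairs; stands in for a catalog query
    """
    for bucket in buckets(work_ids, bucket_size):
        for work_tid, person_tid in relationships(bucket):
            person_qid = person_ids.get(person_tid)
            if person_qid is None:
                continue  # the person has no linked Wikidata item
            yield work_ids[work_tid], pid, person_qid, person_tid


# Made-up sample data; P175 is the Wikidata "performer" property
work_ids = {"w1": "Q10", "w2": "Q20", "w3": "Q30"}
person_ids = {"p1": "Q1", "p2": "Q2"}
catalog_pairs = [("w1", "p1"), ("w2", "p2"), ("w3", "p9")]


def lookup(bucket):
    wanted = set(bucket)
    return [pair for pair in catalog_pairs if pair[0] in wanted]


statements = list(
    generate_work_statements(work_ids, person_ids, lookup, "P175", bucket_size=2)
)
```

Looking identifiers up one bucket at a time keeps each query bounded in size, which is the efficiency concern that bucket_size addresses.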