validator

Sync Wikidata to target catalogs and enrich items when extra data is available.

checks

A set of checks to validate Wikidata against target catalogs.

soweego.validator.checks.bio(catalog, entity, wd_cache=None)

Validate identifiers against available biographical data.

Look for:

  • birth and death dates

  • birth and death places

  • gender

Also generate statements based on additional data found in the given catalog. They can be used to enrich Wikidata items.

How it works:

  1. gather data from the given catalog

  2. gather data from relevant Wikidata items

  3. look for shared data between pairs of Wikidata and catalog items:

  • when the pair does not share any data, the catalog identifier should be marked with a deprecated rank

  • when the catalog item has more data than the Wikidata one, the extra data should be added to the latter

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • wd_cache – (optional) a dict of biographical data gathered from Wikidata in a previous run

Return type

Tuple[Defaultdict[~KT, ~VT], Iterator[+T_co], Dict[~KT, ~VT]]

Returns

3 objects

  1. dict of identifiers that should be deprecated

  2. generator of statements that should be added

  3. dict of biographical data gathered from Wikidata
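The comparison in step 3 can be sketched in plain Python. This is a minimal illustration, not soweego's implementation: the function name compare_bio, the input shapes, and the choice to skip pairs where either side has no data are all assumptions made for the example. Values are compared as exact strings, whereas a real check would have to deal with date precision and value normalization.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

# Hypothetical shapes: a claim is a (PID, value) pair,
# e.g. ('P569', '1940-10-09') for a birth date.
Claims = Set[Tuple[str, str]]


def compare_bio(
    pairs: List[Tuple[str, str]],
    wd_claims: Dict[str, Claims],
    catalog_claims: Dict[str, Claims],
) -> Tuple[DefaultDict[str, Set[str]], List[Tuple[str, str, str, str]]]:
    """Compare (QID, catalog ID) pairs on their biographical claims."""
    to_deprecate: DefaultDict[str, Set[str]] = defaultdict(set)
    to_add: List[Tuple[str, str, str, str]] = []
    for qid, catalog_id in pairs:
        wd = wd_claims.get(qid, set())
        cat = catalog_claims.get(catalog_id, set())
        if not wd or not cat:
            # Assumption: nothing to compare, so no verdict on this pair
            continue
        if not wd & cat:
            # No shared data: the identifier is a deprecation candidate
            to_deprecate[catalog_id].add(qid)
            continue
        # Shared data found: extra catalog claims can enrich the item
        for pid, value in sorted(cat - wd):
            to_add.append((qid, pid, value, catalog_id))
    return to_deprecate, to_add


# P569 = date of birth, P570 = date of death, P21 = sex or gender
wd = {
    'Q1': {('P569', '1940-10-09'), ('P21', 'Q6581097')},
    'Q2': {('P569', '1950-01-01')},
}
cat = {
    'a1': {('P569', '1940-10-09'), ('P570', '1980-12-08')},
    'b2': {('P569', '1999-09-09')},
}
deprecate, add = compare_bio([('Q1', 'a1'), ('Q2', 'b2')], wd, cat)
```

In this sketch, b2 ends up as a deprecation candidate for Q2 because the two share no claim, while Q1 gains the death date that only the catalog knows about.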

soweego.validator.checks.dead_ids(catalog, entity, wd_cache=None)

Look for dead identifiers in Wikidata. An identifier is dead if it does not exist in the given catalog when this function is executed.

Dead identifiers should be marked with a deprecated rank in Wikidata.

How it works:

  1. gather identifiers of the given catalog from relevant Wikidata items

  2. look them up in the given catalog

  3. if an identifier is not in the given catalog anymore, it should be deprecated

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • wd_cache – (optional) a dict of identifiers gathered from Wikidata in a previous run

Return type

Tuple[Defaultdict[~KT, ~VT], Dict[~KT, ~VT]]

Returns

a pair of dicts: the dead identifiers, and the identifiers gathered from Wikidata
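The lookup described in the steps above amounts to a set membership test. The following sketch assumes the Wikidata identifiers are available as a dict from QID to a set of catalog identifiers, and that the live catalog identifiers fit in a set; find_dead_ids and both input shapes are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set


def find_dead_ids(
    wd_ids: Dict[str, Set[str]], live_ids: Set[str]
) -> DefaultDict[str, Set[str]]:
    """Map each catalog ID missing from the catalog to the QIDs that hold it."""
    dead: DefaultDict[str, Set[str]] = defaultdict(set)
    for qid, catalog_ids in wd_ids.items():
        for catalog_id in catalog_ids:
            if catalog_id not in live_ids:
                # The identifier no longer exists in the catalog
                dead[catalog_id].add(qid)
    return dead


wd_ids = {'Q1': {'a1', 'a2'}, 'Q2': {'b1'}, 'Q3': {'a2'}}
live_ids = {'a1', 'b1'}
dead = find_dead_ids(wd_ids, live_ids)
```

Keying the result on the catalog identifier rather than the QID keeps together all items that carry the same dead identifier.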

soweego.validator.checks.links(catalog, entity, wd_cache=None)

Validate identifiers against available links.

Also generate statements based on additional links found in the given catalog. They can be used to enrich Wikidata items.

How it works:

  1. gather links from the given catalog

  2. gather links from relevant Wikidata items

  3. look for shared links between pairs of Wikidata and catalog items:

  • when the pair does not share any link, the catalog identifier should be marked with a deprecated rank

  • when the catalog item has more links than the Wikidata one, the extra links should be added to the latter

  4. try to extract third-party identifiers from extra links

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • wd_cache – (optional) a dict of links gathered from Wikidata in a previous run

Return type

Tuple[Defaultdict[~KT, ~VT], List[~T], List[~T], Dict[~KT, ~VT]]

Returns

4 objects

  1. dict of identifiers that should be deprecated

  2. list of third-party identifiers that should be added

  3. list of URLs that should be added

  4. dict of links gathered from Wikidata
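A rough sketch of the link comparison and of the identifier extraction step follows. Everything here is an assumption made for illustration: the names compare_links, extract_id and ID_PATTERNS are hypothetical, the two URL patterns are examples rather than soweego's actual rules, and URLs are compared as exact strings with no normalization.

```python
import re
from collections import defaultdict

# Illustrative patterns only: PID -> regex capturing the identifier.
# P345 = IMDb ID, P2002 = Twitter (X) username.
ID_PATTERNS = {
    'P345': re.compile(r'^https?://www\.imdb\.com/name/(nm\d+)/?$'),
    'P2002': re.compile(r'^https?://twitter\.com/(\w+)/?$'),
}


def extract_id(url):
    """Return (PID, identifier) if the URL matches a known pattern, else None."""
    for pid, pattern in ID_PATTERNS.items():
        match = pattern.match(url)
        if match:
            return pid, match.group(1)
    return None


def compare_links(pairs, wd_links, catalog_links):
    """Compare (QID, catalog ID) pairs on their sets of URLs."""
    to_deprecate = defaultdict(set)
    ext_ids, urls = [], []
    for qid, catalog_id in pairs:
        wd = wd_links.get(qid, set())
        cat = catalog_links.get(catalog_id, set())
        if not wd or not cat:
            continue  # assumption: no verdict without links on both sides
        if not wd & cat:
            to_deprecate[catalog_id].add(qid)
            continue
        for url in sorted(cat - wd):
            found = extract_id(url)
            if found:
                # The extra link hides a third-party identifier
                ext_ids.append((qid, found[0], found[1], catalog_id))
            else:
                urls.append((qid, url, catalog_id))
    return to_deprecate, ext_ids, urls


wd_links = {'Q1': {'https://example.org/lennon'}, 'Q2': {'https://example.org/x'}}
catalog_links = {
    'a1': {
        'https://example.org/lennon',
        'https://twitter.com/johnlennon',
        'https://example.com/fanpage',
    },
    'b2': {'https://example.net/y'},
}
deprecate, ext_ids, urls = compare_links(
    [('Q1', 'a1'), ('Q2', 'b2')], wd_links, catalog_links
)
```

Splitting the extra links into identifiers and plain URLs mirrors the two separate lists in the return value: an identifier can become an external-ID statement, while a plain URL can only be added as a link.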

enrichment

Enrichment of Wikidata based on data available in target catalogs.

soweego.validator.enrichment.generate_statements(catalog, entity, bucket_size=5000)

Generate statements about works by people.

How it works:

  1. gather works and people identifiers of the given catalog from relevant Wikidata items

  2. leverage catalog relationships between works and people

  3. build Wikidata statements accordingly

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • bucket_size (int) – (optional) how many target IDs to look up in the given catalog at a time, for efficiency

Return type

Iterator[Tuple]

Returns

the statements generator, yielding (work_QID, PID, person_QID, person_catalog_ID) tuples
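The role of bucket_size and the shape of the yielded tuples can be illustrated with a small generator. This is a sketch under stated assumptions, not the library's code: buckets and works_by_people are hypothetical helpers, an in-memory dict stands in for the catalog's work-to-people relationships, and the PID is passed in by the caller (P175, performer, in the example).

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def buckets(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split an iterable of IDs into lists of at most `size` items."""
    iterator = iter(ids)
    while True:
        bucket = list(islice(iterator, size))
        if not bucket:
            return
        yield bucket


def works_by_people(
    work_qids: Dict[str, str],           # work catalog ID -> work QID
    person_qids: Dict[str, str],         # person catalog ID -> person QID
    relationships: Dict[str, Set[str]],  # work catalog ID -> person catalog IDs
    pid: str,
    bucket_size: int = 5000,
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (work_QID, PID, person_QID, person_catalog_ID) tuples."""
    for bucket in buckets(work_qids, bucket_size):
        # A real run would issue one catalog query per bucket here
        for work_id in bucket:
            for person_id in sorted(relationships.get(work_id, ())):
                person_qid = person_qids.get(person_id)
                if person_qid is not None:
                    # Both the work and the person are known to Wikidata
                    yield work_qids[work_id], pid, person_qid, person_id


work_qids = {'w1': 'Q100', 'w2': 'Q200', 'w3': 'Q300'}
person_qids = {'p1': 'Q1', 'p2': 'Q2'}
relationships = {'w1': {'p1', 'p2'}, 'w2': {'p9'}, 'w3': {'p2'}}
statements = list(
    works_by_people(work_qids, person_qids, relationships, 'P175', bucket_size=2)
)
```

Looking identifiers up in buckets bounds the size of each catalog query, and yielding tuples lazily avoids holding every statement in memory at once.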