validator
Sync Wikidata to target catalogs and enrich items when extra data is available.
checks
A set of checks to validate Wikidata against target catalogs.
- soweego.validator.checks.bio(catalog, entity, wd_cache=None)
Validate identifiers against available biographical data.
Look for:
birth and death dates
birth and death places
gender
Also generate statements based on additional data found in the target catalog. They can be used to enrich Wikidata items.
How it works:
gather data from the given catalog
gather data from relevant Wikidata items
look for shared data between pairs of Wikidata and catalog items:
when the pair does not share any data, the catalog identifier should be marked with a deprecated rank
when the catalog item has more data than the Wikidata one, the extra data should be added to the latter
- Parameters
- Return type
Optional[Tuple[defaultdict, Iterator, Iterator, Iterator, dict]]
- Returns
5 objects:
dict of identifiers that should be deprecated
generator of statements that should be added
generator of shared statements that should be referenced
generator of statements found in Wikidata but not in the target catalog
dict of biographical data gathered from Wikidata
or None if the target catalog has no biographical data.
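The comparison step described above can be sketched in plain Python. This is only an illustration of the idea, not soweego's implementation: the function name `compare_bio`, the input shapes, and the sample values are all assumptions made for the example, and the sketch skips pairs where either side has no data at all.

```python
from collections import defaultdict


def compare_bio(wikidata, catalog):
    """Compare biographical data of linked Wikidata and catalog items.

    Hypothetical input shapes (not soweego's own):
      wikidata: QID -> {"ids": set of catalog IDs, "data": set of (property, value)}
      catalog:  catalog ID -> set of (property, value)
    """
    to_deprecate = defaultdict(set)  # catalog ID -> QIDs holding it
    to_add, to_reference = [], []

    for qid, item in wikidata.items():
        for catalog_id in item["ids"]:
            catalog_data = catalog.get(catalog_id)
            # Assumption: nothing can be validated when one side is empty
            if not catalog_data or not item["data"]:
                continue

            shared = item["data"] & catalog_data
            if not shared:
                # No data in common: the identifier is a deprecation candidate
                to_deprecate[catalog_id].add(qid)
                continue

            # Shared data can be referenced, extra catalog data can be added
            extra = catalog_data - item["data"]
            to_reference.extend((qid, p, v, catalog_id) for p, v in sorted(shared))
            to_add.extend((qid, p, v, catalog_id) for p, v in sorted(extra))

    return to_deprecate, to_add, to_reference


# Made-up sample data
wikidata = {
    "Q1": {"ids": {"a1"}, "data": {("gender", "male"), ("birth date", "1950-01-01")}},
    "Q2": {"ids": {"b2"}, "data": {("birth date", "1980-05-05")}},
}
catalog = {
    "a1": {("birth date", "1950-01-01"), ("birth place", "Rome")},
    "b2": {("birth date", "1901-12-31")},
}

to_deprecate, to_add, to_reference = compare_bio(wikidata, catalog)
```

With this sample, `b2` shares nothing with `Q2` and ends up in `to_deprecate`, while the birth place found only in catalog item `a1` becomes a statement to add to `Q1`.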
- soweego.validator.checks.dead_ids(catalog, entity, wd_cache=None)
Look for dead identifiers in Wikidata. An identifier is dead if it does not exist in the given catalog when this function is executed.
Dead identifiers should be marked with a deprecated rank in Wikidata.
How it works:
gather identifiers of the given catalog from relevant Wikidata items
look them up in the given catalog
if an identifier is not in the given catalog anymore, it should be deprecated
- Parameters
- Return type
- Returns
the dict pair of dead identifiers and identifiers gathered from Wikidata
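The lookup described above boils down to a set membership test. The following is a minimal sketch under stated assumptions: `find_dead_ids` and its input shapes are hypothetical, and the catalog is represented as an in-memory set rather than the database soweego would query.

```python
from collections import defaultdict


def find_dead_ids(wd_ids, catalog_ids):
    """Return (dead identifiers, identifiers gathered from Wikidata).

    Hypothetical input shapes:
      wd_ids:      QID -> set of catalog identifiers found on that item
      catalog_ids: set of identifiers that currently exist in the catalog
    """
    dead = defaultdict(set)  # dead identifier -> QIDs that hold it
    for qid, identifiers in wd_ids.items():
        for identifier in identifiers:
            if identifier not in catalog_ids:
                # Not in the catalog anymore: deprecation candidate
                dead[identifier].add(qid)
    return dead, wd_ids


# Made-up sample data
wd_ids = {"Q1": {"111", "222"}, "Q2": {"333"}}
catalog_ids = {"111", "333"}

dead, gathered = find_dead_ids(wd_ids, catalog_ids)
```

Here identifier `222` is no longer in the catalog, so it is reported as dead for `Q1`. Returning the gathered identifiers alongside the dead ones mirrors the documented return value and would let a caller reuse them in a later run.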
- soweego.validator.checks.links(catalog, entity, url_blacklist=False, wd_cache=None)
Validate identifiers against available links.
Also generate statements based on additional links found in the target catalog. They can be used to enrich Wikidata items.
How it works:
gather links from the target catalog
gather links from relevant Wikidata items
look for shared links between pairs of Wikidata and catalog items:
when the pair does not share any link, the catalog identifier should be marked with a deprecated rank
when the catalog item has more links than the Wikidata one, the extra links should be added to the latter
try to extract third-party identifiers from extra links
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
url_blacklist – (optional) whether to apply a blacklist of URL domains. Default: False
wd_cache – (optional) a dict of links gathered from Wikidata in a previous run. Default: None
- Return type
Optional[Tuple[defaultdict, list, list, list, list, list, dict]]
- Returns
7 objects:
dict of identifiers that should be deprecated
list of third-party identifiers that should be added
list of URLs that should be added
list of third-party identifiers that should be referenced
list of URLs that should be referenced
list of URLs found in Wikidata but not in the target catalog
dict of links gathered from Wikidata
or None if the target catalog has no links.
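Because the return value is either None or a 7-item tuple, callers need to handle both cases before unpacking. The helper below is a hypothetical sketch (`summarize_links_result` is not part of soweego) that takes a value shaped like the documented one and counts what is in it. It is exercised with a made-up tuple, since an actual call needs a working soweego setup.

```python
from collections import defaultdict


def summarize_links_result(result):
    """Count the items in a value shaped like the documented return of links().

    Returns None when the target catalog has no links.
    """
    if result is None:
        return None

    # Order follows the 7 objects listed in the documentation
    (deprecate, ids_to_add, urls_to_add,
     ids_to_reference, urls_to_reference,
     wikidata_only_urls, wd_links) = result

    return {
        "deprecate": len(deprecate),
        "add_ids": len(ids_to_add),
        "add_urls": len(urls_to_add),
        "reference_ids": len(ids_to_reference),
        "reference_urls": len(urls_to_reference),
        "wikidata_only_urls": len(wikidata_only_urls),
        "wikidata_items": len(wd_links),
    }


# Made-up value standing in for a real result, e.g. one returned by
# soweego.validator.checks.links('musicbrainz', 'musician')
fake_result = (
    defaultdict(set, {"x1": {"Q1"}}),
    [],
    ["https://example.org/a", "https://example.org/b"],
    [],
    ["https://example.org/c"],
    [],
    {"Q1": {"https://example.org/c"}},
)

summary = summarize_links_result(fake_result)
no_links = summarize_links_result(None)
```

The last element of the tuple is the dict of links gathered from Wikidata, which matches the description of `wd_cache` and could presumably be passed back in a later run to skip the gathering step.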
enrichment
Enrichment of Wikidata based on data available in target catalogs.
- soweego.validator.enrichment.generate_statements(catalog, entity, bucket_size=5000)
Generate statements about works by people.
How it works:
gather works and people identifiers of the given catalog from relevant Wikidata items
leverage catalog relationships between works and people
build Wikidata statements accordingly
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
bucket_size (int) – (optional) how many target IDs should be looked up in the given catalog, for efficiency purposes. Default: 5000
- Return type
- Returns
the statements generator, yielding (work_QID, PID, person_QID, person_catalog_ID) tuples
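One plausible reading of `bucket_size` is that identifiers are looked up in fixed-size batches rather than all at once. The sketch below illustrates that idea together with the shape of the yielded tuples. It is an assumption-laden illustration, not soweego's code: `buckets`, `generate_work_statements`, the `lookup` callable, the input dicts, and the `P_EXAMPLE` placeholder property are all hypothetical.

```python
from itertools import islice


def buckets(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        bucket = list(islice(iterator, size))
        if not bucket:
            return
        yield bucket


def generate_work_statements(works, people, lookup, pid, bucket_size=5000):
    """Yield (work_QID, PID, person_QID, person_catalog_ID) tuples.

    Hypothetical inputs:
      works:  catalog work ID -> work QID
      people: catalog person ID -> person QID
      lookup: callable taking a list of work IDs and returning
              (work ID, person ID) pairs known to the catalog
    """
    for bucket in buckets(works, bucket_size):
        for work_id, person_id in lookup(bucket):
            person_qid = people.get(person_id)
            if person_qid is None:
                continue  # the person has no Wikidata item linked yet
            yield works[work_id], pid, person_qid, person_id


# Made-up sample data
works = {"w1": "Q10", "w2": "Q20", "w3": "Q30"}
people = {"p1": "Q1", "p2": "Q2"}
relationships = {"w1": ["p1"], "w2": ["p2", "p9"], "w3": []}
calls = []  # records each bucket that is looked up


def lookup(bucket):
    calls.append(list(bucket))
    return [(w, p) for w in bucket for p in relationships[w]]


statements = list(
    generate_work_statements(works, people, lookup, "P_EXAMPLE", bucket_size=2)
)
```

With three works and a bucket size of 2, the catalog is queried twice. Person `p9` is skipped because no Wikidata item is linked to it in the sample data.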