validator
Sync Wikidata to target catalogs and enrich items when extra data is available.
checks
A set of checks to validate Wikidata against target catalogs.
- soweego.validator.checks.bio(catalog, entity, wd_cache=None)
Validate identifiers against available biographical data.
Look for:
birth and death dates
birth and death places
gender
Also generate statements based on additional data found in the target catalog. They can be used to enrich Wikidata items.
How it works:
gather data from the given catalog
gather data from relevant Wikidata items
look for shared data between pairs of Wikidata and catalog items:
when the pair does not share any data, the catalog identifier should be marked with a deprecated rank
when the catalog item has more data than the Wikidata one, the extra data should be added to the latter
- Parameters
- Return type
Optional[Tuple[defaultdict, Iterator, Iterator, Iterator, dict]]
- Returns
5 objects:
dict of identifiers that should be deprecated
generator of statements that should be added
generator of shared statements that should be referenced
generator of statements found in Wikidata but not in the target catalog
dict of biographical data gathered from Wikidata
or None if the target catalog has no biographical data.
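The comparison step described under "How it works" can be sketched as plain set operations. This is a minimal illustration, not soweego's implementation: the function name compare_bio, the input shapes, and the sample identifiers are all assumptions, and biographical data is modelled as sets of (PID, value) pairs.

```python
from collections import defaultdict


def compare_bio(wd_items, catalog_items):
    """Compare biographical data between Wikidata and catalog items.

    wd_items: {QID: {"tid": catalog_id, "data": {(PID, value), ...}}}
    catalog_items: {catalog_id: {(PID, value), ...}}
    Both shapes are hypothetical, chosen for illustration only.
    """
    to_deprecate = defaultdict(set)  # catalog ID -> QIDs holding it
    to_add, to_reference, wd_only = [], [], []

    for qid, item in wd_items.items():
        tid, wd_data = item["tid"], item["data"]
        target_data = catalog_items.get(tid)
        if not target_data:
            continue  # the catalog has no data for this identifier

        shared = wd_data & target_data
        if not shared:
            # No shared data at all: the identifier is a deprecation candidate
            to_deprecate[tid].add(qid)
            continue

        # Shared data can be referenced, extra catalog data can be added
        to_reference += [(qid, p, v, tid) for p, v in sorted(shared)]
        to_add += [(qid, p, v, tid) for p, v in sorted(target_data - wd_data)]
        wd_only += [(qid, p, v, tid) for p, v in sorted(wd_data - target_data)]

    return to_deprecate, to_add, to_reference, wd_only
```

The sketch returns lists rather than generators to keep it short, and it omits the fifth returned object (the data gathered from Wikidata), which here would simply be wd_items.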
- soweego.validator.checks.dead_ids(catalog, entity, wd_cache=None)
Look for dead identifiers in Wikidata. An identifier is dead if it does not exist in the given catalog when this function is executed.
Dead identifiers should be marked with a deprecated rank in Wikidata.
How it works:
gather identifiers of the given catalog from relevant Wikidata items
look them up in the given catalog
if an identifier is not in the given catalog anymore, it should be deprecated
- Parameters
- Return type
- Returns
the dict pair of dead identifiers and identifiers gathered from Wikidata
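The lookup described above amounts to a membership test of each Wikidata-held identifier against the identifiers currently in the catalog. The following is a rough sketch under assumed inputs: find_dead_ids and both argument shapes are hypothetical, and a real run would query the catalog database instead of receiving a ready-made set.

```python
from collections import defaultdict


def find_dead_ids(wd_ids, catalog_ids):
    """Return (dead, wd_ids), mirroring the documented pair of dicts.

    wd_ids: {QID: set of catalog identifiers found on that item}
    catalog_ids: set of identifiers that exist in the catalog right now
    Both shapes are assumptions made for this example.
    """
    dead = defaultdict(set)  # dead catalog ID -> QIDs that still hold it
    for qid, tids in wd_ids.items():
        for tid in tids:
            if tid not in catalog_ids:
                dead[tid].add(qid)
    return dead, wd_ids
```

Because an identifier is dead only with respect to the catalog state at execution time, the result of such a check is worth recomputing against a fresh catalog dump before deprecating anything.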
- soweego.validator.checks.links(catalog, entity, url_blacklist=False, wd_cache=None)
Validate identifiers against available links.
Also generate statements based on additional links found in the target catalog. They can be used to enrich Wikidata items.
How it works:
gather links from the target catalog
gather links from relevant Wikidata items
look for shared links between pairs of Wikidata and catalog items:
when the pair does not share any link, the catalog identifier should be marked with a deprecated rank
when the catalog item has more links than the Wikidata one, they should be added to the latter
try to extract third-party identifiers from extra links
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
url_blacklist – (optional) whether to apply a blacklist of URL domains. Default: False
wd_cache – (optional) a dict of links gathered from Wikidata in a previous run. Default: None
- Return type
Optional[Tuple[defaultdict, list, list, list, list, list, dict]]
- Returns
7 objects:
dict of identifiers that should be deprecated
list of third-party identifiers that should be added
list of URLs that should be added
list of third-party identifiers that should be referenced
list of URLs that should be referenced
list of URLs found in Wikidata but not in the target catalog
dict of links gathered from Wikidata
or None if the target catalog has no links.
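For a single pair of Wikidata and catalog items, the link comparison and the extraction of third-party identifiers might look like the sketch below. Everything named here is an assumption for illustration: compare_links, extract_id, the ID_PATTERNS table with its single regular expression, and the blacklist passed as a set of domains (the documented url_blacklist parameter is a boolean switch).

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern table: Wikidata property -> URL regex capturing the ID
ID_PATTERNS = {
    "P2002": re.compile(r"^https?://(?:www\.)?twitter\.com/(\w+)/?$"),
}


def extract_id(url):
    """Return (PID, identifier) if the URL matches a known pattern, else None."""
    for pid, pattern in ID_PATTERNS.items():
        match = pattern.match(url)
        if match:
            return pid, match.group(1)
    return None


def compare_links(wd_links, catalog_links, blacklist=frozenset()):
    """Compare two sets of URLs for one Wikidata/catalog item pair."""
    # Drop catalog URLs whose domain is blacklisted
    catalog_links = {
        url for url in catalog_links if urlsplit(url).hostname not in blacklist
    }
    result = {
        "deprecate": False, "ids_to_add": [], "urls_to_add": [],
        "urls_to_reference": [], "wd_only": [],
    }
    shared = wd_links & catalog_links
    if not shared:
        result["deprecate"] = True  # no shared link: deprecation candidate
        return result

    result["urls_to_reference"] = sorted(shared)
    result["wd_only"] = sorted(wd_links - catalog_links)
    for url in sorted(catalog_links - wd_links):
        found = extract_id(url)
        if found:
            result["ids_to_add"].append(found)  # extra link yields an identifier
        else:
            result["urls_to_add"].append(url)  # keep it as a plain URL
    return result
```

The sketch does not separate shared links into identifiers and URLs to be referenced, as the documented return value does. Applying extract_id to the shared set would be one way to add that split.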
enrichment
Enrichment of Wikidata based on data available in target catalogs.
- soweego.validator.enrichment.generate_statements(catalog, entity, bucket_size=5000)
Generate statements about works by people.
How it works:
gather works and people identifiers of the given catalog from relevant Wikidata items
leverage catalog relationships between works and people
build Wikidata statements accordingly
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
bucket_size (int) – (optional) how many target IDs should be looked up in the given catalog at a time, for efficiency purposes. Default: 5000
- Return type
- Returns
the statements generator, yielding (work_QID, PID, person_QID, person_catalog_ID) tuples
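As an illustration of the steps above, the sketch below splits work identifiers into buckets and yields tuples in the documented (work_QID, PID, person_QID, person_catalog_ID) order. The names buckets and sketch_statements, the dict-based inputs, and the in-memory relationships mapping are assumptions. In a real run, each bucket would presumably correspond to one catalog query.

```python
from itertools import islice


def buckets(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def sketch_statements(works, people, relationships, pid, bucket_size=5000):
    """Yield (work_QID, PID, person_QID, person_catalog_ID) tuples.

    works: {work_catalog_id: work_QID}
    people: {person_catalog_id: person_QID}
    relationships: {work_catalog_id: [person_catalog_id, ...]}
    All three shapes are hypothetical stand-ins for Wikidata and catalog data.
    """
    for bucket in buckets(sorted(works), bucket_size):
        # Stand-in for one catalog lookup covering the whole bucket
        for work_id in bucket:
            for person_id in relationships.get(work_id, ()):
                person_qid = people.get(person_id)
                if person_qid is not None:  # skip people unknown to Wikidata
                    yield works[work_id], pid, person_qid, person_id
```

Being a generator, the function produces statements lazily, so callers can stream them to a file or an upload step without holding the full result in memory.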