validator
Sync Wikidata to target catalogs and enrich items when extra data is available.
checks
A set of checks to validate Wikidata against target catalogs.
soweego.validator.checks.bio(catalog, entity, wd_cache=None)
Validate identifiers against available biographical data.
Look for:
- birth and death dates
- birth and death places
- gender
Also generate statements based on additional data found in the given catalog. They can be used to enrich Wikidata items.
How it works:
1. gather data from the given catalog
2. gather data from relevant Wikidata items
3. look for shared data between pairs of Wikidata and catalog items:
   - when the pair does not share any data, the catalog identifier should be marked with a deprecated rank
   - when the catalog item has more data than the Wikidata one, it should be added to the latter
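The comparison step above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not soweego's implementation: the function name `compare_bio_data`, the dict-of-sets data layout, and the choice to skip pairs where either side has no data are all hypothetical.

```python
from collections import defaultdict


def compare_bio_data(wikidata, catalog, pairs):
    """Hypothetical helper: compare biographical data of linked pairs.

    wikidata: {QID: set of (property, value) tuples}
    catalog:  {catalog_id: set of (property, value) tuples}
    pairs:    iterable of (QID, catalog_id) links found in Wikidata
    """
    to_deprecate = defaultdict(set)  # QID -> catalog IDs sharing no data
    to_add = defaultdict(set)        # QID -> extra (property, value) tuples

    for qid, catalog_id in pairs:
        wd_data = wikidata.get(qid, set())
        catalog_data = catalog.get(catalog_id, set())

        # Assumption: with no data on one side there is nothing to compare
        if not wd_data or not catalog_data:
            continue

        if not wd_data & catalog_data:
            # No shared data: the identifier is a deprecation candidate
            to_deprecate[qid].add(catalog_id)
            continue

        # Shared data found: collect what only the catalog knows
        extra = catalog_data - wd_data
        if extra:
            to_add[qid].update(extra)

    return to_deprecate, to_add
```

Set intersection and difference keep the two outcomes separate: an empty intersection flags the identifier, while a non-empty difference supplies candidate statements.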
soweego.validator.checks.dead_ids(catalog, entity, wd_cache=None)
Look for dead identifiers in Wikidata. An identifier is dead if it does not exist in the given catalog when this function is executed.
Dead identifiers should be marked with a deprecated rank in Wikidata.
How it works:
1. gather identifiers of the given catalog from relevant Wikidata items
2. look them up in the given catalog
3. if an identifier is not in the given catalog anymore, it should be deprecated
- Return type: Tuple[Defaultdict[~KT, ~VT], Dict[~KT, ~VT]]
- Returns: the dict pair of dead identifiers and identifiers gathered from Wikidata
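The lookup described above amounts to a set membership test. The sketch below assumes the Wikidata identifiers arrive as a `{QID: set of catalog IDs}` mapping and that the catalog can be represented as a set of live identifiers; the function name `find_dead_ids` and both data shapes are hypothetical and do not reflect how soweego queries its catalogs.

```python
from collections import defaultdict


def find_dead_ids(wd_ids, live_catalog_ids):
    """Hypothetical helper: flag identifiers missing from the catalog.

    wd_ids:           {QID: set of catalog IDs stated in Wikidata}
    live_catalog_ids: set of IDs that currently exist in the catalog
    """
    dead = defaultdict(set)  # dead catalog ID -> QIDs that hold it

    for qid, catalog_ids in wd_ids.items():
        for catalog_id in catalog_ids:
            if catalog_id not in live_catalog_ids:
                dead[catalog_id].add(qid)

    # Mirror the documented return shape: a defaultdict and a dict
    return dead, wd_ids
```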
soweego.validator.checks.links(catalog, entity, wd_cache=None)
Validate identifiers against available links.
Also generate statements based on additional links found in the given catalog. They can be used to enrich Wikidata items.
How it works:
1. gather links from the given catalog
2. gather links from relevant Wikidata items
3. look for shared links between pairs of Wikidata and catalog items:
   - when the pair does not share any link, the catalog identifier should be marked with a deprecated rank
   - when the catalog item has more links than the Wikidata one, they should be added to the latter
4. try to extract third-party identifiers from extra links
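As a rough illustration of the last two steps, the sketch below compares the link sets of a single Wikidata/catalog pair and then tries to turn extra links into third-party identifiers with regular expressions. The function name `check_links`, the `URL_PATTERNS` table, and the `example.org` URLs are all invented for this example; how soweego actually recognizes identifiers in URLs is not described here.

```python
import re

# Hypothetical patterns: third-party catalog name -> regex whose single
# capturing group is the identifier embedded in the URL
URL_PATTERNS = {
    'example_catalog': re.compile(
        r'^https?://catalog\.example\.org/person/(\w+)$'
    ),
}


def check_links(wd_links, catalog_links):
    """Hypothetical helper: validate one pair through its links.

    Returns (should_deprecate, extra_links, third_party_ids).
    """
    if not wd_links & catalog_links:
        # No shared link: the identifier is a deprecation candidate
        return True, set(), {}

    extra_links = set()
    third_party_ids = {}
    for url in catalog_links - wd_links:
        for name, pattern in URL_PATTERNS.items():
            match = pattern.match(url)
            if match:
                third_party_ids[name] = match.group(1)
                break
        else:
            # No pattern matched: keep the URL as a plain extra link
            extra_links.add(url)

    return False, extra_links, third_party_ids
```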
enrichment
Enrichment of Wikidata based on data available in target catalogs.
soweego.validator.enrichment.generate_statements(catalog, entity, bucket_size=5000)
Generate statements about works by people.
How it works:
1. gather works and people identifiers of the given catalog from relevant Wikidata items
2. leverage catalog relationships between works and people
3. build Wikidata statements accordingly
- Parameters
  catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
  entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
  bucket_size (int) – (optional) how many target IDs should be looked up in the given catalog, for efficiency purposes
- Return type: Iterator[Tuple]
- Returns: the statements generator, yielding (work_QID, PID, person_QID, person_catalog_ID) tuples
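The bucketed lookup and the yielded tuple shape can be sketched with a plain generator. Everything below is an assumption made for illustration: the helper names, the in-memory `relationships` dict standing in for a catalog query, and the `'P_EXAMPLE'` placeholder, which is not a real Wikidata property.

```python
from itertools import islice


def buckets(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        bucket = list(islice(iterator, size))
        if not bucket:
            return
        yield bucket


def sketch_statements(people, works, relationships, pid, bucket_size=5000):
    """Hypothetical stand-in for the statement generator.

    people:        {person_catalog_ID: person_QID}
    works:         {work_catalog_ID: work_QID}
    relationships: {person_catalog_ID: [work_catalog_ID, ...]},
                   standing in for a lookup against the catalog
    """
    for bucket in buckets(people, bucket_size):
        # A real implementation would query the catalog once per bucket
        for person_id in bucket:
            for work_id in relationships.get(person_id, ()):
                work_qid = works.get(work_id)
                if work_qid is None:
                    continue  # the work has no Wikidata item
                yield work_qid, pid, people[person_id], person_id
```

Because the function is a generator, statements are produced lazily, so a caller can stream them without holding the full result in memory.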