wikidata
Collect data from Wikidata through the SPARQL endpoint and the Web API.
api_requests
Set of specific Web API requests for Wikidata data collection.
- soweego.wikidata.api_requests.build_session()
Build the HTTP session for interaction with the Wikidata API.
Log in if credentials are found, otherwise go ahead with an unauthenticated session. If a previously cached session has expired, build a new one.
- Return type
- Returns
the HTTP session to interact with the Wikidata API
- soweego.wikidata.api_requests.get_biodata(qids)
Collect biographical data for a given set of Wikidata items.
- soweego.wikidata.api_requests.get_data_for_linker(catalog, entity, qids, url_pids, ext_id_pids_to_urls, qids_and_tids, fileout)
Collect relevant data for linking Wikidata to a given catalog. Dump the result to a given output stream.
This function uses multithreaded parallel processing.
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: (id_regex, url_regex,)}} dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
fileout (TextIO) – a file stream open for writing
qids_and_tids (Dict) – a {QID: {'tid': {catalog_ID_set}}} dict. Populated by soweego.commons.data_gathering.gather_target_ids()
- Return type
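The multithreaded collect-and-dump pattern described above could look roughly like the following. This is a minimal sketch, not the soweego implementation: `make_buckets`, `fetch_bucket`, and `dump_parallel` are hypothetical names, the bucket size of 50 and the JSON-lines output are assumptions, and `fetch_bucket` is a stand-in that performs no network request.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from io import StringIO
from typing import Dict, Iterable, List, TextIO


def make_buckets(qids: Iterable[str], size: int = 50) -> List[List[str]]:
    """Split QIDs into fixed-size buckets, one per (hypothetical) API request."""
    ordered = sorted(qids)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def fetch_bucket(bucket: List[str]) -> List[Dict]:
    """Stand-in for a real Web API call: return one record per QID."""
    return [{'qid': qid} for qid in bucket]


def dump_parallel(qids: Iterable[str], fileout: TextIO, workers: int = 4) -> int:
    """Fetch buckets in worker threads and write one JSON object per line."""
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves bucket order; only the calling thread writes,
        # so the output stream needs no lock.
        for records in pool.map(fetch_bucket, make_buckets(qids)):
            for record in records:
                fileout.write(json.dumps(record) + '\n')
                written += 1
    return written


if __name__ == '__main__':
    stream = StringIO()
    count = dump_parallel({'Q42', 'Q5', 'Q1'}, stream)
    print(count, stream.getvalue(), sep='\n')
```

Keeping all writes in the calling thread is one simple way to avoid interleaved lines in the output stream; the worker threads only do the I/O-bound fetching.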
- soweego.wikidata.api_requests.get_links(qids, url_pids, ext_id_pids_to_urls)
Collect sitelinks and third-party links for a given set of Wikidata items.
- Parameters
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: (id_regex, url_regex,)}} dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
- Return type
- Returns
the generator yielding (QID, URL) pairs
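To illustrate how the two arguments might combine into (QID, URL) pairs, here is a minimal sketch under stated assumptions. It works on an in-memory `claims_by_qid` dict instead of calling the Web API, assumes formatter URLs use the `$1` placeholder for the identifier, and ignores `url_regex`. The function name `iter_links` and the `example.org` formatter URL are hypothetical.

```python
import re
from typing import Dict, Iterator, List, Set, Tuple

# Same shape as ext_id_pids_to_urls: {PID: {formatter_URL: (id_regex, url_regex)}}
FormatterMap = Dict[str, Dict[str, Tuple[str, str]]]


def iter_links(
    claims_by_qid: Dict[str, Dict[str, List[str]]],
    url_pids: Set[str],
    ext_id_pids_to_urls: FormatterMap,
) -> Iterator[Tuple[str, str]]:
    """Yield (QID, URL) pairs from URL-valued and identifier-valued claims."""
    for qid, claims in claims_by_qid.items():
        for pid, values in claims.items():
            if pid in url_pids:
                # The claim value is already a URL
                for url in values:
                    yield qid, url
            elif pid in ext_id_pids_to_urls:
                formatters = ext_id_pids_to_urls[pid]
                for formatter_url, (id_regex, _url_regex) in formatters.items():
                    for ext_id in values:
                        # Skip identifiers that do not match the expected pattern
                        if id_regex and not re.fullmatch(id_regex, ext_id):
                            continue
                        # Assumption: '$1' marks where the identifier goes
                        yield qid, formatter_url.replace('$1', ext_id)


if __name__ == '__main__':
    claims = {'Q42': {'P856': ['https://douglasadams.com'], 'P345': ['nm0010930']}}
    formatters = {'P345': {'https://example.org/name/$1/': (r'nm\d{7}', '')}}
    for pair in iter_links(claims, {'P856'}, formatters):
        print(pair)
```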
- soweego.wikidata.api_requests.get_url_blacklist()
Retrieve a blacklist with URL domains of low-quality sources.
- soweego.wikidata.api_requests.parse_value(value)
Parse a value returned by the Wikidata API into standard Python objects.
The parser supports the following Wikidata data types:
string > str
URL > str
monolingual text > str
time > tuple (time, precision)
item > set {item_labels}
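The type mapping above can be sketched as a small dispatcher. This is an illustration only, not the actual `parse_value` code: `parse_datavalue` is a hypothetical name, the input shapes assume the JSON values returned by the Wikidata Web API, and item labels come from a caller-supplied `item_labels` dict rather than from a label lookup.

```python
from typing import Dict, Optional, Set, Tuple, Union

Parsed = Union[str, Tuple[str, int], Set[str], None]


def parse_datavalue(value, item_labels: Optional[Dict[str, Set[str]]] = None) -> Parsed:
    """Map a Wikidata API value to a plain Python object, or None if unsupported."""
    # string and URL values arrive as plain strings
    if isinstance(value, str):
        return value
    if not isinstance(value, dict):
        return None
    # monolingual text: {'text': ..., 'language': ...}
    if 'text' in value:
        return value['text']
    # time: {'time': ..., 'precision': ..., ...}
    if 'time' in value and 'precision' in value:
        return value['time'], value['precision']
    # item: {'entity-type': 'item', 'id': <QID>, ...}
    # Assumption: labels are looked up in the supplied dict
    if value.get('entity-type') == 'item':
        return set((item_labels or {}).get(value['id'], set()))
    return None


if __name__ == '__main__':
    print(parse_datavalue({'text': 'Douglas Adams', 'language': 'en'}))
    print(parse_datavalue({'time': '+1952-03-11T00:00:00Z', 'precision': 11}))
    print(parse_datavalue({'entity-type': 'item', 'id': 'Q5'}, {'Q5': {'human'}}))
```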
sparql_queries
Set of specific SPARQL queries for Wikidata data collection.
- soweego.wikidata.sparql_queries.external_id_pids_and_urls()
Retrieve Wikidata properties holding identifier values, together with their formatter URLs and regular expressions.
- soweego.wikidata.sparql_queries.random() -> x in the interval [0, 1).
- soweego.wikidata.sparql_queries.run_query(query_type, class_qid, catalog_pid, result_per_page)
Run a filled SPARQL query template against the Wikidata endpoint, paging through the results when needed.
- Parameters
- Return type
- Returns
the query result generator, yielding (QID, identifier_or_URL) pairs, or QID strings only, depending on query_type
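Paging over a filled query template might be sketched as follows. This is an assumption-laden illustration rather than the actual `run_query` logic: it pages with `LIMIT`/`OFFSET`, the template `QUERY_TEMPLATE` and the helper `run_paged` are hypothetical, and `fetch_page` stands for any callable that sends the query to the endpoint and returns one page of results as a list.

```python
from typing import Callable, Iterator, List

# Hypothetical template: instances of a class, in a stable order for paging
QUERY_TEMPLATE = (
    'SELECT ?item WHERE { ?item wdt:P31 wd:%s } '
    'ORDER BY ?item LIMIT %d OFFSET %d'
)


def run_paged(
    class_qid: str,
    result_per_page: int,
    fetch_page: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield results page by page until the endpoint returns a short page."""
    offset = 0
    while True:
        query = QUERY_TEMPLATE % (class_qid, result_per_page, offset)
        page = fetch_page(query)
        yield from page
        # A short (or empty) page means the result set is exhausted
        if len(page) < result_per_page:
            break
        offset += result_per_page


if __name__ == '__main__':
    data = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']

    def fake_fetch(query: str) -> List[str]:
        # Simulate the endpoint by slicing an in-memory list
        tokens = query.split()
        limit = int(tokens[tokens.index('LIMIT') + 1])
        start = int(tokens[tokens.index('OFFSET') + 1])
        return data[start:start + limit]

    print(list(run_paged('Q5', 2, fake_fetch)))
```

Because the function is a generator, callers can start consuming results before later pages are requested.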
- soweego.wikidata.sparql_queries.subclasses_of(qid)
Retrieve subclasses of a given Wikidata ontology class.
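A subclass lookup of this kind is typically expressed in SPARQL through the "subclass of" property (P279). The sketch below only builds a query string and sends nothing; whether soweego follows the subclass chain transitively is an assumption here, and `subclasses_query` is a hypothetical helper, not part of the package.

```python
import re


def subclasses_query(qid: str) -> str:
    """Build a SPARQL query for the (transitive) subclasses of a class QID."""
    # Validate the QID so that arbitrary text cannot be injected into the query
    if not re.fullmatch(r'Q[1-9]\d*', qid):
        raise ValueError(f'Not a valid QID: {qid!r}')
    # wdt:P279* follows zero or more "subclass of" hops,
    # so the given class itself is also part of the result
    return f'SELECT DISTINCT ?subclass WHERE {{ ?subclass wdt:P279* wd:{qid} . }}'


if __name__ == '__main__':
    print(subclasses_query('Q5'))
```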