wikidata
Collect data from Wikidata through the SPARQL endpoint and the Web API.
api_requests
Set of specific Web API requests for Wikidata data collection.
- soweego.wikidata.api_requests.build_session()
Build the HTTP session for interaction with the Wikidata API.
Log in if credentials are found, otherwise go ahead with an unauthenticated session. If a previously cached session has expired, build a new one.
- Return type
- Returns
the HTTP session to interact with the Wikidata API
- soweego.wikidata.api_requests.get_biodata(qids)
Collect biographical data for a given set of Wikidata items.
- soweego.wikidata.api_requests.get_data_for_linker(catalog, entity, qids, url_pids, ext_id_pids_to_urls, qids_and_tids, fileout)
Collect relevant data for linking Wikidata to a given catalog. Dump the result to a given output stream.
This function uses multithreaded parallel processing.
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: (id_regex, url_regex,)} } dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
fileout (TextIO) – a file stream open for writing
qids_and_tids (Dict) – a {QID: {'tid': {catalog_ID_set}}} dict. Populated by soweego.commons.data_gathering.gather_target_ids()
- Return type
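As a rough illustration of the multithreaded collection mentioned for get_data_for_linker, the sketch below splits QIDs into buckets, fetches each bucket in a thread pool, and writes one JSON object per line to the output stream. The names make_buckets, collect, and fake_fetch, the bucket size, and the JSON Lines output are assumptions made for this example, not soweego's actual implementation.

```python
import io
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List, TextIO


def make_buckets(qids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split QIDs into lists of at most `size` items."""
    bucket: List[str] = []
    for qid in qids:
        bucket.append(qid)
        if len(bucket) == size:
            yield bucket
            bucket = []
    if bucket:
        yield bucket


def collect(
    qids: Iterable[str],
    fetch_bucket: Callable[[List[str]], List[Dict]],
    fileout: TextIO,
    bucket_size: int = 50,
    workers: int = 4,
) -> int:
    """Fetch buckets in parallel threads and dump one JSON object per line.

    `fetch_bucket` is a hypothetical callable standing in for an API request.
    Returns the number of records written.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the fetches concurrently and returns results in input
        # order; only the calling thread writes to the stream.
        for records in pool.map(fetch_bucket, make_buckets(qids, bucket_size)):
            for record in records:
                fileout.write(json.dumps(record, ensure_ascii=False) + '\n')
                written += 1
    return written


def fake_fetch(bucket: List[str]) -> List[Dict]:
    """Stand-in for a real Web API call (assumption for this example)."""
    return [{'qid': qid, 'name': [f'label of {qid}']} for qid in bucket]


out = io.StringIO()
count = collect(['Q1', 'Q2', 'Q3'], fake_fetch, out, bucket_size=2)
lines = out.getvalue().splitlines()
```

Keeping all writes in the calling thread avoids interleaved lines in the output stream without needing a lock.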
- soweego.wikidata.api_requests.get_links(qids, url_pids, ext_id_pids_to_urls)
Collect sitelinks and third-party links for a given set of Wikidata items.
- Parameters
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: (id_regex, url_regex,)} } dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
- Return type
- Returns
the generator yielding (QID, URL) pairs
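To make the shape of ext_id_pids_to_urls more concrete, here is a minimal sketch of how an identifier value could be turned into a third-party URL. It assumes that formatter URLs use Wikidata's `$1` placeholder for the identifier and that id_regex is meant to validate the identifier; the function name urls_from_identifier and the example mapping are hypothetical and are not part of soweego.

```python
import re
from typing import Dict, Iterator, Optional, Tuple

# Same shape as documented above: {PID: {formatter_URL: (id_regex, url_regex)}}
ExtIdMap = Dict[str, Dict[str, Tuple[Optional[str], Optional[str]]]]


def urls_from_identifier(
    pid: str, identifier: str, ext_id_pids_to_urls: ExtIdMap
) -> Iterator[str]:
    """Yield one URL per formatter URL known for the given PID."""
    formatters = ext_id_pids_to_urls.get(pid, {})
    for formatter_url, (id_regex, _url_regex) in formatters.items():
        # Skip identifiers that do not match the expected pattern, if any
        if id_regex and not re.fullmatch(id_regex, identifier):
            continue
        # Assumption: '$1' marks where the identifier goes
        yield formatter_url.replace('$1', identifier)


# Illustrative values only
example_map: ExtIdMap = {
    'P434': {'https://musicbrainz.org/artist/$1': (r'[0-9a-f-]{36}', None)},
}
mbid = '5b11f4ce-a62d-471e-81fc-a69a8278c7da'
links = list(urls_from_identifier('P434', mbid, example_map))
```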
- soweego.wikidata.api_requests.get_url_blacklist()
Retrieve a blacklist with URL domains of low-quality sources.
- soweego.wikidata.api_requests.parse_value(value)
Parse a value returned by the Wikidata API into standard Python objects.
The parser supports the following Wikidata data types:
string > str
URL > str
monolingual text > str
time > tuple (time, precision)
item > set {item_labels}
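The mapping above can be sketched as a small dispatcher over the value shapes used in Wikidata's JSON output. This is an illustration only: the function name parse_value_sketch is hypothetical, and item labels are looked up in a caller-supplied labels dict, which is an assumption standing in for however labels are actually resolved.

```python
from typing import Dict, Set, Tuple, Union

Parsed = Union[str, Tuple[str, int], Set[str], None]


def parse_value_sketch(value, labels: Dict[str, Set[str]]) -> Parsed:
    """Convert a Wikidata JSON value into a plain Python object."""
    # string and URL values are already plain strings
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        # monolingual text: {'text': ..., 'language': ...}
        if 'text' in value:
            return value['text']
        # time: {'time': ..., 'precision': ..., ...}
        if 'time' in value:
            return value['time'], value['precision']
        # item: {'entity-type': 'item', 'id': 'Q...', ...}
        if 'id' in value:
            return labels.get(value['id'], set())
    # Unsupported data type
    return None


known_labels = {'Q5': {'human', 'person'}}  # hypothetical label lookup
parsed_text = parse_value_sketch({'text': 'Douglas Adams', 'language': 'en'}, known_labels)
parsed_time = parse_value_sketch({'time': '+1952-03-11T00:00:00Z', 'precision': 11}, known_labels)
parsed_item = parse_value_sketch({'entity-type': 'item', 'id': 'Q5'}, known_labels)
```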
sparql_queries
Set of specific SPARQL queries for Wikidata data collection.
- soweego.wikidata.sparql_queries.external_id_pids_and_urls()
Retrieve Wikidata properties holding identifier values, together with their formatter URLs and regular expressions.
- soweego.wikidata.sparql_queries.random()
Return x in the interval [0, 1).
- soweego.wikidata.sparql_queries.run_query(query_type, class_qid, catalog_pid, result_per_page)
Run a filled SPARQL query template against the Wikidata endpoint, with optional paging.
- Parameters
- Return type
- Returns
the query result generator, yielding (QID, identifier_or_URL) pairs, or QID strings only, depending on query_type
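One common way to page through SPARQL results is a LIMIT/OFFSET loop wrapped in a generator. The sketch below shows that pattern under stated assumptions: paged_results and run_page are hypothetical names, the query text is illustrative, and the HTTP call is replaced by a fake runner, so this is not necessarily how run_query is implemented.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[str, str]


def paged_results(
    query_body: str,
    run_page: Callable[[str], List[Row]],
    result_per_page: int,
) -> Iterator[Row]:
    """Yield rows page by page until a page comes back short."""
    offset = 0
    while True:
        query = f'{query_body} LIMIT {result_per_page} OFFSET {offset}'
        rows = run_page(query)
        yield from rows
        # A page shorter than the page size means there is nothing left
        if len(rows) < result_per_page:
            return
        offset += result_per_page


# Fake endpoint for the example: slices a fixed result list
ALL_ROWS: List[Row] = [('Q1', 'a'), ('Q2', 'b'), ('Q3', 'c'), ('Q4', 'd'), ('Q5', 'e')]
issued: List[str] = []


def fake_run_page(query: str) -> List[Row]:
    """Stand-in for an HTTP request to the SPARQL endpoint (assumption)."""
    issued.append(query)
    tokens = query.split()
    limit = int(tokens[tokens.index('LIMIT') + 1])
    offset = int(tokens[tokens.index('OFFSET') + 1])
    return ALL_ROWS[offset:offset + limit]


results = list(
    paged_results('SELECT ?item ?id WHERE { ?item wdt:P434 ?id }', fake_run_page, 2)
)
```

Because the function is a generator, callers can start consuming pairs before later pages have been requested.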
- soweego.wikidata.sparql_queries.subclasses_of(qid)
Retrieve subclasses of a given Wikidata ontology class.