wikidata
Collect data from Wikidata through the SPARQL endpoint and the Web API.
api_requests
Set of specific Web API requests for Wikidata data collection.
-
soweego.wikidata.api_requests.build_session()
Build the HTTP session for interaction with the Wikidata API.
Log in if credentials are found, otherwise go ahead with an unauthenticated session. If a previously cached session has expired, build a new one.
- Returns
the HTTP session to interact with the Wikidata API
-
soweego.wikidata.api_requests.get_biodata(qids)
Collect biographical data for a given set of Wikidata items.
-
soweego.wikidata.api_requests.get_data_for_linker(catalog, entity, qids, url_pids, ext_id_pids_to_urls, qids_and_tids, fileout)
Collect relevant data for linking Wikidata to a given catalog. Dump the result to a given output stream.
This function uses multithreaded parallel processing.
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict[~KT, ~VT]) – a {PID: {formatter_URL: formatter_regex} } dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
fileout (TextIO) – a file stream open for writing
qids_and_tids (Dict[~KT, ~VT]) – a {QID: {'tid': {catalog_ID_set} } } dict. Populated by soweego.commons.data_gathering.gather_target_ids()
- Return type
None
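The collect-in-parallel, dump-to-stream pattern described for get_data_for_linker can be illustrated with a minimal sketch. It is not soweego's implementation: fetch_bucket, dump_buckets, the bucket layout and the one-JSON-object-per-line output are all assumptions made for illustration, and the fetch function only fakes a Web API call.

```python
import io
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, TextIO


def fetch_bucket(qids: List[str]) -> Dict[str, dict]:
    """Hypothetical stand-in for one Web API call over a bucket of QIDs."""
    return {qid: {'qid': qid} for qid in qids}


def dump_buckets(buckets: Iterable[List[str]], fileout: TextIO, workers: int = 4) -> int:
    """Process buckets in parallel threads and write one JSON object per line."""
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the calls concurrently, but yields results in input order
        for result in pool.map(fetch_bucket, buckets):
            for item in result.values():
                fileout.write(json.dumps(item) + '\n')
                written += 1
    return written


# Usage: an in-memory stream stands in for a file open for writing
stream = io.StringIO()
total = dump_buckets([['Q1', 'Q2'], ['Q3']], stream)
```

Only the main thread writes to the stream in this sketch, so no lock is needed around fileout.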
-
soweego.wikidata.api_requests.get_links(qids, url_pids, ext_id_pids_to_urls)
Collect sitelinks and third-party links for a given set of Wikidata items.
- Parameters
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict[~KT, ~VT]) – a {PID: {formatter_URL: formatter_regex} } dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
- Return type
Iterator[Tuple]
- Returns
the generator yielding (QID, URL) pairs
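To show how a {PID: {formatter_URL: formatter_regex} } dict might turn external identifiers into (QID, URL) pairs, here is a minimal sketch. It assumes the Wikidata convention of a $1 placeholder in formatter URLs. The build_links helper, the claims layout and the sample mapping entry are hypothetical and do not come from soweego.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Illustrative mapping shaped like {PID: {formatter_URL: formatter_regex}}.
# The entry is a sample value for this sketch, not real query output.
EXT_ID_PIDS_TO_URLS: Dict[str, Dict[str, str]] = {
    'P1953': {'https://www.discogs.com/artist/$1': r'[1-9][0-9]*'},
}


def build_links(
    qid: str,
    claims: Iterable[Tuple[str, str]],
    ext_id_pids_to_urls: Dict[str, Dict[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (QID, URL) pairs for identifiers that match their formatter regex."""
    for pid, identifier in claims:
        for formatter_url, regex in ext_id_pids_to_urls.get(pid, {}).items():
            # Skip identifiers that do not fully match the expected pattern
            if regex and not re.fullmatch(regex, identifier):
                continue
            yield qid, formatter_url.replace('$1', identifier)


# Usage with made-up claims: the second identifier fails the regex check
links = list(
    build_links('Q1', [('P1953', '12345'), ('P1953', 'oops')], EXT_ID_PIDS_TO_URLS)
)
```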
-
soweego.wikidata.api_requests.parse_value(value)
Parse a value returned by the Wikidata API into standard Python objects.
The parser supports the following Wikidata data types:
string > str
URL > str
monolingual text > str
time > tuple (time, precision)
item > set {item_labels}
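The type mapping above can be sketched against the JSON datavalue objects that the Wikidata API returns. This is an illustration under assumptions, not the parse_value source: parse_datavalue is a hypothetical name, and the labels argument is a hypothetical pre-fetched {QID: {labels} } lookup standing in for however item labels are actually resolved.

```python
from typing import Dict, Set, Tuple, Union

Parsed = Union[str, Tuple[str, int], Set[str], None]


def parse_datavalue(datavalue: dict, labels: Dict[str, Set[str]]) -> Parsed:
    """Convert one Wikidata JSON datavalue dict into a plain Python object."""
    kind, value = datavalue.get('type'), datavalue.get('value')
    if kind == 'string':
        # Both the string and the URL data types arrive as plain strings
        return value
    if kind == 'monolingualtext':
        return value['text']
    if kind == 'time':
        return value['time'], value['precision']
    if kind == 'wikibase-entityid':
        # labels is an assumed {QID: {labels}} lookup prepared by the caller
        return labels.get(value['id'], set())
    return None  # data type not covered by this sketch


# Usage with hand-written sample values
sample_time = {'type': 'time', 'value': {'time': '+1952-03-11T00:00:00Z', 'precision': 11}}
parsed_time = parse_datavalue(sample_time, {})
```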
sparql_queries
Set of specific SPARQL queries for Wikidata data collection.
-
soweego.wikidata.sparql_queries.external_id_pids_and_urls()
Retrieve Wikidata properties holding identifier values, together with their formatter URLs and regular expressions.
-
soweego.wikidata.sparql_queries.random() → x in the interval [0, 1).
-
soweego.wikidata.sparql_queries.run_query(query_type, class_qid, catalog_pid, result_per_page)
Run a filled SPARQL query template against the Wikidata endpoint, paging through the results when needed.
- Returns
the query result generator, yielding (QID, identifier_or_URL) pairs, or QID strings only, depending on query_type
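Paging over a query can be pictured as a generator that requests one page at a time. The sketch below assumes LIMIT/OFFSET paging, which this page does not confirm for run_query. The names paged_results and fake_run and the query template are hypothetical, and fake_run slices a local list instead of contacting the endpoint.

```python
from typing import Callable, Iterator, List

ROWS = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']  # made-up result set
TEMPLATE = 'SELECT ?item WHERE { ?item wdt:P1953 ?id }'  # illustrative query


def fake_run(query: str) -> List[str]:
    """Hypothetical endpoint stub: honour LIMIT and OFFSET on the local rows."""
    tokens = query.split()
    limit = int(tokens[tokens.index('LIMIT') + 1])
    offset = int(tokens[tokens.index('OFFSET') + 1])
    return ROWS[offset:offset + limit]


def paged_results(
    template: str, run: Callable[[str], List[str]], result_per_page: int
) -> Iterator[str]:
    """Yield results page by page until a page comes back short."""
    offset = 0
    while True:
        page = run(f'{template} LIMIT {result_per_page} OFFSET {offset}')
        yield from page
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < result_per_page:
            break
        offset += result_per_page


results = list(paged_results(TEMPLATE, fake_run, 2))
```

Against a real endpoint, offset-based paging is only reliable when the query imposes a stable ordering on its results.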
-
soweego.wikidata.sparql_queries.subclasses_of(qid)
Retrieve subclasses of a given Wikidata ontology class.