wikidata
Collect data from Wikidata through the SPARQL endpoint and the Web API.
api_requests
Set of specific Web API requests for Wikidata data collection.
soweego.wikidata.api_requests.build_session()
Build the HTTP session for interaction with the Wikidata API.
Log in if credentials are found, otherwise go ahead with an unauthenticated session. If a previously cached session has expired, build a new one.
- Returns
the HTTP session to interact with the Wikidata API
soweego.wikidata.api_requests.get_biodata(qids)
Collect biographical data for a given set of Wikidata items.
soweego.wikidata.api_requests.get_data_for_linker(catalog, entity, qids, url_pids, ext_id_pids_to_urls, qids_and_tids, fileout)
Collect relevant data for linking Wikidata to a given catalog. Dump the result to a given output stream.
This function uses multithreaded parallel processing.
- Parameters
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: formatter_regex}} dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
fileout (TextIO) – a file stream open for writing
qids_and_tids (Dict) – a {QID: {'tid': {catalog_ID_set}}} dict. Populated by soweego.commons.data_gathering.gather_target_ids()
- Return type
None
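The multithreaded collection and dump described above can be sketched as follows. This is a minimal illustration under stated assumptions, not soweego's actual implementation: `fetch_bucket` and `fake_fetch` are hypothetical stand-ins for a real Web API call, the bucket size of 50 is an assumed per-request cap on the number of QIDs, and the JSON Lines output format is also an assumption.

```python
import io
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List, TextIO

BUCKET_SIZE = 50  # assumption: maximum number of QIDs sent per request


def make_buckets(qids: Iterable, size: int = BUCKET_SIZE) -> Iterator[List]:
    """Split an iterable of QIDs into lists of at most `size` items."""
    bucket: List = []
    for qid in qids:
        bucket.append(qid)
        if len(bucket) == size:
            yield bucket
            bucket = []
    if bucket:
        yield bucket


def collect_and_dump(
    qids: Iterable[str],
    fetch_bucket: Callable[[List[str]], Dict[str, dict]],
    fileout: TextIO,
    workers: int = 4,
) -> None:
    """Fetch buckets in parallel threads and write one JSON object per line."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in bucket order; only this thread writes to fileout
        for result in pool.map(fetch_bucket, make_buckets(qids)):
            for qid, data in result.items():
                fileout.write(json.dumps({'qid': qid, **data}) + '\n')


def fake_fetch(bucket: List[str]) -> Dict[str, dict]:
    """Hypothetical stand-in for a Web API request: returns made-up data."""
    return {qid: {'name': [f'label of {qid}']} for qid in bucket}


out = io.StringIO()
collect_and_dump([f'Q{i}' for i in range(1, 121)], fake_fetch, out)
```

Keeping all writes in the calling thread avoids interleaved lines in the output stream, at the cost of buffering each bucket's result until its turn comes.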
soweego.wikidata.api_requests.get_links(qids, url_pids, ext_id_pids_to_urls)
Collect sitelinks and third-party links for a given set of Wikidata items.
- Parameters
url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()
ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: formatter_regex}} dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()
- Return type
Iterator[Tuple]
- Returns
the generator yielding (QID, URL) pairs
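As an illustration of how the two parameters could be combined into (QID, URL) pairs, here is a minimal sketch. It assumes a simplified, hypothetical `claims` mapping of `{PID: [values]}` rather than the real API response, and it relies on the Wikidata convention that formatter URLs carry a `$1` placeholder for the identifier. The function name, the QID, the external ID property and all values are made up for the example.

```python
import re
from typing import Dict, Iterator, List, Optional, Set, Tuple


def links_from_claims(
    qid: str,
    claims: Dict[str, List[str]],
    url_pids: Set[str],
    ext_id_pids_to_urls: Dict[str, Dict[str, Optional[str]]],
) -> Iterator[Tuple[str, str]]:
    """Yield (QID, URL) pairs from a simplified {PID: [values]} mapping."""
    for pid, values in claims.items():
        if pid in url_pids:
            # The statement value is already a full URL
            for value in values:
                yield qid, value
        elif pid in ext_id_pids_to_urls:
            for value in values:
                for formatter_url, formatter_regex in ext_id_pids_to_urls[pid].items():
                    # Skip identifiers that do not match the expected format
                    if formatter_regex and not re.fullmatch(formatter_regex, value):
                        continue
                    yield qid, formatter_url.replace('$1', value)


# Made-up input: 'P9999999' is a placeholder property, not a real one
example_claims = {
    'P856': ['https://example.com'],
    'P9999999': ['123', 'abc'],
    'P31': ['Q5'],
}
example_links = list(
    links_from_claims(
        'Q0',
        example_claims,
        url_pids={'P856'},
        ext_id_pids_to_urls={'P9999999': {'https://example.org/id/$1': r'\d+'}},
    )
)
```

In this sketch, values that fail the formatter regex are dropped silently; whether to drop, log, or keep them is a design choice the source does not specify.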
soweego.wikidata.api_requests.parse_value(value)
Parse a value returned by the Wikidata API into standard Python objects.
The parser supports the following Wikidata data types:
string > str
URL > str
monolingual text > str
time > tuple (time, precision)
item > set {item_labels}
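The mapping above can be sketched as follows, assuming the input has the `{'type': ..., 'value': ...}` shape of a `datavalue` object in Wikidata's JSON output. The function name `parse_datavalue` and the `labels` lookup are hypothetical: a real parser would need to fetch item labels, which this sketch replaces with a plain dict.

```python
from typing import Any, Dict, Set


def parse_datavalue(datavalue: Dict[str, Any], labels: Dict[str, Set[str]]) -> Any:
    """Convert a Wikidata JSON datavalue into a plain Python object."""
    kind, value = datavalue['type'], datavalue['value']
    if kind == 'string':
        # In the JSON output, URL values also arrive with the 'string' type
        return value
    if kind == 'monolingualtext':
        return value['text']
    if kind == 'time':
        return value['time'], value['precision']
    if kind == 'wikibase-entityid':
        # Assumption: labels come from a prebuilt {QID: {labels}} lookup
        return set(labels.get(value['id'], ()))
    return None  # unsupported data type


sample_time = {
    'type': 'time',
    'value': {'time': '+1952-03-11T00:00:00Z', 'precision': 11},
}
sample_item = {'type': 'wikibase-entityid', 'value': {'id': 'Q5'}}
parsed_time = parse_datavalue(sample_time, {})
parsed_item = parse_datavalue(sample_item, {'Q5': {'human'}})
```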
sparql_queries
Set of specific SPARQL queries for Wikidata data collection.
soweego.wikidata.sparql_queries.external_id_pids_and_urls()
Retrieve Wikidata properties holding identifier values, together with their formatter URLs and regular expressions.
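To show the `{PID: {formatter_URL: formatter_regex}}` shape that the functions in api_requests expect, the sketch below folds flat result rows into that nested dict. The row format, the function name `fold_formatters`, and the sample values are assumptions made for illustration; the source does not describe how the query result is actually processed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, Optional[str]]  # (PID, formatter URL, formatter regex)


def fold_formatters(rows: Iterable[Row]) -> Dict[str, Dict[str, Optional[str]]]:
    """Group (PID, formatter_URL, formatter_regex) rows by PID."""
    result: Dict[str, Dict[str, Optional[str]]] = defaultdict(dict)
    for pid, formatter_url, formatter_regex in rows:
        # A property may have several formatter URLs: keep them all
        result[pid][formatter_url] = formatter_regex
    return dict(result)


# Made-up rows: the PIDs and URLs are placeholders
sample_rows = [
    ('P1', 'https://a.example/$1', r'\d+'),
    ('P1', 'https://b.example/$1', None),
    ('P2', 'https://c.example/$1', '[a-z]+'),
]
folded = fold_formatters(sample_rows)
```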
soweego.wikidata.sparql_queries.random() → x in the interval [0, 1).
soweego.wikidata.sparql_queries.run_query(query_type, class_qid, catalog_pid, result_per_page)
Run a filled SPARQL query template against the Wikidata endpoint, paging through the results when needed.
- Returns
the query result generator, yielding (QID, identifier_or_URL) pairs, or QID strings only, depending on query_type
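Paging through a SPARQL result set can be sketched as a generator that appends `LIMIT` and `OFFSET` clauses until a short page comes back. This is a minimal sketch, not the library's code: `paged_results`, `run_page` and `fake_run_page` are hypothetical names, and the fake runner slices an in-memory list instead of contacting the endpoint. A real query would likely also need a stable `ORDER BY` so that pages do not overlap.

```python
import re
from typing import Callable, Iterator, List


def paged_results(
    query: str, run_page: Callable[[str], List[str]], result_per_page: int
) -> Iterator[str]:
    """Yield results page by page until a page comes back short."""
    offset = 0
    while True:
        page = run_page(f'{query} LIMIT {result_per_page} OFFSET {offset}')
        yield from page
        if len(page) < result_per_page:
            break  # last page reached
        offset += result_per_page


# Hypothetical stand-in for the endpoint: serves slices of made-up QIDs
FAKE_DATA = [f'Q{i}' for i in range(1, 8)]
issued_queries: List[str] = []


def fake_run_page(query: str) -> List[str]:
    issued_queries.append(query)
    match = re.search(r'LIMIT (\d+) OFFSET (\d+)$', query)
    limit, offset = int(match.group(1)), int(match.group(2))
    return FAKE_DATA[offset:offset + limit]


all_qids = list(paged_results('SELECT ?item WHERE { ... }', fake_run_page, 3))
```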
soweego.wikidata.sparql_queries.subclasses_of(qid)
Retrieve subclasses of a given Wikidata ontology class.
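A subclass lookup of this kind is commonly written with a property path over P279 (subclass of). The sketch below only builds the query string and is an assumption about how such a query could look, not the query soweego actually runs. The helper name `subclasses_query` is hypothetical. Note that the `*` path also matches the class itself; `+` would return strict subclasses only.

```python
import re


def subclasses_query(qid: str) -> str:
    """Build a SPARQL query for the transitive subclasses of a class."""
    # Validate the QID before interpolating it into the query
    if not re.fullmatch(r'Q[1-9]\d*', qid):
        raise ValueError(f'Not a valid QID: {qid!r}')
    return f'SELECT DISTINCT ?subclass WHERE {{ ?subclass wdt:P279* wd:{qid} . }}'


human_subclasses_query = subclasses_query('Q5')
```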