wikidata

Collect data from Wikidata through the SPARQL endpoint and the Web API.

api_requests

Set of specific Web API requests for Wikidata data collection.

soweego.wikidata.api_requests.build_session[source]

Build the HTTP session for interaction with the Wikidata API.

Log in if credentials are found, otherwise go ahead with an unauthenticated session. If a previously cached session has expired, build a new one.

Return type

requests.Session

Returns

the HTTP session to interact with the Wikidata API

soweego.wikidata.api_requests.get_biodata(qids)[source]

Collect biographical data for a given set of Wikidata items.

Parameters

qids (Set[str]) – a set of QIDs

Return type

Iterator[Tuple[str, str, str]]

Returns

the generator yielding (QID, PID, value) triples
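
A caller will often want the triples grouped per item. The following is a minimal sketch of that step, not part of soweego itself: group_biodata is a hypothetical helper, and the sample triples are hand-written stand-ins for what the generator might yield.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_biodata(
    triples: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Group (QID, PID, value) triples into a {QID: {PID: {values}}} dict."""
    grouped = defaultdict(lambda: defaultdict(set))
    for qid, pid, value in triples:
        grouped[qid][pid].add(value)
    # Convert the nested defaultdicts to plain dicts before returning
    return {qid: dict(pids) for qid, pids in grouped.items()}


# Hand-written sample triples (assumption), standing in for get_biodata output
sample = [
    ('Q42', 'P21', 'male'),
    ('Q42', 'P19', 'Cambridge'),
    ('Q42', 'P106', 'writer'),
    ('Q42', 'P106', 'comedian'),
]
biodata = group_biodata(sample)
```

Since the function returns a generator, the triples are consumed once; grouping them like this keeps the result available for later lookups.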

soweego.wikidata.api_requests.get_data_for_linker(catalog, entity, qids, url_pids, ext_id_pids_to_urls, qids_and_tids, fileout)[source]

Collect relevant data for linking Wikidata to a given catalog. Dump the result to a given output stream.

This function uses multithreaded parallel processing.

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • qids (Set[str]) – a set of QIDs

  • url_pids (Set[str]) – a set of PIDs holding URL values. Returned by soweego.wikidata.sparql_queries.url_pids()

  • ext_id_pids_to_urls (Dict) – a {PID: {formatter_URL: (id_regex, url_regex,)} } dict. Returned by soweego.wikidata.sparql_queries.external_id_pids_and_urls()

  • qids_and_tids (Dict) – a {QID: {'tid': {catalog_ID_set} } } dict. Populated by soweego.commons.data_gathering.gather_target_ids()

  • fileout (TextIO) – a file stream open for writing

Return type

None
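
The multithreaded collect-and-dump pattern described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: make_buckets, fetch_bucket and dump_parallel are hypothetical names, fetch_bucket replaces the real Web API call, and the one-JSON-object-per-line output is a guess at a convenient format rather than the documented one.

```python
import io
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set, TextIO


def make_buckets(qids: Set[str], size: int) -> List[List[str]]:
    """Split the QIDs into sorted buckets of at most `size` items."""
    ordered = sorted(qids)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def fetch_bucket(bucket: List[str]) -> List[Dict]:
    """Hypothetical stand-in for one Web API request per bucket."""
    return [{'qid': qid} for qid in bucket]


def dump_parallel(
    qids: Set[str], fileout: TextIO, size: int = 2, workers: int = 4
) -> None:
    """Fetch buckets in worker threads and write one JSON object per line."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in bucket order; only this thread writes
        for items in pool.map(fetch_bucket, make_buckets(qids, size)):
            for item in items:
                fileout.write(json.dumps(item) + '\n')


out = io.StringIO()
dump_parallel({'Q1', 'Q2', 'Q3'}, out)
```

Keeping all writes in the calling thread avoids interleaved lines in the output stream without needing a lock.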

Collect sitelinks and third-party links for a given set of Wikidata items.

Parameters
Return type

Iterator[Tuple]

Returns

the generator yielding (QID, URL) pairs

soweego.wikidata.api_requests.get_url_blacklist()[source]

Retrieve a blacklist with URL domains of low-quality sources.

Return type

Optional[set]

Returns

the set of blacklisted domains, or None in case of issues with the Wikidata Web API
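
Because the return value may be None, callers need to handle both cases. The sketch below shows one way to filter URLs against such a set. drop_blacklisted is a hypothetical helper, and keeping every URL when the blacklist is unavailable is an assumed policy, not necessarily what soweego does.

```python
from typing import Iterable, List, Optional, Set
from urllib.parse import urlsplit


def drop_blacklisted(
    urls: Iterable[str], blacklist: Optional[Set[str]]
) -> List[str]:
    """Keep URLs whose host is neither a blacklisted domain nor a subdomain of one."""
    if blacklist is None:
        # Blacklist unavailable (assumed policy): keep everything
        return list(urls)
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or '').lower()
        if not any(host == d or host.endswith('.' + d) for d in blacklist):
            kept.append(url)
    return kept


# Sample input (assumption): example.com plays the low-quality domain
urls = ['https://www.example.com/x', 'https://musicbrainz.org/artist/1']
filtered = drop_blacklisted(urls, {'example.com'})
```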

soweego.wikidata.api_requests.parse_value(value)[source]

Parse a value returned by the Wikidata API into standard Python objects.

The parser supports the following Wikidata data types:

  • string > str

  • URL > str

  • monolingual text > str

  • time > tuple (time, precision)

  • item > set {item_labels}

Parameters

value (Union[str, Dict]) – a data value from a call to the Wikidata API

Return type

Union[str, Tuple[str, str], Set[str], None]

Returns

the parsed Python object, or None if parsing failed
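
To illustrate the mapping listed above, here is a minimal sketch that dispatches on the shape of a Wikidata JSON data value. It is not the library's implementation: parse_value_sketch and the LABELS lookup table are hypothetical, the real function presumably obtains item labels from Wikidata, and returning the precision as a string is an assumption based on the documented return type.

```python
from typing import Dict, Set, Tuple, Union

# Hypothetical label lookup standing in for a label request to Wikidata
LABELS: Dict[str, Set[str]] = {'Q5': {'human', 'person'}}


def parse_value_sketch(
    value: Union[str, Dict]
) -> Union[str, Tuple[str, str], Set[str], None]:
    """Map a Wikidata data value to a plain Python object, or None."""
    if isinstance(value, str):
        return value  # string and URL values are already plain strings
    if not isinstance(value, dict):
        return None
    if 'text' in value:
        return value['text']  # monolingual text: drop the language code
    if 'time' in value and 'precision' in value:
        return value['time'], str(value['precision'])
    if value.get('entity-type') == 'item':
        return LABELS.get(value.get('id'))  # None for unknown items
    return None


parsed_time = parse_value_sketch(
    {'time': '+1952-03-11T00:00:00Z', 'precision': 11}
)
```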

soweego.wikidata.api_requests.resolve_qid(term, language='en')[source]

Try to resolve a search term to a QID in an "I'm feeling lucky" fashion, i.e., by taking the first search result.

Parameters
  • term (str) – a search term

  • language (str) – (optional) the language code to search in. Default: en.

Return type

Optional[str]

Returns

the QID of the first result, or None in case of no result
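
A first-result lookup of this kind can be sketched with the wbsearchentities action of the Wikidata Web API. Whether resolve_qid uses that action is an assumption. search_url and first_qid are hypothetical helpers, and the sketch only builds the request URL and reads a response dict, leaving the HTTP call out.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

API_URL = 'https://www.wikidata.org/w/api.php'


def search_url(term: str, language: str = 'en') -> str:
    """Build a wbsearchentities request URL for the given term."""
    params = {
        'action': 'wbsearchentities',
        'search': term,
        'language': language,
        'format': 'json',
    }
    return f'{API_URL}?{urlencode(params)}'


def first_qid(response: Dict) -> Optional[str]:
    """Return the ID of the first search hit, or None if there is none."""
    results = response.get('search', [])
    return results[0]['id'] if results else None


url = search_url('Douglas Adams')
# Trimmed, hand-written response (assumption) in the shape the API returns
qid = first_qid({'search': [{'id': 'Q42'}, {'id': 'Q28421831'}]})
```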

sparql_queries

Set of specific SPARQL queries for Wikidata data collection.

soweego.wikidata.sparql_queries.external_id_pids_and_urls()[source]

Retrieve Wikidata properties holding identifier values, together with their formatter URLs and regular expressions.

Return type

Iterator[Dict]

Returns

the generator yielding {PID: {formatter_URL: formatter_regex} } dicts

soweego.wikidata.sparql_queries.run_query(query_type, class_qid, catalog_pid, result_per_page)[source]

Run a filled SPARQL query template against the Wikidata endpoint, with optional paging.

Parameters
  • query_type (Tuple[str, str]) – a pair with one of {'identifier', 'links', 'dataset', 'biodata'} and {'class', 'occupation'}

  • class_qid (str) – a Wikidata ontology class, like Q5

  • catalog_pid (str) – a Wikidata property for identifiers, like P1953

  • result_per_page (int) – a page size. Use 0 to switch paging off

Return type

Iterator[Union[Tuple, str]]

Returns

the query result generator, yielding (QID, identifier_or_URL) pairs, or QID strings only, depending on query_type
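
The paging behavior can be sketched as a generator that appends OFFSET and LIMIT clauses until a short page comes back. This is an illustration under assumptions, not the library's code: paged_results is a hypothetical name, and fake_fetch replaces the real endpoint call with an in-memory list.

```python
from typing import Callable, Iterator, List


def paged_results(
    fetch_page: Callable[[str], List[str]], query: str, result_per_page: int
) -> Iterator[str]:
    """Yield results page by page; a page size of 0 runs the query once."""
    if result_per_page == 0:
        yield from fetch_page(query)
        return
    offset = 0
    while True:
        page = fetch_page(f'{query} OFFSET {offset} LIMIT {result_per_page}')
        yield from page
        if len(page) < result_per_page:
            return  # a short (or empty) page means there is nothing left
        offset += result_per_page


DATA = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']


def fake_fetch(query: str) -> List[str]:
    """Hypothetical endpoint stand-in that honors OFFSET and LIMIT."""
    tokens = query.split()
    if 'OFFSET' not in tokens:
        return list(DATA)
    offset = int(tokens[tokens.index('OFFSET') + 1])
    limit = int(tokens[tokens.index('LIMIT') + 1])
    return DATA[offset:offset + limit]


QUERY = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . }'
paged = list(paged_results(fake_fetch, QUERY, 2))
```

When the total is an exact multiple of the page size, the loop issues one extra request that returns an empty page before stopping.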

soweego.wikidata.sparql_queries.subclasses_of(qid)[source]

Retrieve subclasses of a given Wikidata ontology class.

Parameters

qid (str) – a Wikidata ontology class, like Q5

Return type

Set[str]

Returns

the QIDs of subclasses

soweego.wikidata.sparql_queries.superclasses_of(qid)[source]

Retrieve superclasses of a given Wikidata ontology class.

Parameters

qid (str) – a Wikidata ontology class, like Q5

Return type

Set[str]

Returns

the QIDs of superclasses
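
Class hierarchy lookups like these are typically expressed with a property path over P279 (subclass of). The sketch below builds such queries as strings. The function names are hypothetical, and the exact queries soweego sends are not shown in this reference, so treat these as one plausible formulation.

```python
def subclasses_query(qid: str) -> str:
    """Build a SPARQL query for all direct and indirect subclasses of a class."""
    # '+' follows one or more P279 hops, so the class itself is excluded
    return f'SELECT DISTINCT ?class WHERE {{ ?class wdt:P279+ wd:{qid} . }}'


def superclasses_query(qid: str) -> str:
    """Build a SPARQL query for all direct and indirect superclasses of a class."""
    # Same path, with subject and object swapped
    return f'SELECT DISTINCT ?class WHERE {{ wd:{qid} wdt:P279+ ?class . }}'


sub_q = subclasses_query('Q5')
super_q = superclasses_query('Q5')
```

The wd: and wdt: prefixes are predefined on the Wikidata Query Service, so the strings need no PREFIX declarations there.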

soweego.wikidata.sparql_queries.url_pids()[source]

Retrieve Wikidata properties holding URL values.

Return type

Iterator[str]

Returns

the PIDs generator