data_utils

The data.data_utils module provides utility functions for querying data.

Functions

wikirepo.data.data_utils._get_dir_fxns_dict(dir_name=None)

Generates a jump table dictionary of all modules in the current working directory and the get_ functions within them.

Parameters:
dir_name : str (default=None)

The name of the directory within wikirepo.data.

Returns:
fxns_dict : dict

A dictionary with keys being module names and contents being dictionaries of standardized indexes and functions.

Notes

Indexes all data querying functions within wikirepo directories.
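
A minimal sketch of the jump table idea, not the library's implementation: each module name maps to a dictionary of index names and the get_ functions found in that module. The helper build_fxns_dict, the in-memory stand-in modules and the rule that an index is the function name without its get_ prefix are all assumptions made for illustration.

```python
import types


def build_fxns_dict(modules):
    """Map each module name to a {index: function} dict of its get_ functions."""
    fxns_dict = {}
    for module in modules:
        fxns_dict[module.__name__] = {
            name[len("get_"):]: obj
            for name, obj in vars(module).items()
            if name.startswith("get_") and callable(obj)
        }
    return fxns_dict


# Two stand-in modules built in memory; their names and functions are hypothetical.
population = types.ModuleType("population")
population.get_population = lambda: "population data"
population.helper = lambda: "not indexed"  # ignored: no get_ prefix

area = types.ModuleType("area")
area.get_area = lambda: "area data"

fxns_dict = build_fxns_dict([population, area])

# Dispatch through the table instead of an if/elif chain.
print(fxns_dict["population"]["population"]())  # population data
```

Looking functions up by key keeps the dispatch code unchanged when a new module is added.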

wikirepo.data.data_utils._check_data_assertions(timespan=None, interval=None, **kwargs)

Checks standardized data assertions across functions given local functional arguments.

Parameters:
timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))

A tuple or list that defines the start and end dates to be queried.

Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.

Note 2: passing a single entry will query for that date only.

interval : str

The time interval over which queries will be made.

Note 1: see data.time_utils for options.

Note 2: if None, then only the most recent data will be queried.

Returns:
The results of a series of standardized assertions.

wikirepo.data.data_utils._get_max_workers(multicore)

wikirepo.data.data_utils.incl_dir_idxs(dir_name=None, descriptions=False)

Returns the included indexes in the given directory - the file names of its scripts.

Parameters:
dir_name : str (default=None)

The name of the directory within wikirepo.data.

descriptions : bool (default=False)

Whether to also return the descriptions of the indexes.

Returns:
included_indexes : list

A list of included indexes as derived by module names.
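
As a rough illustration of deriving indexes from script file names, the sketch below lists the stems of the .py files in a directory. The helper list_script_indexes, the temporary directory and the choice to skip dunder files such as __init__.py are assumptions, not the behaviour of incl_dir_idxs itself.

```python
import tempfile
from pathlib import Path


def list_script_indexes(dir_path):
    """Return the sorted stems of the .py scripts in dir_path, skipping dunder files."""
    return sorted(
        p.stem for p in Path(dir_path).glob("*.py") if not p.stem.startswith("__")
    )


# Build a throwaway directory with hypothetical script names.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("population.py", "area.py", "__init__.py", "notes.txt"):
        (Path(tmp) / file_name).touch()
    included_indexes = list_script_indexes(tmp)

print(included_indexes)  # ['area', 'population']
```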

wikirepo.data.data_utils.gen_base_df(locations=None, depth=None, timespan=None, interval=None, col_name='data')

Generates a baseline dataframe to be filled with queried data.

Parameters:
locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)

The locations to query.

depth : int (default=None)

The depth from the given lbls or qids that data should go.

Note: this uses ‘P150’ (contains administrative territorial entity).

timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))

A tuple or list that defines the start and end dates to be queried.

Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.

Note 2: passing a single entry will query for that date only.

interval : str

The time interval over which queries will be made.

Note 1: see data.time_utils for options.

Note 2: if None, then only the most recent data will be queried.

col_name : str (default=data)

The name of the column into which queried data should be merged.

Returns:
base_df : pd.DataFrame

A df that is ready to have queried data added to it.
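
The timespan rules and the idea of an empty baseline can be sketched in plain Python, with a list of row dictionaries standing in for the DataFrame. This is an illustration under assumptions: the helpers to_date, normalize_timespan and gen_base_rows, the "location" and "year" keys, and the fixed yearly interval are hypothetical and are not taken from wikirepo.

```python
from datetime import date


def to_date(entry):
    """Accept a datetime.date or a (year, month, day) tuple."""
    return entry if isinstance(entry, date) else date(*entry)


def normalize_timespan(timespan=None):
    """Return (start, end) dates following the rules described for timespan."""
    today = date.today()
    if timespan is None:
        return today, today
    if timespan is True:
        return date.min, today  # date.min is 0001-01-01
    if isinstance(timespan, date):
        return timespan, timespan
    entries = [to_date(entry) for entry in timespan]
    if len(entries) == 1:  # a single entry queries that date only
        return entries[0], entries[0]
    start, end = entries
    return start, end


def gen_base_rows(locations, timespan=None, col_name="data"):
    """One empty row per location and year (a yearly interval is assumed)."""
    start, end = normalize_timespan(timespan)
    return [
        {"location": location, "year": year, col_name: None}
        for location in locations
        for year in range(start.year, end.year + 1)
    ]


rows = gen_base_rows(["Germany", "France"], timespan=(date(2018, 1, 1), (2020, 1, 1)))
print(len(rows))  # 6
```

Each row starts with None in the data column, so later steps only have to fill in values.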

wikirepo.data.data_utils.assign_to_column(df=None, locations=None, depth=None, interval=None, col_name='data', props=None, assign='all', span=False)

Assigns Wikidata property values to a designated column of a given df.

Parameters:
df : pd.DataFrame

A df (likely base_df) to which values should be assigned.

locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)

The locations to query.

depth : int (default=None)

The depth from the given lbls or qids that data should go.

Note: this uses ‘P150’ (contains administrative territorial entity).

interval : str

The time interval over which queries will be made.

Note 1: see data.time_utils for options.

Note 2: if None, then only the most recent data will be queried.

col_name : str

A column in df to which properties should be assigned.

props : list or dict

The properties to be assigned.

assign : str (default=all)

The type of assignment.

span : bool (default=False)

Whether to check for P580 ‘start time’ and P582 ‘end time’ to create spans.

Returns:
df : pd.DataFrame

The df after assignment.
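
To illustrate what creating spans from P580 and P582 could look like, the sketch below fills a column with the value of whichever statement covers each row's date. The flat statement dictionaries, the helper assign_span_values and the "date" key are simplifications assumed for this example. They are not the Wikidata entity format and not wikirepo's code.

```python
from datetime import date


def assign_span_values(rows, statements, col_name="data"):
    """Fill col_name with the value of the first statement whose span covers the row date."""
    for row in rows:
        for stmt in statements:
            start = stmt.get("P580", date.min)  # start time, open-ended if absent
            end = stmt.get("P582", date.max)  # end time, open-ended if absent
            if start <= row["date"] <= end:
                row[col_name] = stmt["value"]
                break
    return rows


# Hypothetical statements: one closed span followed by an open-ended one.
statements = [
    {"value": "old", "P580": date(1950, 1, 1), "P582": date(1989, 12, 31)},
    {"value": "new", "P580": date(1990, 1, 1)},
]
rows = [
    {"date": date(1940, 1, 1), "data": None},
    {"date": date(1980, 1, 1), "data": None},
    {"date": date(2000, 1, 1), "data": None},
]
assign_span_values(rows, statements)
print([row["data"] for row in rows])  # [None, 'old', 'new']
```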

wikirepo.data.data_utils.gen_base_and_assign_to_column(locations=None, depth=None, timespan=None, interval=None, col_name='data', props=None, assign=None, span=False)

Combines data_utils.gen_base_df and data_utils.assign_to_column.

wikirepo.data.data_utils.assign_to_cols(df=None, locations=None, depth=None, sub_pid=None, interval=None, col_prefix='d', props=None, assign='all', span=False)

Assigns Wikidata property values from a qualifier to a designated column of a given df.

Parameters:
df : pd.DataFrame

A df (likely base_df) to which values should be assigned.

locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)

The locations to query.

depth : int (default=None)

The depth from the given lbls or qids that data should go.

Note: this uses ‘P150’ (contains administrative territorial entity).

sub_pid : str (default=None)

The Wikidata property that subsets time values.

interval : str

The time interval over which queries will be made.

Note 1: see data.time_utils for options.

Note 2: if None, then only the most recent data will be queried.

col_prefix : str (default=d)

The prefix for columns that are created from sub_pid values.

props : list or dict

The properties to be assigned.

assign : str (default=all)

The type of assignment.

span : bool (default=False)

Whether to check for P580 ‘start time’ and P582 ‘end time’ to create spans.

Returns:
df : pd.DataFrame

The df after assignment.
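
One way to picture qualifier-based columns is a row that gains one column per qualifier value, each named with the column prefix. The sketch assumes flat statement dictionaries, a hypothetical helper assign_to_prefixed_cols and a prefix_value naming scheme. None of these are confirmed by the library, and P21 (sex or gender) is used only as an illustrative sub_pid.

```python
def assign_to_prefixed_cols(row, statements, sub_pid, col_prefix="d"):
    """Write each statement value to a column named after its sub_pid qualifier."""
    for stmt in statements:
        qualifier = stmt["qualifiers"].get(sub_pid)
        if qualifier is not None:
            row[f"{col_prefix}_{qualifier}"] = stmt["value"]
    return row


# Hypothetical statements for one location; the numbers are made up.
statements = [
    {"value": 51, "qualifiers": {"P21": "female"}},
    {"value": 49, "qualifiers": {"P21": "male"}},
    {"value": 100, "qualifiers": {}},  # no sub_pid qualifier, so no column is created
]
row = assign_to_prefixed_cols({"location": "X"}, statements, sub_pid="P21")
print(row)  # {'location': 'X', 'd_female': 51, 'd_male': 49}
```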

wikirepo.data.data_utils.gen_base_and_assign_to_cols(locations=None, depth=None, sub_pid=None, timespan=None, interval=None, col_name=None, col_prefix='d', props=None, assign=None, span=False)

Combines data_utils.gen_base_df and data_utils.assign_to_cols.

wikirepo.data.data_utils.query_wd_prop(dir_name=None, ents_dict=None, locations=None, depth=None, timespan=None, interval=None, pid=None, sub_pid=None, col_name=None, col_prefix=None, ignore_char='', span=False)

Queries a Wikidata property for the given location(s).

Parameters:
dir_name : str (default=None)

The name of the directory within wikirepo.data.

ents_dict : wd_utils.EntitiesDict, optional (default=None)

A dictionary with keys being Wikidata QIDs and values being their entities.

locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)

The locations to query.

depth : int (default=None)

The depth from the given lbls or qids that data should go.

Note: this uses ‘P150’ (contains administrative territorial entity).

timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))

A tuple or list that defines the start and end dates to be queried.

Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.

Note 2: passing a single entry will query for that date only.

interval : str

The time interval over which queries will be made.

Note 1: see data.time_utils for options.

Note 2: if None, then only the most recent data will be queried.

pid : str (default=None)

The Wikidata property that is being queried.

sub_pid : str (default=None)

The Wikidata property that subsets time values.

col_name : str (default=None)

The name of the column into which queried data should be merged.

col_prefix : str (default=None)

The prefix for columns that are created from sub_pid values.

Note: only use col_name or col_prefix.

ignore_char : str

Characters in the output that should be ignored.

span : bool (default=False)

Whether to check for P580 ‘start time’ and P582 ‘end time’ to create spans.

Returns:
df, ents_dict : pd.DataFrame, wd_utils.EntitiesDict

A df of location names and the given property for the given timespan with an updated EntitiesDict.

wikirepo.data.data_utils.query_repo_dir(dir_name=None, ents_dict=None, locations=None, depth=None, timespan=None, interval=None, verbose=True, **kwargs)

Generates a df of statistics for a given wikirepo directory as well as geographic and time intervals.

Parameters:
dir_name : str (default=None)

The name of the directory within wikirepo.data.

ents_dict : wd_utils.EntitiesDict, optional (default=None)

A dictionary with keys being Wikidata QIDs and values being their entities.

locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)

The locations to query.

depth : int (default=None)

The depth from the given lbls or qids that data should go.

Note: this uses ‘P150’ (contains administrative territorial entity).

timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))

A tuple or list that defines the start and end dates to be queried.

Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.

Note 2: passing a single entry will query for that date only.

interval : str

The time interval over which queries will be made.

Note 1: see data.time_utils for options.

Note 2: if None, then only the most recent data will be queried.

verbose : bool (default=True)

Whether to show a tqdm progress bar for the query.

Returns:
df_data : pd.DataFrame

A df of locations and data given timespan and demographic index arguments.

wikirepo.data.data_utils.interp_by_subset(df=None, depth=None, col_name='data', **kwargs)

Subsets a df by a given geo_lvl and interpolates the given column.

Parameters:
df : pd.DataFrame (default=None)

A dataframe to have a column interpolated.

depth : int (default=None)

The depth from the given lbls or qids that data should go.

Note: this uses ‘P150’ (contains administrative territorial entity).

col_name : str

A column in df that is to be interpolated.

Returns:
df_interpolated : pd.DataFrame

The original df with the given column interpolated based on **kwargs.

Notes

pd.DataFrame.interpolate and scipy.interpolate **kwargs are passed.
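
The reason for subsetting before interpolating is that gaps should be filled only from values of the same subset. The plain-Python sketch below shows this with linear interpolation, which is also the default method of pd.DataFrame.interpolate. The helpers interpolate_linear and interp_rows_by_group and the "location" grouping key are assumptions for illustration.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def interp_rows_by_group(rows, group_col, col_name="data"):
    """Interpolate col_name separately within each group, keeping row order."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(row)
    for group_rows in groups.values():
        filled = interpolate_linear([row[col_name] for row in group_rows])
        for row, value in zip(group_rows, filled):
            row[col_name] = value
    return rows


# Rows are assumed to be ordered by time within each location.
rows = [
    {"location": "A", "data": 10},
    {"location": "A", "data": None},
    {"location": "A", "data": 30},
    {"location": "B", "data": 1},
    {"location": "B", "data": None},
    {"location": "B", "data": None},
    {"location": "B", "data": 4},
]
interp_rows_by_group(rows, group_col="location")
print([row["data"] for row in rows])  # [10, 20.0, 30, 1, 2.0, 3.0, 4]
```

Without the grouping step, the gap after the last value of A would be interpolated towards the first value of B.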

wikirepo.data.data_utils.sum_df_prop_vals(df=None, target_lctn=None, vals_lctn=None, lctn_col=None, time_col=None, prop_col=None, subtract=False, drop_vals_lctn=False)

Adds or subtracts values in a dataframe for given locations.

Parameters:
df : pd.DataFrame (default=None)

A dataframe with property rows to be summed.

target_lctn : str (default=None)

The name of the location to which values should be added.

Note: subtract=True subtracts from this location.

vals_lctn : str (default=None)

The name of the location that has values to be added or subtracted.

lctn_col : str (default=None)

The name of the column in which the locations are defined.

time_col : str (default=None)

The name of the column in which times are defined.

prop_col : str (default=None)

The name of the column in which property values are found.

subtract : bool (default=False)

Whether values from vals_lctn should be subtracted from target_lctn.

drop_vals_lctn : bool (default=False)

Whether rows with vals_lctn should be dropped from df.

Returns:
df_new : pd.DataFrame

The dataframe post arithmetic operations.
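
The arithmetic described by these parameters can be sketched with a list of row dictionaries in place of the DataFrame. The helper sum_prop_vals and the assumption that rows are matched on equal time values are illustrative and may differ from the actual implementation.

```python
def sum_prop_vals(rows, target_lctn, vals_lctn, lctn_col, time_col, prop_col,
                  subtract=False, drop_vals_lctn=False):
    """Add (or subtract) the values of vals_lctn to target_lctn for matching times."""
    # Values of vals_lctn keyed by time, so rows are matched on the same time.
    vals_by_time = {
        row[time_col]: row[prop_col] for row in rows if row[lctn_col] == vals_lctn
    }
    sign = -1 if subtract else 1
    new_rows = []
    for row in rows:
        if drop_vals_lctn and row[lctn_col] == vals_lctn:
            continue
        row = dict(row)  # copy, so the input rows are left unchanged
        if row[lctn_col] == target_lctn and row[time_col] in vals_by_time:
            row[prop_col] += sign * vals_by_time[row[time_col]]
        new_rows.append(row)
    return new_rows


rows = [
    {"location": "A", "year": 2019, "population": 100},
    {"location": "A", "year": 2020, "population": 110},
    {"location": "B", "year": 2019, "population": 10},
    {"location": "B", "year": 2020, "population": 20},
]
summed = sum_prop_vals(rows, "A", "B", "location", "year", "population",
                       drop_vals_lctn=True)
print([row["population"] for row in summed])  # [110, 130]
```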

wikirepo.data.data_utils.split_col_val_dates(df=None, col=None)

Splits the dates from the values in a given column into another column.

Parameters:
df : pd.DataFrame (default=None)

A dataframe with a column whose values contain dates to be split out.

col : str (default=None)

The name of the column which should have its dates split to another column.

Returns:
df_new : pd.DataFrame

The dataframe post splitting the date from the values.

wikirepo.data.data_utils.count_df_prop_vals(df=None, col=None, percent=False)

Returns value counts of df columns sorted alphabetically.

Parameters:
df : pd.DataFrame (default=None)

Regional data including population size.

col : str (default=None)

The column in df in which counts should be made.

percent : bool (default=False)

Whether to return percentage values.

Returns:
val_counts or pd.value_counts : dict or pd.value_counts

Aggregate or percentage value counts.
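
A small sketch of alphabetically sorted value counts, using collections.Counter on a plain list that stands in for a column. The helper count_vals is hypothetical, and expressing percentages on a 0 to 100 scale is an assumption. The library may return fractions instead.

```python
from collections import Counter


def count_vals(values, percent=False):
    """Return value counts keyed alphabetically, optionally as percentages."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        key: (count / total * 100 if percent else count)
        for key, count in sorted(counts.items())
    }


column = ["b", "a", "b", "c"]
print(count_vals(column))  # {'a': 1, 'b': 2, 'c': 1}
print(count_vals(column, percent=True))  # {'a': 25.0, 'b': 50.0, 'c': 25.0}
```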