data_utils
The data.data_utils module provides utility functions for querying data.
Functions
wikirepo.data.data_utils._get_fxn_idx()
- wikirepo.data.data_utils._get_dir_fxns_dict(dir_name=None)
Generates a jump table dictionary of all modules in the cwd and the get_ functions within.
- Parameters:
- dir_name : str (default=None)
The name of the directory within wikirepo.data.
- Returns:
- fxns_dict : dict
A dictionary whose keys are module names and whose values are dictionaries of standardized indexes and functions.
Notes
Indexes all data querying functions within wikirepo directories.
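The jump table idea can be sketched with the standard library alone. This is a minimal illustration, not wikirepo's implementation: build_jump_table and the two stand-in modules are hypothetical, and keying the inner dictionaries by function name is an assumption.

```python
import inspect
import types


def build_jump_table(modules):
    """Map each module name to a dict of its get_* functions, keyed by function name."""
    return {
        mod.__name__: {
            name: fxn
            for name, fxn in inspect.getmembers(mod, inspect.isfunction)
            if name.startswith("get_")
        }
        for mod in modules
    }


def _population():
    return "population values"


def _area():
    return "area values"


def _helper():
    return "not a query function"


# Hypothetical stand-ins for the scripts found in a data directory.
population = types.ModuleType("population")
population.get_population = _population
population.helper = _helper  # skipped: the name does not start with get_

area = types.ModuleType("area")
area.get_area = _area

fxns_dict = build_jump_table([population, area])

# Dispatch by module name and function name instead of an if/elif chain.
result = fxns_dict["area"]["get_area"]()
```

Looking functions up in a dictionary keeps the calling code unchanged when a new module is added to the directory.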
- wikirepo.data.data_utils._check_data_assertions(timespan=None, interval=None, **kwargs)
Checks standardized data assertions across functions given local functional arguments.
- Parameters:
- timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))
A tuple or list that defines the start and end dates to be queried.
Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.
Note 2: passing a single entry will query for that date only.
- interval : str
The time interval over which queries will be made.
Note 1: see data.time_utils for options.
Note 2: if None, then only the most recent data will be queried.
- Returns:
- The results of a series of standardized assertions.
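The timespan rules described in the notes above can be sketched as follows. This is a hedged illustration only: normalize_timespan is a hypothetical helper, it handles datetime.date entries but not the tuple form of a date, and the exact assertions wikirepo makes may differ.

```python
from datetime import date


def normalize_timespan(timespan=None):
    """Return a (start, end) tuple following the documented timespan rules."""
    today = date.today()
    if timespan is None:
        # Default: query the current day only.
        return (today, today)
    if timespan is True:
        # Full timespan from 1-1-1 to the current day.
        return (date(1, 1, 1), today)
    if isinstance(timespan, date):
        # A single entry queries that date only.
        return (timespan, timespan)

    assert isinstance(timespan, (tuple, list)), "timespan must be a tuple or list"
    assert all(isinstance(d, date) for d in timespan), "entries must be datetime.date"
    if len(timespan) == 1:
        return (timespan[0], timespan[0])
    assert len(timespan) == 2, "timespan must have two elements"
    return tuple(timespan)


single_day = normalize_timespan(date(2020, 1, 1))
full_span = normalize_timespan(True)
explicit = normalize_timespan([date(2009, 1, 1), date(2010, 1, 1)])
```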
- wikirepo.data.data_utils.incl_dir_idxs(dir_name=None, descriptions=False)
Returns the included indexes in the given directory, i.e. the file names of its scripts.
- Parameters:
- dir_name : str (default=None)
The name of the directory within wikirepo.data.
- descriptions : bool (default=False)
Whether to also return the descriptions of the indexes.
- Returns:
- included_indexes : list
A list of included indexes as derived from module names.
- wikirepo.data.data_utils.gen_base_df(locations=None, depth=None, timespan=None, interval=None, col_name='data')
Generates a baseline dataframe to be filled with queried data.
- Parameters:
- locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)
The locations to query.
- depth : int (default=None)
The depth from the given lbls or qids that data should go.
Note: this uses ‘P150’ (contains administrative territorial entity).
- timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))
A tuple or list that defines the start and end dates to be queried.
Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.
Note 2: passing a single entry will query for that date only.
- interval : str
The time interval over which queries will be made.
Note 1: see data.time_utils for options.
Note 2: if None, then only the most recent data will be queried.
- col_name : str (default='data')
The name of the column into which queried data should be merged.
- Returns:
- base_df : pd.DataFrame
A df that is ready to have queried data added to it.
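The shape of such a baseline can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the library's code: gen_base_rows is hypothetical, it assumes a yearly interval, and the "location" and "year" keys are illustrative names rather than the columns wikirepo produces.

```python
from datetime import date


def gen_base_rows(locations, timespan, col_name="data"):
    """Build one empty row per (location, year) pair, assuming a yearly interval."""
    start, end = timespan
    years = range(start.year, end.year + 1)
    return [
        {"location": lctn, "year": year, col_name: None}  # value filled in later
        for lctn in locations
        for year in years
    ]


base_rows = gen_base_rows(
    locations=["Germany", "France"],
    timespan=(date(2018, 1, 1), date(2020, 1, 1)),
)
```

Each row starts with an empty value under col_name, which is the slot that later assignment steps would fill.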
- wikirepo.data.data_utils.assign_to_column(df=None, locations=None, depth=None, interval=None, col_name='data', props=None, assign='all', span=False)
Assigns Wikidata property values to a designated column of a given df.
- Parameters:
- df : pd.DataFrame
A df (likely base_df) to which values should be assigned.
- locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)
The locations to query.
- depth : int (default=None)
The depth from the given lbls or qids that data should go.
Note: this uses ‘P150’ (contains administrative territorial entity).
- interval : str
The time interval over which queries will be made.
Note 1: see data.time_utils for options.
Note 2: if None, then only the most recent data will be queried.
- col_name : str
A column in df to which properties should be assigned.
- props : list or dict
The properties to be assigned.
- assign : str (default='all')
The type of assignment.
- span : bool (default=False)
Whether to check for P580 ‘start time’ and P582 ‘end time’ to create spans.
- Returns:
- df : pd.DataFrame
The df after assignment.
- wikirepo.data.data_utils.gen_base_and_assign_to_column(locations=None, depth=None, timespan=None, interval=None, col_name='data', props=None, assign=None, span=False)
Combines data_utils.gen_base_df and data_utils.assign_to_column.
- wikirepo.data.data_utils.assign_to_cols(df=None, locations=None, depth=None, sub_pid=None, interval=None, col_prefix='d', props=None, assign='all', span=False)
Assigns Wikidata property values from a qualifier to a designated column of a given df.
- Parameters:
- df : pd.DataFrame
A df (likely base_df) to which values should be assigned.
- locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)
The locations to query.
- depth : int (default=None)
The depth from the given lbls or qids that data should go.
Note: this uses ‘P150’ (contains administrative territorial entity).
- sub_pid : str (default=None)
The Wikidata property that subsets time values.
- interval : str
The time interval over which queries will be made.
Note 1: see data.time_utils for options.
Note 2: if None, then only the most recent data will be queried.
- col_prefix : str (default='d')
The prefix for columns that are created from sub_pid values.
- props : list or dict
The properties to be assigned.
- assign : str (default='all')
The type of assignment.
- span : bool (default=False)
Whether to check for P580 ‘start time’ and P582 ‘end time’ to create spans.
- Returns:
- df : pd.DataFrame
The df after assignment.
- wikirepo.data.data_utils.gen_base_and_assign_to_cols(locations=None, depth=None, sub_pid=None, timespan=None, interval=None, col_name=None, col_prefix='d', props=None, assign=None, span=False)
Combines data_utils.gen_base_df and data_utils.assign_to_cols.
- wikirepo.data.data_utils.query_wd_prop(dir_name=None, ents_dict=None, locations=None, depth=None, timespan=None, interval=None, pid=None, sub_pid=None, col_name=None, col_prefix=None, ignore_char='', span=False)
Queries a Wikidata property for the given location(s).
- Parameters:
- dir_name : str (default=None)
The name of the directory within wikirepo.data.
- ents_dict : wd_utils.EntitiesDict, optional (default=None)
A dictionary with keys being Wikidata QIDs and values being their entities.
- locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)
The locations to query.
- depth : int (default=None)
The depth from the given lbls or qids that data should go.
Note: this uses ‘P150’ (contains administrative territorial entity).
- timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))
A tuple or list that defines the start and end dates to be queried.
Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.
Note 2: passing a single entry will query for that date only.
- interval : str
The time interval over which queries will be made.
Note 1: see data.time_utils for options.
Note 2: if None, then only the most recent data will be queried.
- pid : str (default=None)
The Wikidata property that is being queried.
- sub_pid : str (default=None)
The Wikidata property that subsets time values.
- col_name : str (default=None)
The name of the column into which queried data should be merged.
- col_prefix : str (default=None)
The prefix for columns that are created from sub_pid values.
Note: use only one of col_name or col_prefix.
- ignore_char : str
Characters in the output that should be ignored.
- span : bool (default=False)
Whether to check for P580 ‘start time’ and P582 ‘end time’ to create spans.
- Returns:
- df, ents_dict : pd.DataFrame, wd_utils.EntitiesDict
A df of location names and the given property for the given timespan, along with an updated EntitiesDict.
- wikirepo.data.data_utils.query_repo_dir(dir_name=None, ents_dict=None, locations=None, depth=None, timespan=None, interval=None, verbose=True, **kwargs)
Generates a df of statistics for a given psk directory over the given geographic and time intervals.
- Parameters:
- dir_name : str (default=None)
The name of the directory within wikirepo.data.
- ents_dict : wd_utils.EntitiesDict, optional (default=None)
A dictionary with keys being Wikidata QIDs and values being their entities.
- locations : str, list, or lctn_utils.LocationsDict (contains strs), optional (default=None)
The locations to query.
- depth : int (default=None)
The depth from the given lbls or qids that data should go.
Note: this uses ‘P150’ (contains administrative territorial entity).
- timespan : two element tuple or list, contains datetime.date or tuple (default=None: (date.today(), date.today()))
A tuple or list that defines the start and end dates to be queried.
Note 1: if True, then the full timespan from 1-1-1 to the current day will be queried.
Note 2: passing a single entry will query for that date only.
- interval : str
The time interval over which queries will be made.
Note 1: see data.time_utils for options.
Note 2: if None, then only the most recent data will be queried.
- verbose : bool (default=True)
Whether to show a tqdm progress bar for the query.
- Returns:
- df_data : pd.DataFrame
A df of locations and data given timespan and demographic index arguments.
- wikirepo.data.data_utils.interp_by_subset(df=None, depth=None, col_name='data', **kwargs)
Subsets a df by a given geo_lvl and interpolates the given column.
- Parameters:
- df : pd.DataFrame (default=None)
A dataframe to have a column interpolated.
- depth : int (default=None)
The depth from the given lbls or qids that data should go.
Note: this uses ‘P150’ (contains administrative territorial entity).
- col_name : str
A column in df that is to be interpolated.
- Returns:
- df_interpolated : pd.DataFrame
The original df with the given column interpolated based on **kwargs.
Notes
pd.DataFrame.interpolate and scipy.interpolate **kwargs are passed.
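The subset-then-interpolate pattern can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: interpolate_linear and interp_by_group are hypothetical helpers, only linear interpolation of interior gaps is shown, and rows are grouped by a location key rather than by depth.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


def interp_by_group(rows, group_key, col):
    """Interpolate col separately within each group so groups never mix."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    for group_rows in groups.values():
        filled = interpolate_linear([r[col] for r in group_rows])
        for row, value in zip(group_rows, filled):
            row[col] = value  # rows are updated in place
    return rows


rows = [
    {"location": "Region A", "year": 2018, "data": 10.0},
    {"location": "Region A", "year": 2019, "data": None},
    {"location": "Region A", "year": 2020, "data": 20.0},
    {"location": "Region B", "year": 2018, "data": 1.0},
    {"location": "Region B", "year": 2019, "data": None},
    {"location": "Region B", "year": 2020, "data": 3.0},
]
interp_by_group(rows, group_key="location", col="data")
```

Interpolating per group matters because a gap in one location should be filled from that location's own neighbouring values, never from the rows of another location.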
- wikirepo.data.data_utils.sum_df_prop_vals(df=None, target_lctn=None, vals_lctn=None, lctn_col=None, time_col=None, prop_col=None, subtract=False, drop_vals_lctn=False)
Adds or subtracts values in a dataframe for given locations.
- Parameters:
- df : pd.DataFrame (default=None)
A dataframe with property rows to be summed.
- target_lctn : str (default=None)
The name of the location to which values should be added.
Note: subtract=True subtracts from this location.
- vals_lctn : str (default=None)
The name of the location that has values to be added or subtracted.
- lctn_col : str (default=None)
The name of the column in which the locations are defined.
- time_col : str (default=None)
The name of the column in which times are defined.
- prop_col : str (default=None)
The name of the column in which property values are found.
- subtract : bool (default=False)
Whether values from vals_lctn should be subtracted from target_lctn.
- drop_vals_lctn : bool (default=False)
Whether rows with vals_lctn should be dropped from df.
- Returns:
- df_new : pd.DataFrame
The dataframe after the arithmetic operations.
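The arithmetic described by these parameters can be sketched on a list of row dictionaries. This is a hedged stand-in, not the library's code: sum_prop_vals is hypothetical, and matching rows by equal time values is an assumption about how the two locations are aligned.

```python
def sum_prop_vals(rows, target_lctn, vals_lctn, lctn_col, time_col, prop_col,
                  subtract=False, drop_vals_lctn=False):
    """Add (or subtract) vals_lctn's values to target_lctn's rows, matched by time."""
    sign = -1 if subtract else 1
    # Look up the values to combine, keyed by their time entry.
    vals_by_time = {
        r[time_col]: r[prop_col] for r in rows if r[lctn_col] == vals_lctn
    }
    new_rows = []
    for r in rows:
        if drop_vals_lctn and r[lctn_col] == vals_lctn:
            continue
        r = dict(r)  # copy so the input rows stay untouched
        if r[lctn_col] == target_lctn and r[time_col] in vals_by_time:
            r[prop_col] += sign * vals_by_time[r[time_col]]
        new_rows.append(r)
    return new_rows


rows = [
    {"location": "Region A", "year": 2020, "pop": 100},
    {"location": "Region B", "year": 2020, "pop": 10},
    {"location": "Region A", "year": 2021, "pop": 110},
    {"location": "Region B", "year": 2021, "pop": 12},
]

summed = sum_prop_vals(rows, "Region A", "Region B", "location", "year", "pop",
                       drop_vals_lctn=True)
subtracted = sum_prop_vals(rows, "Region A", "Region B", "location", "year", "pop",
                           subtract=True)
```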
- wikirepo.data.data_utils.split_col_val_dates(df=None, col=None)
Splits the dates in a dataframe column from its values and moves them to another column.
- Parameters:
- df : pd.DataFrame (default=None)
A dataframe with a column whose values contain dates to be split out.
- col : str (default=None)
The name of the column which should have its dates split to another column.
- Returns:
- df_new : pd.DataFrame
The dataframe after splitting the dates from the values.
- wikirepo.data.data_utils.count_df_prop_vals(df=None, col=None, percent=False)
Returns value counts of df columns sorted alphabetically.
- Parameters:
- df : pd.DataFrame (default=None)
Regional data including population size.
- col : str (default=None)
The column in df in which counts should be made.
- percent : bool (default=False)
Whether to return percentage values.
- Returns:
- val_counts or pd.value_counts : dict or pd.value_counts
Aggregate or percentage value counts.
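Alphabetically sorted value counts can be sketched with collections.Counter. This is an illustration under assumptions: count_vals is a hypothetical helper operating on a plain list, and expressing percentages on a 0 to 100 scale is a guess about what percent=True returns.

```python
from collections import Counter


def count_vals(values, percent=False):
    """Count occurrences of each value, with keys in alphabetical order."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        key: (counts[key] / total * 100 if percent else counts[key])
        for key in sorted(counts)  # dicts preserve this insertion order
    }


column = ["urban", "rural", "urban", "urban"]
val_counts = count_vals(column)
val_percents = count_vals(column, percent=True)
```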