dataio package
Subpackages
- dataio.schemas package
- Subpackages
- dataio.schemas.bonsai_api package
- Submodules
- dataio.schemas.bonsai_api.PPF_fact_schemas_samples module
- dataio.schemas.bonsai_api.PPF_fact_schemas_uncertainty module
- dataio.schemas.bonsai_api.admin module
- dataio.schemas.bonsai_api.base_models module
- dataio.schemas.bonsai_api.correspondences module
- dataio.schemas.bonsai_api.dims module
- dataio.schemas.bonsai_api.external_schemas module
- dataio.schemas.bonsai_api.facts module
- dataio.schemas.bonsai_api.ipcc module
- dataio.schemas.bonsai_api.matrix module
- dataio.schemas.bonsai_api.metadata module
- dataio.schemas.bonsai_api.uncertainty module
- Module contents
- Submodules
- dataio.schemas.base_models module
- Module contents
- dataio.utils package
- Subpackages
- Submodules
- dataio.utils.accounts module
- dataio.utils.connectors module
- dataio.utils.path_manager module
PathBuilder
PathBuilder.b2_version
PathBuilder.balance_raw
PathBuilder.cement
PathBuilder.cement_data
PathBuilder.cleaned_exio_3
PathBuilder.cleaned_fao
PathBuilder.cleaned_forestry
PathBuilder.compose()
PathBuilder.corresp_fao
PathBuilder.dm_coeff
PathBuilder.emissions_coeff
PathBuilder.emissions_intermediate
PathBuilder.emissions_raw
PathBuilder.fao_collection
PathBuilder.fao_processed
PathBuilder.fao_raw
PathBuilder.fao_store
PathBuilder.fert_interm
PathBuilder.fertilisers
PathBuilder.ferts_collection
PathBuilder.fish_markets
PathBuilder.gams_inputs
PathBuilder.heat_markets
PathBuilder.hiot_interm
PathBuilder.hiot_raw
PathBuilder.hiot_with_capital
PathBuilder.hiot_with_iluc
PathBuilder.hiot_with_marg_elect
PathBuilder.iLUC_interm
PathBuilder.iLUC_param
PathBuilder.iLUC_raw
PathBuilder.iea_clean
PathBuilder.iea_interm
PathBuilder.iea_raw_exio
PathBuilder.ipcc_param
PathBuilder.land_use
PathBuilder.lci_act_generic
PathBuilder.lci_cleaned
PathBuilder.lci_concito
PathBuilder.lci_country
PathBuilder.lci_exio4
PathBuilder.lci_fish
PathBuilder.lci_prod_generic
PathBuilder.lci_products
PathBuilder.lci_raw
PathBuilder.lci_vehicles
PathBuilder.list_path_attributes()
PathBuilder.matrix_of_invest
PathBuilder.merged_collected_data
PathBuilder.monetary_tables
PathBuilder.natural_resource
PathBuilder.outlier
PathBuilder.prices
PathBuilder.prod_markets
PathBuilder.property_param
PathBuilder.property_values
PathBuilder.simapro
PathBuilder.supply_intermediate
PathBuilder.supply_raw
PathBuilder.trade_intermediate
PathBuilder.trade_raw
PathBuilder.trade_route
PathBuilder.un_data
PathBuilder.un_data_elab
PathBuilder.use_intermediate
PathBuilder.use_raw
PathBuilder.value_added
PathBuilder.waste_accounts
PathBuilder.waste_markets
- dataio.utils.schema_enums module
APIEndpoints
APIEndpoints.ACTIVITIES
APIEndpoints.ACTIVITY_CORR
APIEndpoints.BASE_URL
APIEndpoints.FOOTPRINT
APIEndpoints.LOCATIONS
APIEndpoints.LOCATION_CORR
APIEndpoints.METADATA
APIEndpoints.PRODUCT
APIEndpoints.PRODUCTS
APIEndpoints.PRODUCT_CORR
APIEndpoints.PROPERTIES
APIEndpoints.RECIPES
APIEndpoints.SUPPLY
APIEndpoints.TOKEN
APIEndpoints.USE
APIEndpoints.get_url()
EmissCompartment
Exio_fert_nutrients
IndexNames
IndexNames.ACT_CODE
IndexNames.AGGREGATION
IndexNames.AGRI_SYSTEM
IndexNames.ANIMAL_CATEGORY
IndexNames.ASSOCIATED_TREAT
IndexNames.CLIMATE_METRIC
IndexNames.COEFFICIENT
IndexNames.COUNTRY_CODE
IndexNames.DESCRIP
IndexNames.DESTIN_COUNTRY
IndexNames.DESTIN_COUNTRY_EDGE
IndexNames.EDGE
IndexNames.EMIS_COMPARTMENT
IndexNames.EMIS_SUBSTANCE
IndexNames.EXIO3
IndexNames.EXIO3_ACT
IndexNames.EXIO_ACCOUNT
IndexNames.EXIO_ACT
IndexNames.EXIO_CNT
IndexNames.EXIO_CNT_acron
IndexNames.EXIO_CODE
IndexNames.EXIO_PRD
IndexNames.FACTOR
IndexNames.FLAG
IndexNames.GLOBAL_AREA
IndexNames.INPUT_PROD
IndexNames.LCI_FLAG
IndexNames.LCI_UNIT
IndexNames.LCI_VALUE
IndexNames.MARKET
IndexNames.NUTRIENT_CONT
IndexNames.OJBECT_CODE
IndexNames.ORIGIN_COUNTRY
IndexNames.ORIGIN_COUNTRY_EDGE
IndexNames.PACKAGED
IndexNames.PACK_CODE
IndexNames.PACK_MARKET
IndexNames.PACK_PROD
IndexNames.PARENT_CLASS
IndexNames.PARENT_CODE
IndexNames.PARENT_LINK
IndexNames.PARENT_NAME
IndexNames.PERIOD
IndexNames.PERIOD_DELAY
IndexNames.POSITION
IndexNames.PRODUCT
IndexNames.PRODUCTION
IndexNames.PROD_CODE
IndexNames.REF_PROD_CODE
IndexNames.REPLACED_MRK
IndexNames.REPLACED_PRODUCT
IndexNames.REPL_FACTOR
IndexNames.RESOURCE
IndexNames.SCENARIO
IndexNames.SHARE
IndexNames.SOURCE
IndexNames.SOURCE_CLASS
IndexNames.SOURCE_CODE
IndexNames.SOURCE_LINK
IndexNames.SOURCE_NAME
IndexNames.SUBSTITUTION_FACTOR
IndexNames.TARGET_CLASS
IndexNames.TARGET_CODE
IndexNames.TARGET_LINK
IndexNames.TARGET_NAME
IndexNames.TRADE_ROUTE_ID
IndexNames.TRANSPORT_MODE
IndexNames.TRANSPORT_MODE_EDGE
IndexNames.UNIT
IndexNames.UNIT_DESTIN
IndexNames.UNIT_FOOTPRINT
IndexNames.UNIT_SOURCE
IndexNames.VALUE
IndexNames.VALUE_FOOTPRINT
IndexNames.VALUE_IN
IndexNames.WASTE_FRACTION
IndexNames.WASTE_MARKET
Property
PropertyEnum
animal_system
data_index_categ
data_index_categ.balance_categ
data_index_categ.balance_columns
data_index_categ.column_categ
data_index_categ.emiss_categ
data_index_categ.fao_animal_system_index
data_index_categ.fao_clean_index
data_index_categ.general_categ
data_index_categ.pack_index
data_index_categ.trade_categ
data_index_categ.waste_sup_col
data_index_categ.waste_sup_index
fao_categ
global_land_categ
ipcc_categ
land_use_categ
- dataio.utils.versions module
- Module contents
Submodules
dataio.arg_parser module
dataio.config module
Created on Fri Jun 10 16:05:31 2022
@author: ReMarkt
- class dataio.config.Config(current_task_name=None, custom_resource_path=None, log_level=30, log_file=None, run_by_user=None, **kwargs)[source]
Bases:
object
- property bonsai_home
environment variable to define the home directory for dataio
- property classif_repository
- property connector_repository
- property corr_repository
- property dataio_root
environment variable to define the root directory for dataio
- static get_airflow_defaults(filename: str)[source]
Load config from airflow or from src/config/airflow_attributes.config.json
- Parameters:
filename (str) – JSON config file located in the folder ‘src/config’
- Returns:
None
- property path_repository: PathBuilder
- property resource_repository: ResourceRepository
- property schema_enums
- property schemas
- property sut_resource_repository
- property version_source
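A minimal usage sketch of Config (only the documented constructor arguments and properties are used; actual values depend on your environment):

```python
# Minimal sketch, assuming the documented constructor defaults:
from dataio.config import Config

config = Config(log_level=20)        # INFO level; the default is 30 (WARNING)
print(config.bonsai_home)            # home directory, resolved from the environment
repo = config.resource_repository    # ResourceRepository, documented in dataio.resources
```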
dataio.default module
dataio.load module
Datapackage load module of dataio utility.
- dataio.load.build_cache_key(resource: DataResource) str [source]
- Builds a unique string key for caching based on:
resource.api_endpoint
relevant fields (version, task_name, etc.)
- dataio.load.clean_up_cache(CACHE_DIR, MAX_CACHE_FILES)[source]
Enforce that only up to MAX_CACHE_FILES CSV files remain in CACHE_DIR. Remove the oldest files (by modification time) if there are more than that.
- dataio.load.get_cache_path(key: str, CACHE_DIR='./data_cache/') str [source]
Returns a local file path where the DataFrame is cached, based on the unique cache key.
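The three cache helpers compose as sketched below; `resource` (a DataResource) and `fetch_remote` are hypothetical placeholders, not part of the module:

```python
# Sketch of the caching flow; `resource` and `fetch_remote` are hypothetical.
import os

import pandas as pd

from dataio.load import build_cache_key, clean_up_cache, get_cache_path

key = build_cache_key(resource)                        # endpoint + version/task fields
path = get_cache_path(key, CACHE_DIR="./data_cache/")  # local file path for this key
if os.path.exists(path):
    df = pd.read_csv(path)                             # cache hit
else:
    df = fetch_remote(resource)                        # hypothetical fetch step
    df.to_csv(path, index=False)
clean_up_cache("./data_cache/", MAX_CACHE_FILES=3)     # evict oldest CSVs beyond the limit
```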
- dataio.load.load(path: Path, schemas: Dict[str, BaseModel] = None)[source]
Returns an empty dict if the file cannot be found.
- dataio.load.load_api(self, resource: DataResource, CACHE_DIR, MAX_CACHE_FILES) DataFrame [source]
Fetches data from the resource’s API endpoint and returns it as a DataFrame. Assumes the endpoint returns CSV text (adjust as needed for JSON, etc.).
- Raises:
ValueError – If api_endpoint is not set or an HTTP error occurs.
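The described behaviour can be approximated with requests and pandas (a sketch of the documented logic, not the actual implementation):

```python
# Approximate sketch of load_api's documented behaviour.
import io

import pandas as pd
import requests

def fetch_csv(api_endpoint: str) -> pd.DataFrame:
    if not api_endpoint:
        raise ValueError("api_endpoint is not set")
    response = requests.get(api_endpoint)
    response.raise_for_status()              # HTTP errors surface as exceptions
    return pd.read_csv(io.StringIO(response.text))
```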
- dataio.load.load_matrix_file(path_to_file: Path, schema: MatrixModel, **kwargs)[source]
- dataio.load.load_metadata(path_to_metadata, datapackage_names=None)[source]
Load metadata from a YAML file and convert it into a dictionary with UUIDs as keys and MetaData objects as values. The YAML file is expected to start directly with a list of metadata entries.
- Parameters:
path_to_metadata (str) – The path to the YAML file that contains metadata entries.
- Returns:
A dictionary where each key is a UUID (as a string) of a MetaData object and each value is the corresponding MetaData object.
- Return type:
dict[str, MetaData]
- Raises:
FileNotFoundError – If the specified file does not exist.
yaml.YAMLError – If there is an error in parsing the YAML file.
pydantic.ValidationError – If an item in the YAML file does not conform to the MetaData model.
Examples
Assuming a YAML file located at ‘example.yaml’:
>>> metadata_dict = load_metadata('example.yaml')
>>> print(metadata_dict['123e4567-e89b-12d3-a456-426614174000'])
MetaData(id=UUID('123e4567-e89b-12d3-a456-426614174000'), created_by=User(...), ...)
- dataio.load.load_table_file(path_to_file: Path, schema: BonsaiBaseModel, **kwargs)[source]
dataio.resources module
- class dataio.resources.CSVResourceRepository(db_path: str, table_name: str = 'resources', **kwargs)[source]
Bases:
object
- class dataio.resources.ResourceRepository(storage_method='local', db_path: str | None = None, table_name: str | None = 'resources', API_token: str | None = None, username: str | None = None, password: str | None = None, cache_dir: str = './data_cache/', MAX_CACHE_FILES: int = 3)[source]
Bases:
object
Repository for managing data resources within a CSV file storage system.
- db_path
Path to the directory containing the resource CSV file.
- Type:
Path
- resources_list_path
Full path to the CSV file that stores resource information.
- Type:
Path
- schema
Schema used for resource data validation and storage.
- Type:
- cache_dir
cache_dir determines the location of the cached data resources. Default: ./data_cache/
- Type:
str
- add_or_update_resource_list(resource: DataResource, **kwargs)[source]
Adds a new resource or updates an existing one in the repository.
- add_to_resource_list(resource: DataResource)[source]
Adds a new resource to the repository.
- update_resource_list(resource: DataResource)[source]
Updates an existing resource in the repository.
- add_from_dataframe(data, loc, task_name, task_relation, last_update, **kwargs)[source]
Adds resource information from a DataFrame.
- write_dataframe_for_task(data, name, **kwargs)[source]
Writes a DataFrame to the storage based on resource information.
- write_dataframe_for_resource(data, resource, overwrite)[source]
Validates and writes a DataFrame to the resource location.
- list_available_resources()
Lists all available resources in the repository.
- comment_resource(resource, comment)[source]
Adds a comment to a resource and updates the repository.
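A minimal usage sketch of the repository (the constructor arguments mirror the signature above):

```python
# Minimal sketch of a local, CSV-backed repository.
from dataio.resources import ResourceRepository

repo = ResourceRepository(storage_method="local", db_path="./resources")
for res in repo.list_available_resources():
    print(res)
```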
- add_from_dataframe(data: DataFrame, loc: Path | str, task_name: str | None = None, task_relation: str = 'output', last_update: date = datetime.date(2025, 6, 27), **kwargs) DataResource [source]
- add_or_update_resource_list(resource: DataResource, storage_method: str | None = None, **kwargs) str [source]
Adds a new resource to the repository or updates it if it already exists.
- Parameters:
resource (DataResource) – The resource data to add or update.
kwargs (dict) – Additional keyword arguments used for extended functionality.
- Return type:
str of the version's uuid
- add_to_resource_list(resource: DataResource, storage_method: str | None = None) None [source]
Appends a new resource to the repository.
- Parameters:
resource (DataResource) – The resource data to add.
- Return type:
str of the generated uuid
- comment_resource(resource: DataResource, comment: str) DataResource [source]
- convert_dataframe(data: DataFrame, original_schema, classifications: dict, units: list[str] | None = None) Tuple[DataFrame, dict[str, set[str]]] [source]
- convert_dataframe_to_bonsai_classification(data: DataFrame, original_schema, units=None) Tuple[DataFrame, dict[str, set[str]]] [source]
- convert_units(data: DataFrame, target_units: list[str]) DataFrame [source]
Converts values in the ‘value’ column of a DataFrame to the specified target units in the list. Units not listed in the target_units remain unchanged.
- Parameters:
dataframe (pd.DataFrame) – A DataFrame with ‘unit’ and ‘value’ columns.
target_units (list) – A list of target units to convert compatible units to. Example: [“kg”, “J”, “m”]
- Returns:
A DataFrame with the converted values and target units.
- Return type:
pd.DataFrame
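Usage sketch (the sample rows are invented; `repo` is a ResourceRepository instance from the earlier sketch):

```python
# Usage sketch; the data below is made up.
import pandas as pd

df = pd.DataFrame({"unit": ["g", "kWh"], "value": [1500.0, 2.0]})
converted = repo.convert_units(df, target_units=["kg", "J"])
# 1500 g -> 1.5 kg; 2 kWh -> 7.2e6 J. Units already in target_units stay unchanged.
```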
- convert_units_pandas(data: DataFrame, target_units: list[str]) DataFrame [source]
Converts values in the ‘value’ column of a DataFrame to the specified target units in the list. Units not listed in the target_units remain unchanged.
- Parameters:
dataframe (pd.DataFrame) – A DataFrame with ‘unit’ and ‘value’ columns.
target_units (list) – A list of target units to convert compatible units to. Example: [“kg”, “J”, “m”]
- Returns:
A DataFrame with the converted values and target units.
- Return type:
pd.DataFrame
- get_dataframe_for_resource(res: DataResource, storage_method: str | None = None)[source]
- get_resource_info(storage_method: str | None = None, **filters: dict) DataResource | List[DataResource] [source]
Retrieves resource information based on specified filters, optionally overriding the default storage method for this call.
- Parameters:
storage_method (str | None) – Optional override of the default storage method (‘local’ or ‘api’) for this call.
filters (dict) – Field/value pairs used to filter the resource list.
- Returns:
Matches found, either a single or multiple.
- Return type:
DataResource or List[DataResource]
- group_and_sum(df, code_column: str, group_columns: list, values_to_sum: list | set)[source]
Grouping function to handle unit compatibility
- harmonize_with_resource(dataframe, storage_method: str | None = None, overwrite=True, **kwargs)[source]
- list_available_resources(storage_method: str | None = None) list[DataResource] [source]
Lists all available resources in the repository, either from local CSV or from the API, depending on the storage method.
- Parameters:
storage_method (str | None) – Optional override for single-call usage (‘local’ or ‘api’). If None, uses self.load_storage_method.
- Returns:
A list of all DataResource items found.
- Return type:
list[DataResource]
- load_with_bonsai_classification(storage_method: str | None = None, **kwargs) Tuple[DataFrame, dict[str, set[str]]] [source]
This method loads the selected data based on kwargs with the default BONSAI classifications. The default classifications for BONSAI are the following:
location: ISO3
flowobject: BONSAI
- load_with_classification(classifications: dict, units: list[str] | None = None, storage_method: str | None = None, **kwargs) Tuple[DataFrame, dict[str, set[str]]] [source]
Loads data with a certain classification for the selected fields. Rows that cannot be automatically transformed are ignored and returned as-is.
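Usage sketch; the classification keys mirror the documented BONSAI defaults, and the `name` filter kwarg is a hypothetical selection argument:

```python
# Sketch; `repo` is a ResourceRepository, `name` is a hypothetical filter kwarg.
df, unmapped = repo.load_with_classification(
    classifications={"location": "ISO3", "flowobject": "BONSAI"},
    units=["kg"],
    name="some_resource",
)
# `unmapped` maps each field to the set of codes that could not be transformed.
```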
- resource_exists(resource: DataResource, storage_method: str | None = None) bool [source]
Checks if the given resource already exists in the repository (locally or via the API).
- Returns:
True if exactly one matching resource is found.
False if none are found.
- Return type:
bool
- Raises:
ValueError – If multiple matches are found or if an invalid storage method is set.
- update_resource_list(resource: DataResource, storage_method: str | None = None) None [source]
Updates an existing resource in the repository.
- Parameters:
resource (DataResource) – The resource data to update.
- dataio.resources.compare_version_strings(resource1: DataResource, resource2: DataResource)[source]
dataio.save module
Datapackage save module of dataio utility.
- dataio.save.old_save(datapackage, root_path: str = '.', increment: str = None, overwrite: bool = False, create_path: bool = False, log_name: str = None)[source]
Save datapackage from dataio.yaml file.
- Parameters:
datapackage (DataPackage) – dataio datapackage
root_path (str) – path to root of database
increment (str) – semantic level to increment, in [None, ‘patch’, ‘minor’, ‘major’]
overwrite (bool) – whether to overwrite
create_path (bool) – whether to create path
log_name (str) – name of log file, if None no log is set
- dataio.save.save_to_api(data: DataFrame, resource: DataResource, schema=None, overwrite=True, append=False)[source]
Saves the given DataFrame to resource.api_endpoint via a single JSON POST, so that multiple rows can be created in one request. The JSON body has the form:
{"data": [{…}, {…}]}
- Parameters:
data (pd.DataFrame) – The data to be sent. Each row becomes one dict.
resource (DataResource) – Must have a non-empty ‘api_endpoint’. Optionally, resource.id can reference the ‘version’ if your endpoint requires it.
schema (optional) – If you want to validate ‘data’ before sending, do so here.
overwrite (bool) – If your API supports ‘overwrite’, pass it as a query param or in the body (depending on your API).
- Raises:
ValueError – If ‘resource.api_endpoint’ is missing or if the POST fails.
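The request shape can be sketched as follows (`data` and `resource` are the documented parameters; the overwrite handling is API-specific and omitted):

```python
# Approximate shape of the POST that save_to_api issues.
import requests

payload = {"data": data.to_dict(orient="records")}  # one dict per DataFrame row
response = requests.post(resource.api_endpoint, json=payload)
response.raise_for_status()                         # a failed POST raises here
```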
dataio.set_logger module
Module with set_logger function.
Part of package ‘templates’. Part of the Getting the Data Right project. Created on Jul 25, 2023. @author: Joao Rodrigues
- dataio.set_logger.set_logger(filename: str | Path = None, path: str = '/builds/bonsamurais/bonsai/util/dataio/docs', log_level: int = 20, log_format: str = '%(asctime)s | [%(levelname)s]: %(message)s', overwrite=False, create_path=False) None [source]
Initialize the logger.
This function creates and initializes a log file. Logging output is sent both to the file and to standard output. If ‘filename’ is None, no file output is written. To write to this logger elsewhere in the script, add:
import logging
logger = logging.getLogger('root')
logger.info(<info_string>)
logger.warning(<warning_string>)
logger.error(<error_string>)
- Parameters:
filename (str) – name of output file
path (str) – path to folder of output file
log_level (int) – lowest log level to be reported. Options are:
10 = debug
20 = info
30 = warning
40 = error
50 = critical
log_format (str) – format of the log
overwrite (bool) – whether to overwrite existing log file
create_path (bool) – whether to create path to log file if it does not exist
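Usage sketch, consistent with the docstring above:

```python
# Log INFO and above to ./run.log and to standard output.
import logging

from dataio.set_logger import set_logger

set_logger(filename="run.log", path=".", log_level=20, overwrite=True)
logging.getLogger("root").info("logger initialized")
```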
dataio.tools module
- class dataio.tools.BonsaiBaseModel[source]
Bases:
BaseModel
- classmethod get_api_endpoint() Dict[str, str] [source]
Retrieves the api endpoint dictionary, hidden from serialization.
- classmethod get_classification() Dict[str, str] [source]
Retrieves the classification dictionary, hidden from serialization.
- classmethod get_csv_field_dtypes() Dict[str, Any] [source]
Return a dictionary with field names and their corresponding types. Since csv files can only contain str, float and int, all types that are not int or float will be changed to str.
- classmethod get_empty_dataframe()[source]
Returns an empty pandas DataFrame with columns corresponding to the fields of the data class.
- Returns:
An empty DataFrame with columns corresponding to the fields of the data class.
- Return type:
pd.DataFrame
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod to_dataclass(input_data) BonsaiTableModel [source]
- to_pandas() DataFrame [source]
Converts instances of BaseToolModel within BaseTableClass to a pandas DataFrame.
- Returns:
DataFrame containing data from instances of BaseToolModel.
- Return type:
pd.DataFrame
- classmethod transform_dataframe(df: DataFrame, column_mapping: Dict[str, str] | None = None) DataFrame [source]
Transform a DataFrame into the BonsaiTable format by renaming columns and dropping others. Ensures all non-optional columns are present, and keeps optional columns if they are in the DataFrame. The column_mapping is optional. If not provided, only columns matching the schema fields are kept.
- Parameters:
df (pd.DataFrame) – The DataFrame to transform.
column_mapping (Dict[str, str] | None) – Optional mapping from input column names to schema field names.
- Returns:
A DataFrame with columns renamed (if mapping is provided) and unnecessary columns dropped.
- Return type:
pd.DataFrame
- Raises:
ValueError – If any required columns are missing from the DataFrame.
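A sketch with a hypothetical schema subclass (the field names are invented for illustration):

```python
# Hypothetical schema; real schemas live in dataio.schemas.bonsai_api.
import pandas as pd

from dataio.tools import BonsaiBaseModel

class EmissionRecord(BonsaiBaseModel):
    location: str
    value: float
    unit: str

raw = pd.DataFrame({"region": ["DK"], "value": [1.0], "unit": ["kg"]})
clean = EmissionRecord.transform_dataframe(raw, column_mapping={"region": "location"})
empty = EmissionRecord.get_empty_dataframe()   # columns: location, value, unit
```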
- class dataio.tools.BonsaiTableModel(*, data: list[BonsaiBaseModel])[source]
Bases:
BaseModel
- data: list[BonsaiBaseModel]
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- to_json()[source]
Convert the object to a JSON string representation.
- Returns:
A JSON string representing the object with data information.
- Return type:
str
- dataio.tools.get_dataclasses(directory: str = 'src/dataio/schemas/bonsai_api') List[str] [source]
Retrieve a list of Pydantic dataclass names that inherit from BaseToolModel from Python files in the specified directory.
- Parameters:
directory (str) – The directory path where Python files containing Pydantic dataclasses are located. Defaults to “src/dataio/schemas/bonsai_api”.
- Returns:
A list of fully qualified names (including module names) of Pydantic dataclasses that inherit from BaseToolModel.
- Return type:
List[str]
dataio.validate module
Datapackage validate module of the dataio utility.
- dataio.validate.validate(full_path: str, overwrite: bool = False, log_name: str = None) dict [source]
Validate datapackage.
Validates datapackage with metadata at: <full_path>=<root_path>/<path>/<name>.dataio.yaml
Creates <name>.validate.yaml for frictionless validation and outputs dataio-specific validation to the log.
Specific fields expected in <name>.dataio.yaml:
name : should match <name>
path : from which <path> is inferred
version
- dataio.validate.validate_matrix(df: DataFrame, schema: MatrixModel)[source]
- dataio.validate.validate_schema(resource, n_errors)[source]
Check if schema, fields, primary and foreign keys exist.
- dataio.validate.validate_table(df: DataFrame, schema: BonsaiBaseModel)[source]
Module contents
Initializes dataio Python package.
- dataio.plot(full_path: str = None, overwrite: bool = False, log_name: str = None, export_png: bool = True, export_svg: bool = False, export_gv: bool = False)[source]
Create entity-relation diagram from dataio.yaml file.
Exports a .gv configuration file and figures in .svg and .png formats.
GraphViz must be installed on the computer, not only as a Python package.
Structure of the output erd configuration dictionary:
First level: key = datapackage name; value type: dictionary
Second level: key = table name; value type: pandas.DataFrame
pandas.DataFrame index: table field names
pandas.DataFrame columns:
type: str
primary: bool
foreign: bool
field: str (field of foreign key)
table: str (table of foreign key)
datapackage: str (datapackage of foreign key)
direction: str in [‘forward’, ‘back’] (direction of arrow)
style: str in [‘invis’, ‘solid’] (style of arrow)
- Parameters:
full_path (str) – path to dataio.yaml file
overwrite (bool) – whether to overwrite output files
log_name (str) – name of log file, if None no log is set
export_png (bool) – whether to export .png graphic file
export_svg (bool) – whether to export .svg graphic file
export_gv (bool) – whether to export .gv configuration file
- Returns:
dict – erd configuration dictionary
gv – graphviz configuration object
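Usage sketch (the path is illustrative; GraphViz must be available on the system):

```python
# Sketch; returns the erd dict and graphviz object per the Returns section above.
import dataio

erd, gv = dataio.plot(full_path="datapackage/dataio.yaml", export_svg=True)
```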