# Open Heritage Data Entry
Tools to handle data from the Open Heritage and Datacite database.

# Table of Contents
- [Installation](#installation)
- [How to use](#how-to-use)
  - [Setup](#setup)
  - [DOI Full Upload](#doi-full-upload)
  - [MySQL Uploader](#mysql-uploader)
  - [DOI Uploader](#doi-uploader)
  - [XML Generator](#xml-generator)
  - [Database Parser](#database-parser)
  - [Link Checker](#link-checker)
  - [Project Datasets Checker](#project-datasets-checker)
  - [Google Bucket Checker](#google-bucket-checker)
  - [Google Cloud to Dropbox File Transfer](#google-cloud-to-dropbox-file-transfer)
  - [Metadata Survey Filler](#metadata-survey-filler)
- [How it works](#how-it-works)
  - [Datacite API](#datacite-api)
  - [XML Files](#xml-files)
  - [Open Heritage Database](#open-heritage-database) 
  - [Google Cloud Storage API](#google-cloud-storage-api)
  - [File Transfer](#file-transfer)
  - [Loading Dictionaries](#loading-dictionaries)
- [Troubleshooting](#troubleshooting)
- [Resources](#resources)


## Installation
1. Clone this repository: `git clone https://github.com/ilayda-cyark/openHeritage_data_entry_tool.git`
2. Ensure Python 3.10 is installed.
3. Change into the repository directory:

`cd path/to/openHeritage_data_entry_tool`

4. Install required modules. Note: if you do not want to install all the required python modules, the requirements.txt 
   file contains a list of required python modules for each individual script.
  
`pip install -r requirements.txt`
 
5. Ensure that HeidiSQL is installed and the Open Heritage Database is accessible.


## How to use
This section includes setup instructions and how to use each of the scripts.

The first section lists how to fully set up the environment for all the scripts. The requirements.txt file contains a 
list of required python modules for each individual script.

The following sections then list individual instructions for each script, which include a brief description of what the 
script does, the minimum setup for the script, and a command line example of how to use the script. The 
command line example details any arguments that may be used and the meaning of each argument.


### Setup
1) Open the Open Heritage Heidi SQL database and download the "Organizations", "Projects_MASTER", 
and "Project_Entities" tables as csv files. Move the csv files into the "CSV_files" directory.

2) Also download the "Datasets" table as a csv file. Move the csv file into the "CSV_files" directory, so that the file 
structure looks like the following:
```
|--openHeritage_data_entry_tool
    |--.gitignore
    |-- README.md
    |-- bucket_checker.py
    |-- database_parser.py
    |-- doi_full_upload.py
    |-- doi_uploader.py
    |-- file_transfer.py
    |-- link_checker.py
    |-- metadata_survey_filler.py
    |-- mysql_uploader.py
    |-- project_datasets_checker.py
    |-- xml_generator.py
    |-- CSV_files
        |-- Camera_Name_To_Device_Type_Library.csv
        |-- Dataset.csv
        |-- Model_To_Camera_Name_Library.csv
        |-- OH_Metadata_Survey_Template.csv
        |-- Organizations.csv
        |-- Project_Entities.csv
        |-- Projects_MASTER.csv
```   

3) For Google Bucket Checker and File Transfer do the following:
    1) Go to the Google Cloud Platform: https://console.cloud.google.com/ (A Google account with editor permissions is 
       necessary).
    2) On the top header, choose "cyark-data-platform" as the project.
    3) On the left panel, hover over the "IAM & Admin" tab and click the "Service Accounts" option.
    4) In the list of 'Service accounts for project "cyark-data-platform"', click the 
       "cyark-data-platform@appspot.gserviceaccount.com" account link.
    5) Navigate to the "Keys" tab.
    6) Click "Add Key" and choose the JSON option. Click "Create". A .json file will be downloaded.
    7) Save this .json file in a safe location and note the path to the file.
    8) Refer to this Google Cloud Documentation: 
       https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication for setting up 
       authentication. Read the section about using Linux or macOS/Windows.


### DOI Full Upload
| Section | Details | 
| ------------- | ------------- |
| Script Name | doi_full_upload.py |
| Description | Uploads new DOI entries to the Open Heritage Database and Datacite | 
| Requirements | Step 5 of the Installation section |
| Command Line Example | `doi_full_upload.py [target_path.csv]`  |
| Example Explanation | [target_path.csv] is the path to either the CSV of the new entry or directory of multiple CSV new entries, which will be added to the Open Heritage Database and Datacite. |
| Extra Information | If [target_path.csv] is a directory, then it will ignore non-CSV files.  |


### MySQL Uploader
| Section | Details | 
| ------------- | ------------- |
| Script Name | mysql_uploader.py |
| Description | Adds a new entry to the Open Heritage HeidiSQL Database. This tool only works for new entries; it cannot be used to update an existing dataset. | 
| Requirements | Step 5 of the Installation section |
| Command Line Example | `mysql_uploader.py [new_entry.csv]`  |
| Example Explanation | [new_entry.csv] is the path to the CSV of the new entry, to add the new entry to the Open Heritage Database. |
| Extra Information | If you need a list of the CSV files in a folder, use the following command in the Windows command prompt: `dir /s/b *.csv >list.txt`. |


### DOI Uploader
| Section | Details | 
| ------------- | ------------- |
| Script Name | doi_uploader.py |
| Description | Updates Datacite with local csv data. | 
| Requirements | Step 1 of the Setup section. Scripts: xml_generator.py and database_parser.py. |
| Command Line Example | `python3 doi_uploader.py [optional new_entry.csv]`  |
| Example Explanation | Without [new_entry.csv], the script will update Datacite with the data contained within "Organizations.csv", "Project_Entities.csv", and "Projects_MASTER.csv". Including [new_entry.csv] adds the new entry to Datacite. This creates a new DOI for the project, which is also written into the new entry file. |


### XML Generator
| Section | Details | 
| ------------- | ------------- |
| Script Name | xml_generator.py |
| Description | Generates xml files from the data in "Organizations.csv", "Project_Entities.csv", and "Projects_MASTER.csv", for use in uploading the metadata to Datacite. | 
| Requirements | Step 1 of the Setup section. |
| Command Line Example | `python3 xml_generator.py [optional destination_dir]` |
| Example Explanation | [destination_dir] is the path to the directory where the xml files should be written. If omitted, the files will be placed in a default location. |
| Extra Information | The script will create the destination directories if they do not exist at the time of execution. |

### Database Parser
| Section | Details | 
| ------------- | ------------- |
| Script Name | database_parser.py |
| Description | This is not a stand-alone script, but rather a function library that provides functions to parse and process the "Organizations.csv", "Project_Entities.csv", and "Projects_MASTER.csv" files. It is used by xml_generator.py and doi_uploader.py. |

### Link Checker
| Section | Details | 
| ------------- | ------------- |
| Script Name | link_checker.py |
| Description | This script reads the list of DOIs from "Projects_MASTER.csv" and tests each DOI's URL (i.e. https://doi.org/10.26301/"DOI"). It then prints the list of invalid DOIs and their count, and writes the list to a file named "invalid_dois.csv". | 
| Requirements | Step 1 of the Setup section. |
| Command Line Example | `python3 link_checker.py`  |
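
The check described in the table can be sketched roughly as follows. This is an illustration only, not the code in link_checker.py: the function names and the injectable `fetch_status` callable are hypothetical, and the default fetcher uses the standard library `urllib` rather than whichever HTTP client the script actually uses. Only the URL prefix and the output file name come from the description above.

```python
import csv
import urllib.error
import urllib.request

DOI_URL_PREFIX = "https://doi.org/10.26301/"


def default_fetch_status(url):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError (it is a subclass).
        return err.code
    except (urllib.error.URLError, TimeoutError):
        return None


def find_invalid_dois(dois, fetch_status=default_fetch_status):
    """Return the DOIs whose URL does not answer with HTTP 200."""
    return [doi for doi in dois if fetch_status(DOI_URL_PREFIX + doi) != 200]


def write_invalid_dois(invalid, path="invalid_dois.csv"):
    """Write one invalid DOI per row."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows([doi] for doi in invalid)
```

Passing the fetcher as an argument keeps the filtering logic testable without network access. Treating anything other than 200 as invalid is a simplification; some landing pages may reject scripted requests even though the DOI resolves.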

### Project Datasets Checker
| Section | Details | 
| ------------- | ------------- |
| Script Name | project_datasets_checker.py |
| Description | Generates a csv file containing a list of DOI projects that do not have a corresponding Dataset entry. | 
| Requirements | Step 2 of the Setup section. |
| Command Line Example | `python3 project_datasets_checker.py`  |

### Google Bucket Checker
| Section | Details | 
| ------------- | ------------- |
| Script Name | bucket_checker.py |
| Description | Generates two csv files: 1) A list of folder names which are invalid. 2) A list of invalid file paths with their corresponding problem. | 
| Requirements | Step 3 of the Setup section. |
| Command Line Example | `python3 bucket_checker.py` |

### Google Cloud to Dropbox File Transfer
| Section | Details | 
| ------------- | ------------- |
| Script Name | file_transfer.py |
| Description | Transfers files from the Google Cloud repository to Dropbox repository. | 
| Requirements | Step 3 of the Setup section. |
| Command Line Example | `python3 file_transfer.py [optional -a or --All] [Batch Size] [Temp Storage Path]` |
| Example Explanation | [Batch Size] is the maximum download size in GB, so this process will take up at most [Batch Size] GB on the hard drive at any given time. [Temp Storage Path] is the path to the directory where the files are temporarily stored before being uploaded to Dropbox and subsequently deleted. With the -a or --All option, the program attempts to transfer all of the files, regardless of whether they have already been transferred. Otherwise, the program only transfers files that are either missing in Dropbox or have a different file size. |
| Extra Information | After each batch is uploaded, the contents of the [Temp Storage Path] directory are completely deleted. Do not use a directory containing data that you wish to keep. |

### Metadata Survey Filler
| Section | Details | 
| ------------- | ------------- |
| Script Name | metadata_survey_filler.py |
| Description | Generates a partially filled survey csv file. | 
| Requirements | The Make_And_Model_To_Device_Type_Library.csv and Model_To_Camera_Library.csv files.  |
| Command Line Example | `python3 metadata_survey_filler.py [doi_data_folder_path] [filled_survey_path]` |
| Example Explanation | [doi_data_folder_path] is the path to the folder of metadata. [filled_survey_path] is where the filled survey will be created. |
| Extra Information | Some filled data depends on files which contain dictionaries. See [Loading Dictionaries](#loading-dictionaries) below. |

## How it works

### Datacite API

The DOI Uploader script uses the Python requests module to interact with DataCite's REST API and upload the DOI metadata automatically. The following are the main components needed to send requests:
1. Request type - Get, Post, Put, etc. 
* Get - used to retrieve the metadata of a list of DOIs or a specific DOI.
* Post - used to create a new DOI. If a DOI is not supplied, DataCite will create a random new DOI. If a DOI is supplied, it must be available for DataCite to create it.
* Put - used to update an existing DOI. A specific DOI must be supplied to tell DataCite which DOI to update.

`requests.[request type]()` where [request type] is replaced with get, post, put, etc.

2. Request URL - https://api.datacite.org/dois or https://api.datacite.org/dois/10.26301/ (sometimes with a DOI appended)

`requests.[request type]([url])` where [url] is the REST API url.

3. Request Headers - dictionary {"Content-Type": "application/vnd.api+json"}

`requests.[request type]([url], headers=[Request Headers])` where [Request Headers] equals the above dictionary.

4. Authentication - tuple ([Username], [Password]) where [Username] and [Password] are the DataCite account credentials.

`requests.[request type]([url], headers=[Request Headers], auth=[Authentication])` where [Authentication] equals the above tuple.

5. Data - JSON of the data to be sent to DataCite. This is not necessary with a Get request. Use the DataCite REST API documentation for the required JSON format. The actual metadata is generated by the XML Generator script and is then base64 encoded into a string, using the "ascii" and "replace" options.
```
import base64
import json
xml_str = xml_generator.generate_xml()
# Characters outside ASCII are replaced with "?" before base64 encoding
encoded_xml = base64.b64encode(xml_str.encode("ascii", "replace"))
data = json.dumps(generate_json(doi, encoded_xml))
```
Then the full request would be:

`requests.[request type]([url], headers=[Request Headers], auth=[Authentication], data=[Data])` where [Data] equals the above data example. 
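
To make the five components concrete, here is a sketch that assembles them into one call. `build_payload` and `upload_doi` are hypothetical names, not functions from doi_uploader.py, and the JSON layout (`data.type` set to `"dois"`, with `doi` and `xml` under `attributes`) is an assumption based on the DataCite REST API guide; check it against the current documentation and against what `generate_json` actually returns. The `requests` import sits inside `upload_doi` so the payload helper can be tried without the module installed.

```python
import base64
import json

API_URL = "https://api.datacite.org/dois"
HEADERS = {"Content-Type": "application/vnd.api+json"}


def build_payload(doi, xml_str):
    """Return the JSON body for a DOI request (assumed layout, see above)."""
    # b64encode returns bytes, so decode to str before JSON serialization.
    encoded_xml = base64.b64encode(xml_str.encode("ascii", "replace")).decode("ascii")
    return json.dumps(
        {"data": {"type": "dois", "attributes": {"doi": doi, "xml": encoded_xml}}}
    )


def upload_doi(doi, xml_str, username, password, update=False):
    """PUT to update an existing DOI, POST to create a new one."""
    import requests  # third-party module, imported lazily

    data = build_payload(doi, xml_str)
    auth = (username, password)
    if update:
        return requests.put(f"{API_URL}/{doi}", headers=HEADERS, auth=auth, data=data)
    return requests.post(API_URL, headers=HEADERS, auth=auth, data=data)
```

Here `doi` is expected to be the full identifier including the prefix, for example `10.26301/` followed by the suffix. The returned response object carries the status code and body that DataCite sends back.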


### XML Files
The xml generator will generate text or a file in the following format:
```
<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
	<creators>
		<creator>
			<creatorName nameType="Organizational">[Contributor]</creatorName>
			<nameIdentifier nameIdentifierScheme="Other" schemeURI="">[Contributor Url]</nameIdentifier>
		</creator>
		<creator>
			<creatorName nameType="Organizational">[Creator 1]</creatorName>
			<nameIdentifier nameIdentifierScheme="Other" schemeURI="">[Creator 1 Url]</nameIdentifier>
		</creator>
	</creators>
	<titles>
		<title titleType="Other">[Project Title]</title>
	</titles>
	<publisher>OpenHeritage3D</publisher>
	<publicationYear>[Publication Year]</publicationYear>
	<resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
	<subjects>
		<subject schemeURI="http://www.oecd.org/science/inno" valueURI="http://www.oecd.org/science/inno/38235147.pdf">[Subject 1]</subject>
		<subject schemeURI="http://www.oecd.org/science/inno" valueURI="http://www.oecd.org/science/inno/38235147.pdf">[Subject 2]</subject>
		<subject schemeURI="http://www.oecd.org/science/inno" valueURI="http://www.oecd.org/science/inno/38235147.pdf">[Subject 3]</subject>
	</subjects>
	<dates>
		<date dateType="Collected" dateInformation="Start">[Start Date]</date>
		<date dateType="Collected" dateInformation="End">[End Date]</date>
		<date dateType="Submitted" dateInformation="Publish">[Submitted Date]</date>
	</dates>
	<sizes/>
	<formats/>
	<version/>
	<rightsList>
		<rights rightsURI="[Rights URL]">[Rights Name]</rights>
	</rightsList>
	<descriptions>
		<description descriptionType="Abstract">[Description]</description>
	</descriptions>
	<geoLocations>
		<geoLocation>
			<geoLocationPoint>
				<pointLatitude>[Latitude]</pointLatitude>
				<pointLongitude>[Longitude]</pointLongitude>
			</geoLocationPoint>
		</geoLocation>
	</geoLocations>
</resource>
```
The tags that have a plural name may take multiple entries, and the number of entries varies between DOI projects. Only the fields designated with [Field Name] can have different values; everything else is constant across all of the xml files. The following table shows how the csv fields relate to the xml fields:

| XML Field | Description | New Entry Field | Open Heritage Field | 
| ------------- | ------------- | ------------- | ------------- |
| Creators | This field contains the data of 1 or more creators. A creator which has been labeled a "Contributor" will be listed first. | The repeating series of organizationName, organizationURL, and entityType fields | |
| Creator |  The data on a single organization. | organizationName and organizationURL; entityType denotes if the organization is a "Contributor". | Organization Table: organizationName, organizationURL, and Contributor. How these fields are connected to each Project is described in the [Open Heritage Database](#open-heritage-database) section. |
| Project Title | Project Name (primary title) | project_name | Projects_MASTER Table: project_name |
| Publication Year | Year of publication on Open Heritage 3D | Year value in publish_date | Year value in Projects_MASTER Table: publish_date |
| Subjects | Contains a list of Subjects related to the project | comma separated list in keywords | comma separated list in Projects_MASTER Table: keywords |
| Start Date | Date of the start of data collection at the site in MM/DD/YYYY | collection_date_start | Projects_MASTER Table: collection_date_start |
| End Date | Date of the conclusion of data collection at the site in MM/DD/YYYY | collection_date_end |  Projects_MASTER Table: collection_date_end |
| Publish Date | Date of publication on Open Heritage 3D in MM/DD/YYYY | publish_date | Projects_MASTER Table: publish_date |
| RightsList | Can contain more than one entry, but most projects only have one entry listed. Each entry consists of the name of the right and the URL to its description. | license_type and license_link | Projects_MASTER Table: license_type and license_link |
| Description | Contains the project description (the field capture methodology including data capture device(s) and project goals), site description (the cultural significance of the site and a brief description of the site history. Please note any known cultural sensitivities associated with the data), external project link, and additional info link | project_description, site_description, external_project_link and additional_info_link | Projects_MASTER Table: project_description, site_description, external_project_link and additional_info_link |
| Latitude | Latitude at centroid of site in decimal degrees | latitude | Projects_MASTER Table: latitude |
| Longitude | Longitude at centroid of site in decimal degrees | longitude | Projects_MASTER Table: longitude |
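
As an illustration of the mapping in the table, the sketch below builds the `<subjects>` and `<geoLocations>` elements from csv values with the standard library `xml.etree.ElementTree`. The function names are hypothetical and xml_generator.py may be structured differently; only the tag names and attribute values are taken from the template above.

```python
import xml.etree.ElementTree as ET

SUBJECT_ATTRS = {
    "schemeURI": "http://www.oecd.org/science/inno",
    "valueURI": "http://www.oecd.org/science/inno/38235147.pdf",
}


def build_subjects(keywords):
    """Turn a comma-separated keywords string into a <subjects> element."""
    subjects = ET.Element("subjects")
    for keyword in keywords.split(","):
        keyword = keyword.strip()
        if keyword:  # skip empty entries such as a trailing comma
            ET.SubElement(subjects, "subject", SUBJECT_ATTRS).text = keyword
    return subjects


def build_geo_locations(latitude, longitude):
    """Wrap one latitude/longitude pair in a <geoLocations> element."""
    geo_locations = ET.Element("geoLocations")
    point = ET.SubElement(
        ET.SubElement(geo_locations, "geoLocation"), "geoLocationPoint"
    )
    ET.SubElement(point, "pointLatitude").text = str(latitude)
    ET.SubElement(point, "pointLongitude").text = str(longitude)
    return geo_locations
```

Calling `ET.tostring(build_subjects("temple, stone"), encoding="unicode")` returns the element as a string, which is a quick way to compare the output against the template.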


### Open Heritage Database

The Database Parser script parses four tables (csv): Organizations, Projects_MASTER, Project_Entities, and Datasets. Projects_MASTER contains most of the Project metadata, except for the related Organizations and Data Devices. The Project_Entities table maps each DOI/Project Name, which may appear more than once, to an Organization's ID as well as the Organization's role (Authority, Collector, Funder, Partner, or Contributor) in the project. These Organization IDs can then be mapped to an Organization in the Organizations table. For example, this is a common sequence of commands to link the fields across the tables:

`doi_to_ids_dict = database_parser.get_doi_to_ids_dict(project_entities_path)`

This returns a dictionary, where unique DOIs are mapped to a list of ids ((str) DOI -> (list)[(str) id1, (str) id2, ...])

`id_to_organization_dict = database_parser.get_id_to_organization_dict(organizations_path)`

This returns a dictionary, where unique Organization IDs are mapped to an Organization ((str) ID -> (list)[(str) Organization Name, (str) OrganizationURL])

`doi_to_organizations_dict = database_parser.get_doi_to_organizations_dict(doi_to_ids_dict, id_to_organization_dict)`

This effectively combines doi_to_ids_dict and id_to_organization_dict by replacing the Organization ID keys in id_to_organization_dict with the DOI keys and appending the list of the Organization's roles in the project. ((str) DOI -> (list)[(str) Organization Name, (str) OrganizationURL, (int) isAuthority, (int) isCollector, (int) isFunder, (int) isPartner, (int) isContributor])

Now we have all the Organization data for each DOI.
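
The join across the tables can be illustrated with plain dictionaries. The sketch below is a simplified stand-in, not the code in database_parser.py: the function names and the sample rows are made up, and the role flags are left out because the way the real functions carry them is not shown here.

```python
from collections import defaultdict


def group_ids_by_doi(entity_rows):
    """Map each DOI to the list of Organization IDs linked to it."""
    doi_to_ids = defaultdict(list)
    for doi, organization_id in entity_rows:
        doi_to_ids[doi].append(organization_id)
    return dict(doi_to_ids)


def join_organizations(doi_to_ids, id_to_organization):
    """Replace each Organization ID with its [name, url] record."""
    return {
        doi: [id_to_organization[i] for i in ids if i in id_to_organization]
        for doi, ids in doi_to_ids.items()
    }


# Made-up sample data for illustration only.
entity_rows = [("doi-a", "1"), ("doi-a", "2"), ("doi-b", "2")]
id_to_organization = {
    "1": ["Example Org One", "https://one.example"],
    "2": ["Example Org Two", "https://two.example"],
}
doi_to_organizations = join_organizations(
    group_ids_by_doi(entity_rows), id_to_organization
)
```

IDs that have no match in the Organizations data are silently skipped in this sketch; a real parser might prefer to report them.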

The Datasets table contains information on the datasets for each DOI. For example, this is how to connect DOIs to the Dataset data:
```
# list of dois from Projects_MASTER
dois = [...]
doi_to_datasets_dict = database_parser.get_doi_to_datasets_dict("Datasets.csv")
```
This returns a dictionary, where unique DOIs are mapped to a list of Dataset fields ((str) DOI -> (list)[(str) project_name, (str) dataType, (str) derivativeType, (str) latitude_top_left, (str) longitude_top_left, (str) latitude_bottom_right, (str) longitude_bottom_left, (str) dataSize])

### Google Cloud Storage API
To access a Google Cloud Bucket, use the following code:
```
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
blobs = bucket.list_blobs()
```
BUCKET_NAME is the full name of the bucket, for example "cyark-data-platform.appspot.com".
blobs is an iterator over the objects in the bucket, with one entry for each file and folder.

Here is more documentation on how to use the Google Cloud Storage API: https://cloud.google.com/storage/docs/listing-objects#code-samples
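
Once the blobs are available, they can be processed like any other iterable. The following sketch sums the object sizes per top-level folder; `size_by_top_level_folder` is a hypothetical helper that is not part of bucket_checker.py. It only relies on each object having `name` and `size` attributes, which the blobs returned by the storage client provide, so stand-in objects are used here to keep the example runnable without credentials.

```python
from collections import defaultdict
from types import SimpleNamespace


def size_by_top_level_folder(blobs):
    """Sum object sizes (bytes) per top-level folder in the bucket."""
    totals = defaultdict(int)
    for blob in blobs:
        # Object names use "/" as a separator; objects at the bucket root
        # are grouped under the empty string.
        folder = blob.name.split("/", 1)[0] if "/" in blob.name else ""
        totals[folder] += blob.size or 0
    return dict(totals)


# Stand-in objects with made-up names and sizes.
fake_blobs = [
    SimpleNamespace(name="site_a/", size=0),
    SimpleNamespace(name="site_a/scan_01.e57", size=1200),
    SimpleNamespace(name="site_a/photos/img_01.jpg", size=300),
    SimpleNamespace(name="readme.txt", size=10),
]
print(size_by_top_level_folder(fake_blobs))
```

With a real bucket, the same function can be called as `size_by_top_level_folder(bucket.list_blobs())`.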

### File Transfer
This script uses the Google Cloud and Dropbox APIs. 

Without the -a or --All option, the program compares the Google Cloud and Dropbox repositories and finds which files from Google Cloud need to be transferred, by looking for missing files or files with mismatched sizes. Connections sometimes fail, so the program retries the whole transfer process, only for the missing or mismatched files, until it detects that both repositories are equal.
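
The comparison and retry logic can be sketched as below. The function names are hypothetical and the inputs are plain dictionaries mapping a file path to its size in bytes, standing in for the listings that the Google Cloud and Dropbox APIs would return. The `max_rounds` cap is an addition for this sketch so that the loop always terminates; it is not a documented option of file_transfer.py.

```python
def files_to_transfer(cloud_sizes, dropbox_sizes, transfer_all=False):
    """Return the paths that still need to be copied to Dropbox.

    Both arguments map a file path to its size in bytes.
    """
    if transfer_all:
        return sorted(cloud_sizes)
    return sorted(
        path
        for path, size in cloud_sizes.items()
        if dropbox_sizes.get(path) != size  # missing or mismatched size
    )


def transfer_until_synced(cloud_sizes, list_dropbox_sizes, transfer, max_rounds=5):
    """Retry the transfer for pending files until both sides match."""
    for _ in range(max_rounds):
        pending = files_to_transfer(cloud_sizes, list_dropbox_sizes())
        if not pending:
            return True
        transfer(pending)
    return not files_to_transfer(cloud_sizes, list_dropbox_sizes())
```

`list_dropbox_sizes` is called again on every round so that files that arrived during the previous attempt are no longer counted as pending.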

Files are downloaded from Google Cloud and uploaded to Dropbox in batches, with the max size set by the user.
For each batch:
1. Compile a list of the files that will be included in the current batch, by accessing the size data for each file from Google Cloud.
2. Download the files in the current batch from Google Cloud to the given Temporary Storage Directory. This operation uses multithreading.
3. Upload the files in the Temporary Storage Directory to Dropbox. This operation uses multithreading.
4. Delete all the files in the Temporary Storage Directory.
5. Repeat until all files in Google Cloud have been transferred.

The multithreading operations work by using simple functions (upload_file() and download_file()), which are then run in parallel. The default setting is 10 threads, but this is not necessarily the optimal number of threads with regard to execution speed.
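
The batching and threading steps might look roughly like the sketch below. It is not taken from file_transfer.py: `make_batches` and `run_in_parallel` are hypothetical names, one GB is assumed to be 1024³ bytes, and a file larger than the limit is given a batch of its own, which is a choice made for this sketch rather than documented behavior.

```python
from concurrent.futures import ThreadPoolExecutor

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def make_batches(file_sizes, batch_size_gb):
    """Group (path, size_in_bytes) pairs into batches under the size limit.

    A single file larger than the limit gets a batch of its own.
    """
    limit = batch_size_gb * BYTES_PER_GB
    batches, current, current_size = [], [], 0
    for path, size in file_sizes:
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def run_in_parallel(worker, paths, threads=10):
    """Call worker(path) for every path using a pool of threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces completion and re-raises any worker exception.
        return list(pool.map(worker, paths))
```

For each batch, a transfer loop would call `run_in_parallel` once with a download function and once with an upload function, then clear the temporary directory before moving on to the next batch.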
  
### Loading Dictionaries
Some scripts require loading external dictionaries, which are csv files that contain data linking one value to another. Generally, the first line labels what kind of values go in each column and is there for readability. The data of such dictionaries is then loaded into Python dicts in the script.

To change the location of the dictionaries, change the corresponding dictionary path in the script. To change the dictionary itself, simply edit the corresponding csv file, following typical csv formatting rules.
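
A minimal sketch of such a loader, assuming a two-column csv with a header line, is shown below. `load_dictionary` is a hypothetical name and the sample contents are made up; the scripts may load their dictionaries differently.

```python
import csv
import io


def load_dictionary(csv_file):
    """Read a two-column csv into a dict, skipping the header line."""
    reader = csv.reader(csv_file)
    next(reader, None)  # the first line only labels the columns
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


# In-memory stand-in for a dictionary file; the contents are made up.
sample = io.StringIO(
    "Model,Camera Name\nModel X1,Example Camera\nModel Z9,Other Camera\n"
)
model_to_camera = load_dictionary(sample)
```

For a file on disk, open it with `open(path, newline="")` and pass the file object to `load_dictionary`.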

## Troubleshooting
* If you are getting a FileNotFoundError, make sure that each of the tables is named correctly as specified above.
* If you are getting this error: "google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started", make sure that the Google Cloud Authentication instructions were followed properly.
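
A quick way to narrow down the credentials error is to check the environment variable named in the message. The helper below is a hypothetical diagnostic, not part of this repository, and it only covers the environment-variable setup; credentials can also be supplied in other ways, so a clean result here does not guarantee that authentication will succeed.

```python
import os


def check_google_credentials(environ=os.environ):
    """Return a message describing the credentials setup problem, or None."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set."
    if not os.path.isfile(path):
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {path}"
    return None
```

Running `print(check_google_credentials())` in the same shell that launches the scripts shows whether the variable is visible to Python.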


## Resources
* DataCite REST API Guide: https://support.datacite.org/docs/api
* Google Cloud Storage Authentication: https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication
* Google Cloud Storage Listing Objects: https://cloud.google.com/storage/docs/listing-objects#code-samples
* Dropbox Documentation: https://www.dropbox.com/developers/documentation/python#overview
