
Commit a663a27

Authored by j3su5pro-intel, flezaalv, aagalleg, gera-aldama and ma-pineda
Data Connector HF for public repo (#1331)
* Changes on interconnection project and setup
* Changes to ignore copied files
* Adding fix to handle encoding issues
* Adding changes to readme
* Taking current folder to get packages
* Fix typo (×2)
* Fixes
* Fix typo
* Interconnection sample
* Hotfix for Python 3.8
* Fix for setup code
* Ignoring sample file
* Renaming interconnection correctly
* Renaming
* Renaming correctly
* Renaming again; last change deletes setup
* Fix on upgrade for pip, wheel and setuptools
* Interoperability POC sample fix
* Fix for .sample copy
* Setup for bash
* Hotfix: change "test" strings to "test_unittest" so that unit tests won't use files used by samples
* Fixing requirements and setup
* Typo fix
* Removing wheel update from setup
* Fixes the problem with uploading a dataset to GCP
* Fixed unwanted changes on main branch (×2)
* Fixed AWS functional test name
* Upload a folder using the AWS connector
* Change on package name: finish editing interconnection to rename it to interoperability
* Last changes on interoperability are applied
* Updated license
* Removing code of conduct reference; we don't have one
* Skips row 0 from Excel when creating the dataframe
* Refactor
* Modification on refactoring
* Delete unused gitignore
* Change access keys format
* Refactored names
* Removing old name
* First version of license header
* Removing readme files on this branch
* Deleting files created from setup
* Headers
* Removing not-implemented packages
* Adding headers into main packages
* Adding headers on sample code
* Updating files for publishing
* Get GCP credentials (×4)
* Changed setup.sh file
* Removed init file on interoperability folder
* Updating readme
* Removing bad folder
* Big refactoring: moving data_connector into datasets
* Complete sample link
* Test WF
* Fixed path for unit tests
* Changed trigger to PR
* Create sample link
* Merging readmes (×3)
* Removing license only for data connector
* Missed recursive flag
* Delete unused names
* Ignore outputs of Jupyter notebook
* Removing commented block
* Removing deprecated folder
* Include headers in init files
* Adding headers in all tests
* Removing coverage omit from tox configuration file (this only works on the MZ env)
* Removing empty lines
* Removed GCP auth commands instead of commenting them
* Adding headers
* Removed sensitive information on GCP
* Removing values to make it easier for users to fill them in
* Solving names for the public repo
* Removing default values
* Removing sample values
* Removing values
* Fixing path on script
* Removing extra files
* Add security file
* Updating files and structure for publishing
* Updates for packaging
* Updating readme
* Updating metadata
* Updating source code
* Updating gitignore (×2)
* Updating metadata
* Updating gitignore
* Removing extracted files
* Updating repo
* Removing dataset egg-info
* Updating file permissions (×2)
* Removing key
* Merging from parent (×4)
* Updating imports
* Updating readme
* Removing error
* Removing data_connector changes (×9)
* Updating conda recipes (×2)
* Fixed bugs in setup.sh; added Azure src and dependencies
* Removing conda folders
* Updating blank space at the end of files
* Updating readme
* Validation/scans (#56)
* Fixed dataset_api requirements file
* Merging from data_connector
* Updating gitignore
* Fixing gitignore
* Returning dependencies
* Returning training code
* Creating and renaming sample files
* Adding format
* New readme proposals
* Fix on toml to avoid refactor
* Readme agenda
* Conda folder is inevitable
* Exclude conda and egg folders
* Adding badges in main readme; we will see if we should use RST format for the main readme only
* Simple entry point for sample doc
* Change header for sub_linked section
* Modifications to current class invocation
* Adding relative link to documentation in AWS main readme file
* Terms and conditions requirements update
* Changes on Azure readme file
* Removing previous terms and conditions
* Updating path for datasets_urls (×2)
* Removing data connector changes
* Updating blank last line
* Updated documentation with current code functionality
* Update documentation
* Added code sample for upload, download and list blobs with OAuth
* First definition in dcp readme for BigQuery
* Sample connection with OAuth
* Adding readme sample for service account connection with GCP
* Connection documentation finished
* Updating TPP file
* Updating with feedback (×4)
* Restoring lost changes for conda recipes
* Updating conda recipes (×2)
* Updating conda recipe
* Updating conda description
* Updating changes from data_connector
* Updating conda recipe
* Hotfix for bad import
* Hotfix: binary storage stream downloads should be written to files as binary
* Updating gitignore
* Patching data connector to 1.0.1 (×2)
* Updating recipes
* Fix for toml `where` for package build
* Ignoring build folder
* The toml file is always included; it makes no sense to exclude it
* Fix typo in conda install command
* Fix Apache version name
* Fixing typo in conda description
* Fixing typo in conda recipe meta.yaml
* Removing spaces for consistency

---------

Signed-off-by: Felipe Leza Alvarez <[email protected]>
Co-authored-by: Felipe Leza Alvarez <[email protected]>
Co-authored-by: aagalleg <[email protected]>
Co-authored-by: Gerardo Dominguez <[email protected]>
Co-authored-by: Leza Alvarez, Felipe <[email protected]>
Co-authored-by: Miguel Pineda <[email protected]>
Co-authored-by: ma-pineda <[email protected]>
Co-authored-by: gera-aldama <[email protected]>
1 parent 6270dd0 commit a663a27

File tree

8 files changed: +94 −101 lines changed

datasets/data_connector/.gitignore

Lines changed: 3 additions & 1 deletion
@@ -18,5 +18,7 @@ inspect_package-pip/
 build/
 
 # conda
+conda/local-channel/
 conda/local_channel/
-conda/extracted
+conda/extracted/
+conda/build/

datasets/data_connector/conda/conda_recipe/meta.yaml

Lines changed: 0 additions & 57 deletions
This file was deleted.
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+Data connector is a tool to connect to AzureML, Azure Blob, GCP Storage, GCP BigQuery and AWS S3 storage. The goal is to provide all cloud managers in one place, with documentation for an easy integration.
+
+***
+
+### Prerequisites
+Have either Python `3.8`, `3.9` or `3.10` already installed.
+
+***
+
+### Installation Command
+```bash
+conda install cloud-data-connector -c microsoft -c intel -c conda-forge
+```
+
+***
+
+### PyPI Package
+[Here](https://pypi.org/project/cloud-data-connector/)
+
+***
+
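One detail worth surfacing from the recipe and `pyproject.toml` later in this commit: the distribution is published as `cloud-data-connector`, but the importable package is `data_connector`. A minimal smoke test, assuming the install command above succeeded:

```python
# The PyPI/conda name is cloud-data-connector, but (per pyproject.toml's
# include pattern and the conda recipe's `test: imports:` section) the
# import package is data_connector.
import data_connector

print(data_connector.__file__)  # confirms which installed copy was imported
```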

datasets/data_connector/conda/pacakges.yaml

Lines changed: 0 additions & 30 deletions
This file was deleted.
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+{% set name = "cloud-data-connector" %}
+{% set version = "1.0.1" %}
+
+package:
+  name: {{ name|lower }}
+  version: {{ version }}
+
+source:
+  url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/cloud_data_connector-{{ version }}.tar.gz
+  sha256: c0e23333b9d3b021a94516dc4a67b47abbdd18c5f011ceec834acd2b673988a5
+
+build:
+  noarch: python
+  script: |
+    {{ PYTHON }} -m pip install . -vv
+  number: 0
+
+requirements:
+  host:
+    - python>=3.8,<3.11
+    - setuptools>=61.0
+    - setuptools-scm
+    - pip
+  run:
+    - python>=3.8,<3.11
+    # - azureml>=0.2.7  # not available in conda
+    - azure-ai-ml>=2023.06.01  # microsoft only
+    - azure-core>=2023.06.01
+    - azure-identity>=2023.06.01
+    - azure-storage-blob>=1.4.1
+    # - azureml-core>=1.49.0  # not available in conda
+    - boto3>=1.26.154
+    - google-api-core>=2.0.0
+    - google-auth>=1.33.0
+    - google-auth-oauthlib>=0.4.1
+    - google-cloud-bigquery>=2.1.0
+    - google-cloud-storage>=2.1.0
+    - packaging>=21.3
+    - python-dotenv>=1.0.0
+
+test:
+  imports:
+    - data_connector
+
+about:
+  summary: 'Data connector is a tool to connect to AzureML, Azure Blob, GCP Storage, GCP BigQuery and AWS S3 storage. The goal is to provide all cloud managers in one place, with documentation for an easy integration.'
+  license: 'Apache License, Version 2.0'
+  about_license_url: https://www.apache.org/licenses/LICENSE-2.0.html
+
+extra:
+  recipe-maintainers:
+    - Jose de Jesus Herrera Ledon <[email protected]>
+    - Alberto Gallegos Muro <[email protected]>
+    - Felipe Leza Alvarez <[email protected]>
+    - Miguel Pineda Juarez <[email protected]>
+    - Gerardo Dominguez Aldama <[email protected]>
datasets/data_connector/data_connector/azure/downloader.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ def download(
         storage_stream_downloader = blob_container_client.download_blob(
             data_file
         ).readall()
-        with open(destiny, mode="w") as downloaded_blob:
+        with open(destiny, mode="wb") as downloaded_blob:
             downloaded_blob.write(storage_stream_downloader)
         self.container_client = blob_container_client
         return blob_container_client
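The fix is one character but load-bearing: `readall()` on the Azure storage stream downloader returns `bytes`, and a file opened in text mode refuses `bytes`. A self-contained sketch of the failure and the fix (no Azure account needed; the payload stands in for a downloaded blob):

```python
# readall() yields bytes; writing bytes to a text-mode file raises TypeError,
# which is why the downloader now opens its destination with mode="wb".
payload = b"\x89PNG\r\n\x1a\n"  # stand-in for downloaded binary blob content

try:
    with open("blob.out", mode="w") as f:   # old behavior: text mode
        f.write(payload)
except TypeError as err:
    print(f"text mode fails: {err}")

with open("blob.out", mode="wb") as f:      # fixed behavior: binary mode
    f.write(payload)                        # bytes are written verbatim
```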

datasets/data_connector/pyproject.toml

Lines changed: 12 additions & 12 deletions
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "data_connector"
-version = "1.0.0"
+version = "1.0.1"
 requires-python = ">=3.8,<3.11"
 authors = [
     { name="IntelAI", email="[email protected]"}
@@ -23,22 +23,22 @@ classifiers = [
 dependencies = [
     "azureml>=0.2.7",
     "azure-ai-ml>=1.4.0",
-    "azure-storage-blob>=12.14.1",
+    "azureml-core>=1.49.0",
     "azure-identity>=1.12.0",
+    "azure-storage-blob>=1.4.1",
     "azure-core>=1.26.3",
-    "azureml-core>=1.49.0",
-    "boto3>=1.26.65",
-    "google-api-core>=2.11.0",
-    "google-auth>=2.16.2",
-    "google-auth-oauthlib>=1.0.0",
-    "google-cloud-bigquery>=3.7.0",
-    "google-cloud-storage>=2.7.0",
-    "packaging<22.0,>=20.0",
+    "boto3>=1.26.154",
+    "google-api-core>=2.0.0",
+    "google-auth>=1.33.0",
+    "google-auth-oauthlib>=0.4.1",
+    "google-cloud-bigquery>=2.1.0",
+    "google-cloud-storage>=2.1.0",
+    "packaging>=21.3",
     "python-dotenv>=1.0.0"
 ]
 
 [tool.setuptools.packages.find]
-where = ["data_connector"] # list of folders that contain the packages (["."] by default)
+where = ["."] # list of folders that contain the packages (["."] by default)
 include = ["data_connector*"]
-exclude = ["data_connector.egg-info", "pyproject.toml"]
+exclude = ["data_connector.egg-info"]
 namespaces = false
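The `[tool.setuptools.packages.find]` change searches from the project root rather than inside `data_connector/`, which is what lets the `data_connector*` include pattern actually match the import package. A hedged sketch of what discovery now returns, assuming it runs from `datasets/data_connector/` with the layout in this commit:

```python
# Mirrors the new find directives in pyproject.toml; run from the project root.
from setuptools import find_packages

packages = find_packages(
    where=".",                             # search from the project root
    include=["data_connector*"],           # the import package and subpackages
    exclude=["data_connector.egg-info"],   # build metadata is not a package
)
print(packages)  # e.g. ['data_connector', 'data_connector.azure', 'data_connector.gcp', ...]
```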

datasets/data_connector/samples/gcp/bigquery.py

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@
 from data_connector.gcp.query import Query
 from dotenv import load_dotenv
 from google.cloud import bigquery
+from google.api_core.exceptions import BadRequest
 
 load_dotenv()
 
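The added import gives the sample a way to catch queries that BigQuery rejects instead of crashing. A hedged sketch of that pattern, not the sample's actual code; the query string is hypothetical and the client assumes application-default credentials:

```python
from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials are set up

try:
    # Hypothetical query for illustration; invalid SQL raises BadRequest.
    rows = client.query("SELECT name FROM dataset.table LIMIT 5").result()
    for row in rows:
        print(row.name)
except BadRequest as err:
    print(f"BigQuery rejected the query: {err}")
```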