diff --git a/01_EXTRACT_WITH_TAPS.md b/01_EXTRACT_WITH_TAPS.md
new file mode 100644
index 0000000..feb711b
--- /dev/null
+++ b/01_EXTRACT_WITH_TAPS.md
@@ -0,0 +1,117 @@
+# 🍺 All about TAPS 🍺
+
+## Taps extract data from any source and write that data to a standard stream in a JSON-based format.
+
+Check out our [official](04_MAKE_IT_OFFICIAL.md) and [unofficial](03_COOL_TAPS_CLUB.md) pages before creating your own, since it might save you some time in the long run.
+
+### Making Taps
+
+If a tap for your use case doesn't exist yet, have no fear! This documentation will help. Let's get started:
+
+### 👩🏽‍💻 👨🏻‍💻 Hello, world
+
+A Tap is just a program, written in any language, that outputs data to `stdout` according to the [Singer spec](06_SPEC.md).
+
+In fact, your first Tap can be written from the command line, without any programming at all:
+
+```bash
+› printf '{"type":"SCHEMA", "stream":"hello","key_properties":[],"schema":{"type":"object", "properties":{"value":{"type":"string"}}}}\n{"type":"RECORD","stream":"hello","schema":"hello","record":{"value":"world"}}\n'
+```
+
+This writes the datapoint `{"value":"world"}` to the *hello* stream along with a schema indicating that `value` is a string.
+
+That data can be piped into any Target, like the [Google Sheets Target](https://github.com/singer-io/target-gsheet), over `stdin`:
+
+```bash
+› printf '{"type":"SCHEMA", "stream":"hello","key_properties":[],"schema":{"type":"object", "properties":{"value":{"type":"string"}}}}\n{"type":"RECORD","stream":"hello","schema":"hello","record":{"value":"world"}}\n' | target-gsheet -c config.json
+```
+
+### 🐍🐍🐍 A Python Tap
+
+To move beyond *Hello, world* you'll need a real programming language. Although any language will do, we have built a Python library to help you get up and running quickly. This is because Python is the de facto standard for data engineers and for folks interested in moving data, like yourself.
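Before bringing in the helper library, it may help to see the *Hello, world* messages produced from a script instead of `printf`. The sketch below is only an illustration using Python's standard `json` module; the helper name `emit` and the two constants are made up for this example and are not part of any Singer library. It writes one JSON object per line, schema first and then the record, for the same `hello` stream used above.

```python
import json
import sys


def emit(message, out=sys.stdout):
    """Write one Singer-style message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")
    out.flush()


# Hypothetical messages for the same `hello` stream as the printf example.
SCHEMA_MESSAGE = {
    "type": "SCHEMA",
    "stream": "hello",
    "key_properties": [],
    "schema": {"type": "object", "properties": {"value": {"type": "string"}}},
}
RECORD_MESSAGE = {
    "type": "RECORD",
    "stream": "hello",
    "record": {"value": "world"},
}

if __name__ == "__main__":
    emit(SCHEMA_MESSAGE)  # schema first, as in the printf example
    emit(RECORD_MESSAGE)  # then the single record
```

Saved under a name of your choosing, the script's output could be piped into a Target in the same way as the `printf` command.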
+
+If you need help ramping up or getting started with Python, there's fantastic community support [here](https://www.python.org/about/gettingstarted/).
+
+Let's write a Tap called `tap_ip.py` that retrieves the current IP using icanhazip.com, and writes that data with a timestamp.
+
+First, install the Singer helper library with `pip`:
+
+```bash
+› pip install singer-python
+```
+
+Then, open up a new file called `tap_ip.py` in your favorite editor.
+
+```python
+import singer
+import urllib.request
+from datetime import datetime, timezone
+```
+
+We'll use the `datetime` module to get the current timestamp, the
+`singer` module to write data to `stdout` in the correct format, and
+the `urllib.request` module to make a request to icanhazip.com.
+
+```python
+now = datetime.now(timezone.utc).isoformat()
+schema = {
+    'properties': {
+        'ip': {'type': 'string'},
+        'timestamp': {'type': 'string', 'format': 'date-time'},
+    },
+}
+
+```
+
+This sets up some of the data we'll need - the current time, and the
+schema of the data we'll be writing to the stream, formatted as a
+[JSON Schema](http://json-schema.org/).
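As a quick sanity check on the values set up here, the sketch below uses only the standard library to confirm that the timestamp is an ISO 8601 string with an explicit UTC offset, which is the shape a `date-time` property is meant to hold, and that a sample record uses exactly the keys the schema declares. The names `parsed` and `sample` are just for illustration, and the IP address is a reserved documentation placeholder rather than real output.

```python
from datetime import datetime, timedelta, timezone

# Same construction as in the Tap: an aware UTC timestamp as an ISO 8601 string.
now = datetime.now(timezone.utc).isoformat()

schema = {
    'properties': {
        'ip': {'type': 'string'},
        'timestamp': {'type': 'string', 'format': 'date-time'},
    },
}

# Round-trip the string to confirm it parses and carries a zero UTC offset.
parsed = datetime.fromisoformat(now)
assert parsed.utcoffset() == timedelta(0)

# A placeholder record (192.0.2.1 is a reserved documentation address).
sample = {'timestamp': now, 'ip': '192.0.2.1'}

# Every key in the record is declared in the schema, and every value is a string.
assert set(sample) == set(schema['properties'])
assert all(isinstance(value, str) for value in sample.values())
print(sample)
```

This is not a substitute for validating records against the schema with a JSON Schema validator; it only checks the two properties this Tap relies on.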
+
+```python
+with urllib.request.urlopen('http://icanhazip.com') as response:
+    # Strip surrounding whitespace from the response body to get the bare IP.
+    ip = response.read().decode('utf-8').strip()
+    # Write the stream's schema (keyed on timestamp), then one record.
+    singer.write_schema('my_ip', schema, 'timestamp')
+    singer.write_records('my_ip', [{'timestamp': now, 'ip': ip}])
+```
+
+Finally, we make the HTTP request, parse the response, and then make
+two calls to the `singer` library:
+
+ - `singer.write_schema` which writes the schema of the `my_ip` stream and defines its primary key
+ - `singer.write_records` to write a record to that stream
+
+We can send this data to Google Sheets as an example by running our new Tap
+with the Google Sheets Target:
+
+```
+› python tap_ip.py | target-gsheet -c config.json
+```
+
+Alternatively, you could send it to a CSV just as easily:
+
+```
+› python tap_ip.py | target-csv -c config.json
+```
+
+## To summarize, the formula for pulling with a tap and sending to a target is:
+
+```
+› python YOUR_TAP_FILE.py -c TAP_CONFIG_FILE_HERE.json | TARGET-NAME -c TARGET_CONFIG_FILE_HERE.json
+```
+
+You might not always need config files, in which case it would just be:
+
+```
+› python YOUR_TAP_FILE.py | TARGET-NAME
+```
+
+This assumes your target is installed locally, which you can read more about on the [targets page](02_SEND_TO_TARGETS.md).
diff --git a/02_SEND_TO_TARGETS.md b/02_SEND_TO_TARGETS.md
new file mode 100644
index 0000000..1398506
--- /dev/null
+++ b/02_SEND_TO_TARGETS.md
@@ -0,0 +1,49 @@
+# 🎯 All about TARGETS 🎯
+
+
+## Targets are very similar to TAPS in that they still adhere to the Singer spec.
+
+Right now there are targets made to send to a CSV, Google Sheets, Magento, or Stitch, but the possibilities are endless.
To send your tap to a target, here is an example using Google Sheets:
+
+
+```bash
+› tap-ip | target-gsheet -c config.json
+```
+
+Alternatively, you could send it to a CSV just as easily:
+
+```bash
+› tap-ip | target-csv -c config.json
+```
+
+To summarize, the formula for pulling with a tap and sending to a target is:
+
+```bash
+› TAP-NAME -c TAP_CONFIG_FILE_HERE.json | TARGET-NAME -c TARGET_CONFIG_FILE_HERE.json
+```
+
+You might not always need config files, in which case it would just be:
+
+```bash
+› TAP-NAME | TARGET-NAME
+```
+See? Easy.
+
+## If you'd like to create your own TARGET it's just like building a tap.
+
+Essentially, a target consumes the messages that a tap outputs and then determines what to do with them and where to send them.
+
+Before it is packaged up, you would run it in your dev environment like this:
+```bash
+› TAP-NAME -c TAP_CONFIG_FILE_HERE.json | python my_target.py -c TARGET_CONFIG_FILE_HERE.json
+```
+
+Once both tap and target are bundled as packages, you can install them via pip or your other favorite package system, and then it's just this:
+
+```bash
+› TAP-NAME | TARGET-NAME
+```
diff --git a/03_COOL_TAPS_CLUB.md b/03_COOL_TAPS_CLUB.md
new file mode 100644
index 0000000..5407e9d
--- /dev/null
+++ b/03_COOL_TAPS_CLUB.md
@@ -0,0 +1,63 @@
+## THE OFFICIAL (AND UNOFFICIAL) COLLECTION OF TAPS
+
+### Defining Official and Unofficial
+
+For a tap or target to be official, the Singer team must have reviewed the work and confirmed that it conforms to the best practices. Official means a stamp of review and approval. It also means the repo is put into the Singer GitHub organization. For transparency, the Singer team is composed of [Stitch](https://stitchdata.com) employees.
+
+Regardless of being official or unofficial you can still move or pull all the data you want.
*This is why unofficial taps are so important and why we value them so much!*
+
+### If you've created a tap or target, be sure to submit a pull request to our unofficial table and show the world your work.
+Also be sure to drop us a line so we can send you *EPIC SWAG* 🎁
+
+Without further ado, here are the unofficial taps:
+
+
+| TYPE | NAME + REPO | USER 👨🏽‍💻 👩🏻‍💻 👑 |
+| ---- | ----------- | ---- |
+| tap | [tap-csv](https://github.com/robertjmoore/tap-csv) | [robertjmoore](https://github.com/robertjmoore/) |
+| tap | [tap-clubhouse](https://github.com/envoy/tap-clubhouse) | [envoy](https://github.com/envoy) |
+| tap | [tap-shippo](https://github.com/robertjmoore/tap-shippo) | [robertjmoore](https://github.com/robertjmoore/) |
+| tap | [singer-airtable](https://github.com/StantonVentures/singer-airtable) | [StantonVentures](https://github.com/stantonventures) |
+| tap | [tap-s3-csv](https://github.com/fishtown-analytics/tap-s3-csv) | [fishtown-analytics](https://github.com/fishtown-analytics) |
+| tap | [tap-jsonfeed](https://github.com/briansloane/tap-jsonfeed) | [briansloane](https://github.com/briansloane/) |
+| tap | [tap-reviewscouk](https://github.com/onedox/tap-reviewscouk) | [onedox](https://github.com/onedox) |
+| tap | [tap-fake-users](https://github.com/bengarvey/tap-fake-users) | [bengarvey](https://github.com/bengarvey) |
+| tap | [tap-awin](https://github.com/onedox/tap-awin) | [onedox](https://github.com/onedox) |
+| tap | [marvel-tap](https://github.com/ashhath/marvel-tap) | [ashhath](https://github.com/ashhath) |
+| tap | [tap-mixpanel](https://github.com/Kierchon/tap-mixpanel) | [kierchon](https://github.com/kierchon) |
+| tap | [tap-appsflyer](https://github.com/ezcater/tap-appsflyer) | [ezcater](https://github.com/ezcater) |
+| tap | [tap-fullstory](https://github.com/expectedbehavior/tap-fullstory) | [expectedbehavior](https://github.com/expectedbehavior) |
+| tap | [stitch-stream-deputy](https://github.com/DeputyApp/stitch-stream-deputy) | [deputyapp](https://github.com/deputyapp) |
+
+And then, just in case, here's a tidy list of the official ones integrated with and supported by Stitch:
+
+| TYPE | NAME + REPO | CONTRIBUTOR |
+| ---- | ----------- | ----------- |
+| tap | [Hubspot](https://github.com/singer-io/tap-hubspot) | [Stitch Data](https://stitchdata.com) |
+| tap | [Marketo](https://github.com/singer-io/tap-marketo) | [Stitch Data](https://stitchdata.com) |
+| tap | [Shippo](https://github.com/singer-io/tap-shippo) | [Robert J Moore](https://github.com/robertjmoore/) |
+| tap | [GitHub](https://github.com/singer-io/tap-github) | [Stitch Data](https://stitchdata.com) |
+| tap | [Close.io](https://github.com/singer-io/tap-closeio) | [Stitch Data](https://stitchdata.com) |
+| tap | [Referral SaaSquatch](https://github.com/singer-io/tap-referral-saasquatch) | [Stitch Data](https://stitchdata.com) |
+| tap | [Freshdesk](https://github.com/singer-io/tap-freshdesk) | [Stitch Data](https://stitchdata.com) |
+| tap | [Braintree](https://github.com/singer-io/tap-braintree) | [Stitch Data](https://stitchdata.com) |
+| tap | [GitLab](https://github.com/singer-io/tap-gitlab) | [Stitch Data](https://stitchdata.com) |
+| tap | [Wootric](https://github.com/singer-io/tap-wootric) | [Stitch Data](https://stitchdata.com) |
+| tap | [Fixer.io](https://github.com/singer-io/tap-fixerio) | [Stitch Data](https://stitchdata.com) |
+| tap | [Outbrain](https://github.com/singer-io/tap-outbrain) | [Fishtown Analytics](https://github.com/fishtown-analytics) |
+| tap | [Harvest](https://github.com/singer-io/tap-harvest) | [Facet Interactive](https://github.com/facetinteractive) |
+| tap | [Taboola](https://github.com/singer-io/tap-taboola) | [Fishtown Analytics](https://github.com/fishtown-analytics) |
+| tap | [Facebook](https://github.com/singer-io/tap-facebook) | [Stitch Data](https://stitchdata.com) |
+| tap | [Google AdWords](https://github.com/singer-io/tap-adwords) | [Stitch Data](https://stitchdata.com) |
+| tap | [Fullstory](https://github.com/singer-io/tap-fullstory) | [Expected Behavior](https://github.com/expectedbehavior) |
+| target | [Stitch](https://github.com/singer-io/target-stitch) | [Stitch Data](https://stitchdata.com) |
+| target | [CSV](https://github.com/singer-io/target-csv) | [Stitch Data](https://stitchdata.com) |
+| target | [Google Sheets](https://github.com/singer-io/target-gsheet) | [Stitch Data](https://stitchdata.com) |
+| target | [Magento BI](https://github.com/robertjmoore/target-magentobi) | [Robert J Moore](https://github.com/robertjmoore/) |
+
diff --git a/04_MAKE_IT_OFFICIAL.md b/04_MAKE_IT_OFFICIAL.md
new file mode 100644
index 0000000..6c8d293
--- /dev/null
+++ b/04_MAKE_IT_OFFICIAL.md
@@ -0,0 +1,18 @@
+# BECOME OFFICIALLY COOL
+
+So you've built a tap or a target, have you? We think that's pretty groovy. To submit a tap for integration with Stitch and become official, we ask that it follow a set standard.
If you're interested in submitting to be an official tap we're mighty obliged and created a checklist so you can increase your chances of integration. + +### Check out the [BEST PRACTICES](05_BEST_PRACTICES.md) doc which will have all the instructions and way more in depth details of the following: +- [ ] Your work has a `start_date` field in the config +- [ ] Your work accepts a `user_agent` field in the config +- [ ] Your work respects API rate limits +- [ ] Your work doesn't impose memory constraints +- [ ] Your dates are all in RFC3339 format +- [ ] All states are in date format +- [ ] All data is streamed in ascending order if possible +- [ ] Your work doesn't contain any sensitive info like API keys, client work, etc. +- [ ] Please keep your schemas stored in a schema folder +- [ ] You've tested your work +- [ ] Please run pylint on your work +- [ ] Your work shows metrics +- [ ] Message [@BrianSloan](brian@stitchdata.com) or [@Ash_Hathaway](ashley@stitchdata.com) or reach out to them on [Slack](https://singer-slackin.herokuapp.com/) and let them know you'd like some swag, please. diff --git a/BEST_PRACTICES.md b/05_BEST_PRACTICES.md similarity index 99% rename from BEST_PRACTICES.md rename to 05_BEST_PRACTICES.md index 2bbffa8..12e4f5c 100644 --- a/BEST_PRACTICES.md +++ b/05_BEST_PRACTICES.md @@ -1,5 +1,4 @@ -Best Practices for Building a Singer Tap -============================================ +# BEST PRACTICES Language -------- diff --git a/SPEC.md b/06_SPEC.md similarity index 66% rename from SPEC.md rename to 06_SPEC.md index 7961874..a7d073c 100644 --- a/SPEC.md +++ b/06_SPEC.md @@ -130,7 +130,7 @@ Example: SCHEMA messages describe the datatypes of data in the stream. They must have the following properties: - - `schema` **Required**. A [JSON Schema] describing the + - `schema` **Required**. A [JSON Schema](http://json-schema.org/) describing the `data` property of RECORDs from the same `stream` - `stream` **Required**. 
The string name of the stream that this
@@ -188,3 +188,69 @@ should be a new MINOR version.
 [JSON Schema]: http://json-schema.org/ "JSON Schema"
 [Semantic Versioning]: http://semver.org/ "Semantic Versioning"
+
+
+
+# Data Types and Schemas
+
+JSON is used to represent data because it is ubiquitous, readable, and
+especially appropriate for the large universe of sources that expose data
+as JSON, such as web APIs. However, JSON is far from perfect:
+
+ - it has a limited type system, without support for common types like
+   dates, and no distinction between integers and floating point numbers
+
+ - while its flexibility makes it easy to use, it can also cause
+   compatibility problems
+
+*Schemas* are used to solve these problems. Generally speaking, a schema
+is anything that describes how data is structured. In Streams, schemas are
+written by streamers in *SCHEMA* messages, formatted following the
+[JSON Schema](http://json-schema.org/) spec.
+
+Schemas solve the limited data types problem by providing more information
+about how to interpret JSON's basic types. For example, the [JSON Schema]
+spec distinguishes between `integer` and `number` types, where the latter
+is appropriately interpreted as a floating point number. Additionally, it
+defines a string format called `date-time` that can be used to indicate
+when a data point is expected to be a
+[properly formatted](https://tools.ietf.org/html/rfc3339) timestamp
+string.
+
+Schemas mitigate JSON's compatibility problem by providing an easy way to
+validate the structure of a set of data points. Streams deploys this
+concept by encouraging use of only a single schema for each substream, and
+validating each data point against its schema prior to persistence.
This +forces the streamer author to think about how to resolve schema evolution +and compatibility questions, placing that responsibility as close to the +original data source as possible, and freeing downstream systems from +making uninformed assumptions to resolve these issues. + +Schemas are required, but they can be defined in the broadest terms - a +JSON Schema of '{}' validates all data points. However, it is a best +practice for streamer authors to define schemas as narrowly as possible. + +## Schemas in Stitch + +The Stitch persister and Stitch API use schemas as follows: + + - the Stitch persister fails when it encounters a data point that doesn't + validate against its stream's latest schema + - schemas must be an 'object' at the top level + - Stitch supports schemas with objects nested to any depth, and arrays of + objects nested to any depth - more info in the + [Stitch docs](https://www.stitchdata.com/docs/data-structure/nested-data-structures-row-count-impact) + - properties of type `string` and format `date-time` are converted to + the appropriate timestamp or datetime type in the destination database + - properties of type `integer` are converted to integer in the destination + database + - properties of type `number` are converted to decimal or numeric in the + destination database + - (soon) the `maxLength` parameter of a property of type `string` is used + to define the width of the corresponding varchar column in the + destination database + - when Stitch encounters a schema for a stream that is incompatible with + the table that stream is to be loaded into in the destination database, + it adds the data to the + [reject pile](https://www.stitchdata.com/docs/data-structure/identifying-rejected-records) + diff --git a/PROPOSALS.md b/07_PROPOSALS.md similarity index 100% rename from PROPOSALS.md rename to 07_PROPOSALS.md diff --git a/08_CODE_OF_CONDUCT.md b/08_CODE_OF_CONDUCT.md new file mode 100644 index 0000000..3facc90 --- /dev/null +++ 
b/08_CODE_OF_CONDUCT.md @@ -0,0 +1,81 @@ +# Code of Conduct + +## 1. Purpose + +A primary goal of Singer is to be inclusive to the largest number of contributors, with the most varied and diverse backgrounds possible. As such, we are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, ability, ethnicity, socioeconomic status, and religion (or lack thereof). + +This code of conduct outlines our expectations for all those who participate in our community, as well as the consequences for unacceptable behavior. + +We invite all those who participate in Singer to help us create safe and positive experiences for everyone. + +## 2. Open Source Citizenship + +A supplemental goal of this Code of Conduct is to increase open source citizenship by encouraging participants to recognize and strengthen the relationships between our actions and their effects on our community. + +Communities mirror the societies in which they exist and positive action is essential to counteract the many forms of inequality and abuses of power that exist in society. + +If you see someone who is making an extra effort to ensure our community is welcoming, friendly, and encourages all participants to contribute to the fullest extent, we want to know. + +## 3. Expected Behavior + +The following behaviors are expected and requested of all community members: + +* Participate in an authentic and active way. In doing so, you contribute to the health and longevity of this community. +* Exercise consideration and respect in your speech and actions. +* Attempt collaboration before conflict. +* Refrain from demeaning, discriminatory, or harassing behavior and speech. +* Be mindful of your surroundings and of your fellow participants. Alert community leaders if you notice a dangerous situation, someone in distress, or violations of this Code of Conduct, even if they seem inconsequential. 
+* Remember that community event venues may be shared with members of the public; please be respectful to all patrons of these locations.
+
+## 4. Unacceptable Behavior
+
+The following behaviors are considered harassment and are unacceptable within our community:
+
+* Violence, threats of violence or violent language directed against another person.
+* Sexist, racist, homophobic, transphobic, ableist or otherwise discriminatory jokes and language.
+* Posting or displaying sexually explicit or violent material.
+* Posting or threatening to post other people’s personally identifying information ("doxing").
+* Personal insults, particularly those related to gender, sexual orientation, race, religion, or disability.
+* Inappropriate photography or recording.
+* Inappropriate physical contact. You should have someone’s consent before touching them.
+* Unwelcome sexual attention. This includes sexualized comments or jokes; inappropriate touching, groping, and unwelcome sexual advances.
+* Deliberate intimidation, stalking or following (online or in person).
+* Advocating for, or encouraging, any of the above behavior.
+* Sustained disruption of community events, including talks and presentations.
+
+## 5. Consequences of Unacceptable Behavior
+
+Unacceptable behavior from any community member, including sponsors and those with decision-making authority, will not be tolerated.
+
+Anyone asked to stop unacceptable behavior is expected to comply immediately.
+
+If a community member engages in unacceptable behavior, the community organizers may take any action they deem appropriate, up to and including a temporary ban or permanent expulsion from the community without warning (and without refund in the case of a paid event).
+
+## 6. Reporting Guidelines
+
+If you are subject to or witness unacceptable behavior, or have any other concerns, please notify a community organizer as soon as possible.
Contact ashley at stitchdata dot com or brian at stitchdata dot com.
+
+Additionally, community organizers are available to help community members engage with local law enforcement or to otherwise help those experiencing unacceptable behavior feel safe. In the context of in-person events, organizers will also provide escorts as desired by the person experiencing distress.
+
+## 7. Addressing Grievances
+
+If you feel you have been falsely or unfairly accused of violating this Code of Conduct, you should notify Singer with a concise description of your grievance. Your grievance will be handled in accordance with our existing governing policies.
+
+
+## 8. Scope
+
+We expect all community participants (contributors, paid or otherwise; sponsors; and other guests) to abide by this Code of Conduct in all community venues, online and in-person, as well as in all one-on-one communications pertaining to community business.
+
+This code of conduct and its related procedures also apply to unacceptable behavior occurring outside the scope of community activities when such behavior has the potential to adversely affect the safety and well-being of community members.
+
+## 9. Contact info
+
+ashley at stitchdata dot com
+
+## 10. License and attribution
+
+This Code of Conduct is distributed under a [Creative Commons Attribution-ShareAlike license](http://creativecommons.org/licenses/by-sa/3.0/).
+
+Portions of text derived from the [Django Code of Conduct](https://www.djangoproject.com/conduct/) and the [Geek Feminism Anti-Harassment Policy](http://geekfeminism.wikia.com/wiki/Conference_anti-harassment/Policy).
+ +Retrieved on November 22, 2016 from [http://citizencodeofconduct.org/](http://citizencodeofconduct.org/) diff --git a/README.md b/README.md index e4f38a1..4cf3277 100644 --- a/README.md +++ b/README.md @@ -1,274 +1,31 @@ -# Getting Started with Singer +![Singer Logo](https://trello-attachments.s3.amazonaws.com/58c8696247956895aea87ef2/58d2d15c8baaf0c33f36c87f/a789e41241329c5b972a6e105e954543/Screen_Shot_2017-04-27_at_3.58.09_PM.png) -Singer is an open source standard for moving data between databases, -web APIs, files, queues, and just about anything else you can think -of. The [Singer spec] describes how data extraction scripts โ€” called -โ€œTapsโ€ โ€” and data loading scripts โ€” called โ€œTargetsโ€ โ€” should -communicate using a standard JSON-based data format over `stdout`. By -conforming to this spec, Taps and Targets can be used in any -combination to move data from any source to any destination. +# ๐ŸŽ‰๐Ÿ‘‹๐Ÿฝ Welcome to Singer: Open-source ETL ๐ŸŽ‰๐Ÿ‘‹๐Ÿฝ -**Topics** - - [Using Singer to populate Google Sheets](#using-singer-to-populate-google-sheets) - - [Developing a Tap](#developing-a-tap) - - [Additional Resources](#additional-resources) - -## Using Singer to populate Google Sheets +## Singer is an open source ETL tool. In case you're unfamiliar ETL stands for Extract, Transform, and Load. It's a term used in the data warehousing world. If you happen to be moving data or needing to pull data we want to help! -The [Google Sheets Target] can be combined with any Singer Tap to -populate a Google Sheet with data. This example will use currency -exchange rate data from the [Fixer.io Tap]. [Fixer.io] is a free API for -current and historical foreign exchange rates published by the -European Central Bank. +### Singer sets a standard for moving data between databases, web APIs, files, queues, or just about anything else you can think of. _(Except penguins ๐Ÿง and candy ๐Ÿฌ. 
We haven't figured out how to move those yet unfortunately.)_
-The steps are:
- 1. [Activate the Google Sheets API](#step-1---activate-the-google-sheets-api)
- 1. [Configure the Target](#step-2---configure-the-target)
- 1. [Install](#step-3---install)
- 1. [Run](#step-4---run)
- 1. [Save State (optional)](#step-5---save-state-optional)
+In this documentation we'll take you through a number of scenarios.
-### Step 1 - Activate the Google Sheets API
+- 🍺 If you'd like to pull or extract data check out [TAPS](01_EXTRACT_WITH_TAPS.md)
+- 🎯 If you'd like to send or load data check out [TARGETS](02_SEND_TO_TARGETS.md)
+- If you want to dive into some technical goodness check out our [SPECS](06_SPEC.md)
+- 😎✅ Once you've created your own tap or target be sure to let us know and join our kool kids [UNOFFICIAL](03_COOL_TAPS_CLUB.md) club or learn how to submit to be part of the super cool [OFFICIAL](04_MAKE_IT_OFFICIAL.md) integrations.
+- 💯 Check this out to learn more about [BEST PRACTICES](05_BEST_PRACTICES.md)
+- 🤝 And above all please respect our [CODE OF CONDUCT](08_CODE_OF_CONDUCT.md)
- (originally found in the [Google API
- docs](https://developers.google.com/sheets/api/quickstart/python))
-
- 1. Use [this
- wizard](https://console.developers.google.com/start/api?id=sheets.googleapis.com)
- to create or select a project in the Google Developers Console and
- activate the Sheets API. Click Continue, then Go to credentials.
- 1. On the **Add credentials to your project** page, click the
- **Cancel** button.
+### Communication
+If you're feeling social we'd love to chat. Pick your poison(s):
+- [Slack](https://singer-slackin.herokuapp.com/)
+- [Twitter](https://twitter.com/singer_io)
+- [Our Public Roadmap on Trello](https://trello.com/b/BMNRnIoU/singer-roadmap)
+- Feel free to create an issue on any repo for specific questions
+- 🐦 Carrier pigeon (beta)
- 1. At the top of the page, select the **OAuth consent screen**
- tab.
Select an **Email address**, enter a **Product name** if not - already set, and click the **Save** button. - - 1. Select the **Credentials** tab, click the **Create credentials** - button and select **OAuth client ID**. - - 1. Select the application type **Other**, enter the name "Singer - Sheets Target", and click the **Create** button. - - 1. Click **OK** to dismiss the resulting dialog. - - 1. Click the Download button to the right of the client ID. - - 1. Move this file to your working directory and rename it - `client_secret.json`. - -### Step 2 - Configure the Target - -Created a file called `config.json` in your working directory, -following [config.sample.json](https://github.com/singer-io/target-gsheet/blob/master/config.sample.json). The required -`spreadsheet_id` parameter is the value between the "/d/" and the -"/edit" in the URL of your spreadsheet. For example, consider the -following URL that references a Google Sheets spreadsheet: - -``` -https://docs.google.com/spreadsheets/d/1qpyC0XzvTcKT6EISywvqESX3A0MwQoFDE8p-Bll4hps/edit#gid=0 -``` - -The ID of this spreadsheet is -`1qpyC0XzvTcKT6EISywvqESX3A0MwQoFDE8p-Bll4hps`. - - -### Step 3 - Install - -First, make sure Python 3 is installed on your system or follow these -installation instructions for [Mac](python-mac) or -[Ubuntu](python-ubuntu). - -`target-gsheet` can be run with any [Singer Tap] to move data from -sources like [Braintree], [Freshdesk] and [Hubspot] to Google -Sheets. We'll use the [Fixer.io Tap] - which pulls currency exchange -rate data from a public data set - as an example. - -We recommend installing each Tap and Target in a separate Python virtual -environment. This will insure that you won't have conflicting dependencies -between any Taps and Targets. 
- -These commands will install `tap-fixerio` and `target-gsheet` with pip in -their own virtual environments: - -```bash -# Install tap-fixerio in its own virtualenv -virtualenv -p python3 tap-fixerio -tap-fixerio/bin/pip install tap-fixerio - -# Install target-gsheet in its own virtualenv -virtualenv -p python3 target-gsheet -target-gsheet/bin/pip install target-gsheet -``` - -### Step 4 - Run - -This command will pipe the output of `tap-fixerio` to `target-gsheet`, -using the configuration file created in Step 2: - -```bash -โ€บ tap-fixerio/bin/tap-fixerio | target-gsheet/bin/target-gsheet -c config.json - INFO Replicating the latest exchange rate data from fixer.io - INFO Tap exiting normally -``` - -`target-gsheet` will attempt to open a new window or tab in your -default browser to perform authentication. If this fails, copy the URL -from the console and manually open it in your browser. - -If you are not already logged into your Google account, you will be -prompted to log in. If you are logged into multiple Google accounts, -you will be asked to select one account to use for the -authorization. Click the **Accept** button to allow `target-gsheet` to -access your Google Sheet. You can close the tab after the signup flow -is complete. - -Each stream generated by the Tap will be written to a different sheet -in your Google Sheet. For the [Fixer.io Tap] you'll see a single sheet -named `exchange_rate`. - -### Step 5 - Save State (optional) - -When `target-gsheet` is run as above it writes log lines to `stderr`, -but `stdout` is reserved for outputting **State** messages. A State -message is a JSON-formatted line with data that the Tap wants -persisted between runs - often "high water mark" information that the -Tap can use to pick up where it left off on the next run. Read more -about State messages in the [Singer spec]. 
- -Targets write State messages to `stdout` once all data that appeared -in the stream before the State message has been processed by the -Target. Note that although the State message is sent into the target, -in most cases the target's process won't actually store it anywhere or -do anything with it other than repeat it back to `stdout`. - -Taps like the [Fixer.io Tap] can also accept a `--state` argument -that, if present, points to a file containing the last persisted State -value. This enables Taps to work incrementally - the State -checkpoints the last value that was handled by the Target, and the -next time the Tap is run it should pick up from that point. - -To run the [Fixer.io Tap] incrementally, point it to a State file and -capture the persister's `stdout` like this: - -```bash -โ€บ tap-fixerio --state state.json | target-gsheet -c config.json >> state.json -โ€บ tail -1 state.json > state.json.tmp && mv state.json.tmp state.json -(rinse and repeat) -``` - -## Developing a Tap - -If you can't find an existing Tap for your data source, then it's time -to build your own. - -**Topics**: - - [Hello, world](#hello-world) - - [A Python Tap](#a-python-tap) - -### Hello, world - -A Tap is just a program, written in any language, that outputs data to -`stdout` according to the [Singer spec]. In fact, your first Tap can -be written from the command line, without any programming at all: - -```bash -โ€บ printf '{"type":"SCHEMA", "stream":"hello","key_properties":[],"schema":{"type":"object", "properties":{"value":{"type":"string"}}}}\n{"type":"RECORD","stream":"hello","schema":"hello","record":{"value":"world"}}\n' -``` - -This writes the datapoint `{"value":"world"}` to the *hello* -stream along with a schema indicating that `value` is a string. 
-That data can be piped into any Target, like the [Google Sheets -Target], over `stdin`: - -```bash -› printf '{"type":"SCHEMA", "stream":"hello","key_properties":[],"schema":{"type":"object", "properties":{"value":{"type":"string"}}}}\n{"type":"RECORD","stream":"hello","schema":"hello","record":{"value":"world"}}\n' | target-gsheet -c config.json -``` - -### A Python Tap - -To move beyond *Hello, world* you'll need a real programming language. -Although any language will do, we have built a Python library to help -you get up and running quickly. - -Let's write a Tap called `tap_ip.py` that retrieves the current - IP using icanhazip.com, and writes that data with a timestamp. - -First, install the [Singer helper library] with `pip`: - -```bash -› pip install singer-python -``` - -Then, open up a new file called `tap_ip.py` in your favorite editor. - -```python -import singer -import urllib.request -from datetime import datetime, timezone -``` - -We'll use the `datetime` module to get the current timestamp, the -`singer` module to write data to `stdout` in the correct format, and -the `urllib.request` module to make a request to icanhazip.com. - -```python -now = datetime.now(timezone.utc).isoformat() -schema = { - 'properties': { - 'ip': {'type': 'string'}, - 'timestamp': {'type': 'string', 'format': 'date-time'}, - }, -} - -``` - -This sets up some of the data we'll need - the current time, and the -schema of the data we'll be writing to the stream formatted as a [JSON -Schema]. 
- -```python -with urllib.request.urlopen('http://icanhazip.com') as response: - ip = response.read().decode('utf-8').strip() - singer.write_schema('my_ip', schema, 'timestamp') - singer.write_records('my_ip', [{'timestamp': now, 'ip': ip}]) -``` - -Finally, we make the HTTP request, parse the response, and then make -two calls to the `singer` library: - - - `singer.write_schema` which writes the schema of the `my_ip` stream and defines its primary key - - `singer.write_records` to write a record to that stream - -We can send this data to Google Sheets by running our new Tap -with the [Google Sheets Target]: - -```bash -› python tap_ip.py | target-gsheet -c config.json -``` - -## Additional Resources - -Join the [Singer Slack channel] to get help from members of the Singer -community. --- Copyright © 2017 Stitch - -[Singer spec]: SPEC.md -[Singer Tap]: https://singer.io -[Braintree]: https://github.com/singer-io/tap-braintree -[Freshdesk]: https://github.com/singer-io/tap-freshdesk -[Hubspot]: https://github.com/singer-io/tap-hubspot -[Fixer.io Tap]: https://github.com/singer-io/tap-fixerio -[Fixer.io]: http://fixer.io -[python-mac]: http://docs.python-guide.org/en/latest/starting/install3/osx/ -[python-ubuntu]: https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04 -[Google Sheets Target]: https://github.com/singer-io/target-gsheet -[Singer helper library]: https://github.com/singer-io/singer-python -[JSON Schema]: http://json-schema.org/ -[Singer Slack channel]: https://singer-slackin.herokuapp.com/ - diff --git a/SCHEMAS.md b/SCHEMAS.md deleted file mode 100644 index 7dda1d3..0000000 --- a/SCHEMAS.md +++ /dev/null @@ -1,65 +0,0 @@ -# Data Types and Schemas - -JSON is used to represent data because it is ubiquitous, readable, and -especially appropriate for the large universe of sources that expose data -as JSON like web APIs. 
However, JSON is far from perfect: - - - it has a limited type system, without support for common types like - dates, and no distinction between integers and floating point numbers - - - while its flexibility makes it easy to use, it can also cause - compatibility problems - -*Schemas* are used to solve these problems. Generally speaking, a schema -is anything that describes how data is structured. In Streams, schemas are -written by streamers in *SCHEMA* messages, formatted following the -[JSON Schema] spec. - -Schemas solve the limited data types problem by providing more information -about how to interpret JSON's basic types. For example, the [JSON Schema] -spec distinguishes between `integer` and `number` types, where the latter -is appropriately interpreted as a floating point. Additionally, it -defines a string format called `date-time` that can be used to indicate -when a data point is expected to be a -[properly formatted](https://tools.ietf.org/html/rfc3339) timestamp -string. - -Schemas mitigate JSON's compatibility problem by providing an easy way to -validate the structure of a set of data points. Streams deploys this -concept by encouraging use of only a single schema for each substream, and -validating each data point against its schema prior to persistence. This -forces the streamer author to think about how to resolve schema evolution -and compatibility questions, placing that responsibility as close to the -original data source as possible, and freeing downstream systems from -making uninformed assumptions to resolve these issues. - -Schemas are required, but they can be defined in the broadest terms - a -JSON Schema of '{}' validates all data points. However, it is a best -practice for streamer authors to define schemas as narrowly as possible. 
- -## Schemas in Stitch - -The Stitch persister and Stitch API use schemas as follows: - - - the Stitch persister fails when it encounters a data point that doesn't - validate against its stream's latest schema - - schemas must be an 'object' at the top level - - Stitch supports schemas with objects nested to any depth, and arrays of - objects nested to any depth - more info in the - [Stitch docs](https://www.stitchdata.com/docs/data-structure/nested-data-structures-row-count-impact) - - properties of type `string` and format `date-time` are converted to - the appropriate timestamp or datetime type in the destination database - - properties of type `integer` are converted to integer in the destination - database - - properties of type `number` are converted to decimal or numeric in the - destination database - - (soon) the `maxLength` parameter of a property of type `string` is used - to define the width of the corresponding varchar column in the - destination database - - when Stitch encounters a schema for a stream that is incompatible with - the table that stream is to be loaded into in the destination database, - it adds the data to the - [reject pile](https://www.stitchdata.com/docs/data-structure/identifying-rejected-records) - - -[JSON Schema]: http://json-schema.org/
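Reviewer note on the removed SCHEMAS.md text: the ideas it describes (the `integer` versus `number` distinction, the `date-time` string format, and the catch-all `{}` schema) can be illustrated with a small standard-library sketch. The `validate` function below is hypothetical and covers only a tiny subset of JSON Schema. It is not how Singer or Stitch validates data, and it approximates RFC 3339 with Python's ISO 8601 parser.

```python
from datetime import datetime


def validate(value, schema):
    """Return True if `value` satisfies a tiny, illustrative subset of JSON Schema.

    Hypothetical helper: supports only `type` (object, string, integer, number),
    `format: date-time`, and `properties`.
    """
    expected = schema.get('type')
    if expected is None:
        ok = True  # an empty schema such as {} places no type constraint
    elif expected == 'object':
        ok = isinstance(value, dict)
    elif expected == 'string':
        ok = isinstance(value, str)
    elif expected == 'integer':
        # bool is a subclass of int in Python, so exclude it explicitly
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif expected == 'number':
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        raise ValueError('unsupported type in this sketch: %r' % expected)
    if not ok:
        return False

    # Approximate the `date-time` format check with ISO 8601 parsing.
    if schema.get('format') == 'date-time' and isinstance(value, str):
        try:
            datetime.fromisoformat(value.replace('Z', '+00:00'))
        except ValueError:
            return False

    # Recurse into declared properties that are present on the object.
    if isinstance(value, dict):
        for name, subschema in schema.get('properties', {}).items():
            if name in value and not validate(value[name], subschema):
                return False
    return True


ip_schema = {
    'type': 'object',
    'properties': {
        'ip': {'type': 'string'},
        'timestamp': {'type': 'string', 'format': 'date-time'},
    },
}
record = {'ip': '203.0.113.7', 'timestamp': '2017-01-01T00:00:00+00:00'}
print(validate(record, ip_schema))
```

A narrowly defined schema such as `ip_schema` rejects a record whose `timestamp` is not a parseable timestamp, whereas `{}` accepts any value. That is the trade-off the text points to when it recommends defining schemas as narrowly as possible.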