and the surprising things you can do with them
Google Spreadsheets is the Excel equivalent from Microsoft in the cloud. These spreadsheets can be filled with numbers and facts, column names can be defined and functions can be implemented. So far so standard, but doing all these things in the cloud opens up some interesting possibilities.
Documents can be accessed via various APIs and thus data can be loaded to and from these Spreadsheets. This makes the service eligible for building entire ETL processes. Looking at a cost factor of zero cents per document and ETL process this creates the equivalent of a free table storage in the cloud that can be used for private use cases or to set up entire Proof of Concepts.
Using Google Spreadsheets with Python
Python is a wildly popular programming language — especially within the data science community — and to no surprise does an API exist to connect to the Google Spreadsheet service. The package in this case is name gspread, the documentation can be found here.
The tedious but necessary process of access management in the cloud is documented in this paragraph in the official documentation. Remember to export those credentials into a JSON file to automate the authentication process and save yourself as much headache as possible.
After successfully creating the initial documents and a technical user, each document must be shared with this user before it can be manipulated via the API. The data is not accessible to anyone else and thus save in the cloud — as far as you trust Google to keep your data safe.
Think of each document as a set of tables, kinda like a schema in a SQL database. The worksheets per document become your tables in this process. Speaking names should be used (for example staging_crm, processed_sales, serving_dashboard) so that the overview never gets lost. As a disadvantage of this system primary or foreign keys can not be set and data integrity must be secured within the code if need should be.
In order to load the file into a pandas Dataframe use the gspread API as follows:
import pandas as pd# create connection to google spreadsheet service
conn = gspread.service_account(filename=PATH_GOOGLE_SERVICE_ACCOUNT)# load sheet from file in google cloud
sheet_data = conn_google.open(DATA_SPREADSHEET_NAME)# open specific worksheet within the spreadsheet document
data_worksheet = sheet_data.worksheet(DATA_WORKSHEET_NAME)# load data into pandas dataframe
data = data_worksheet.get_all_values()
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
A connection is opened and first the file and later a specific worksheet are accessed. Finally the data is extracted with "get_all_values()" and converted into a pandas dataframe.
Similairly the data can be loaded back into a sheet from a dataframe:
# fill worksheet with header line
all_cols = df.columns.values
header = all_cols.tolist()
worksheet.insert_row(header, 1)# fill worksheet with data lines
data_rows = 
for idx in range(2, len(df)): # counting starts at 1 not at 0
pd_row = df.iloc[idx]
row = pd_row.tolist()
Data can be inserted line by line with "worksheet.insert_row()" but the number of data inserts per session is limited. So the command "worksheet.insert_rows()" must be used.
Together with the "worksheet.delete_rows()" command you have now all the tools to build a data pipeline in the cloud.
Check out my Git Repo where I wrote some functions to abstract this process inside usable procedures. Now you can start building your own use cases with this easy to use and flexible cloud data storage solution. Build a Web Application with a backbone entirely made out of Google Spreadsheets. How amazing that would be!
Using Google Spreadsheets with Tableau Public
Tableau is a Dashboarding Tool that tries to make decision making within companies more data-drive then ever. The company offers a service called 'Tableau Public" where dashboards can be created and hosted on their site making them available online.
In order to create a hosted dashboard you have to download the official App here and publish your dashboard from within the Application. The Dashboard can later on be modified from within your browser, but connections can only be modified from within the App.
The visualised data can be loaded and stored within the hosted dashboard. As great a service as this may be, this is only true for static data. The kind tat gets loaded once and then remains unchanged. But what if you want to update your data regularly and want your dashboard to be also updated? Here come Google Spreadsheets into play.
Data in the spreadsheets is already stored in the cloud and the dashboards form the Tableau Public service are updated as soon as new data is loaded into the spreadsheets. Column names are taken from the header row in the spreadsheet documents, but beware: Tableau Public requires the data in pivoted form. Depending on your initial data ingestion phase some preprocessing might be necessary.
Afterwards the data can be used like any other data source within Tableau Public. You can now start creating those stunning dashboards you always had in you.
To make the dashboard publicly available it must be published from within the application. Due to a bug this can take up to 15 minutes if the sync with the Google Spreadsheets is activated. Do not be disheartened if this happens, just wait it out.
If you want to see such a dashboard in action visit my site ai-for-everyone.org and check out my Blog Entry about the German "Sonntagsfrage".
Using Google Spreadsheets as data storage does not offer an enterprise-ready solution but it offers a free, flexible and easy to implement storage solution in the cloud with some nice features towards services like Tableau Public.
But it is still surprising how far this service can be stretched and what it can be used for.
Check out my Blog and my corresponding GitHub repository where i steadily build a Data Science Portfolio. You are also invited to engage with me on Twitter.