HUMAN Blog

Mi Casa Es Su Casa: How White Ops Leverages Snowflake for Sharing Data with Customers

There are two things White Ops loves more than anything else: humans and data analysis. We have an incredible R&D team that continuously examines our data to uncover new threats of digital fraud and abuse. But this is not enough — we want to be a force multiplier in the world, and we believe that more humans looking at data creates an exponential increase in the number of fraudulent operations we can detect and prevent. Hence, our objective: expand the number of people who can analyze our data to protect humanity and the internet from digital fraud and abuse by cybercriminals and bad actors.

But how?

A goal like this requires a little help from your friends, and luckily for us, we decided to join forces with Snowflake a long time ago (2014 to be exact). In fact, we were Snowflake’s second customer (missed out on the top spot by only a week). White Ops’ data is securely stored in Snowflake’s industry-leading data warehouse solution. But well beyond data storage, Snowflake offers a full range of tools to securely share data with external organizations, making this a perfect platform for our objective.

Where to Find White Ops in the Snowflake Data Marketplace

To start, you can find White Ops’ tiles in the Snowflake Data Marketplace. Simply click on the tiles for Advertising Integrity, Marketing Integrity, or Application Integrity to be put in contact with a White Ops representative who will quickly assist you in setting up the data share.


The Solution

Snowflake shares are the main gateway that allows Snowflake customers to import data from another account. Once a customer imports a share, they can create a database from it, effectively enabling consumers to query the data as if it were a normal database.

Our workflow centers on creating a share that is made available to customers for import. However, we need to be mindful of two constraints:

  • Confidentiality of data - disclosing only data that belongs to the specific customer
  • Ensuring shares are only accessible within the customer’s cloud platform region (i.e., shares must reside in the same region as the customer)

Our solution is to replicate and share customer-specific information in the appropriate region, minimizing costs and respecting data privacy along the way. Let’s separate this into two steps: 

  1. Replicate data in customer-specific regions and
  2. Share regional data with customers.

Step 1: Data Replication

Data replication is performed automatically via two powerful Snowflake features: Streams and Tasks. The idea is simple:

  • consume the data that was pushed by our data pipeline
  • check if any of this data belongs to a customer who enabled data sharing
  • replicate this data on a database in the data owner’s region

A picture is worth a thousand words. This is the desired schema:

[Diagram: new rows from the central database are filtered by customer region and replicated into per-region databases]

Our pipeline stores data in a set of tables in a central database, and we can use Streams to track modifications to these tables.

A Snowflake Stream is a logical object that takes a snapshot of the current state of the table it is attached to and records information about any subsequent change to the data in that table (data manipulation language changes: inserts, updates, and deletes). In our case, our tables have a write-once policy: no deletions and no updates are possible. This means we can simply track the inserts on the table with a Stream and then query the Stream itself to get the list of rows inserted.

Streams are transactional, meaning that within the same transaction we can query the Stream multiple times and get the same rows every time. Once we close the transaction, the snapshot is updated and new rows will appear in the Stream.

The code to do that is simple:

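A minimal sketch, using the stream and table names (mystream, mytable) that appear later in this post:

    -- Create an append-only stream that records rows inserted into the central table.
    CREATE OR REPLACE STREAM mystream ON TABLE mytable APPEND_ONLY = TRUE;

    -- Inside a transaction, the stream returns the same set of new rows on every read.
    BEGIN;
    SELECT * FROM mystream;
    COMMIT;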

This is the first piece of the puzzle: we have a way to get the rows that have been “recently” added. In addition, we can repeatedly perform the same query inside a single transaction.

Now, we want to extract from these rows the ones that need to be replicated on a secondary database that will reside in the same region as the customer.

Snowflake has a public list of supported regions. Each customer in Snowflake has a primary account region, and can set up several secondary accounts. Again, the share we want to create must be in the same region as the primary account region of the customer.

We store the mapping from a given customer to their primary account region in a separate table. We then join this table with the rows returned by the stream, grouping them according to the region. In this way, we create a per-region group of rows, ready to be exported.

In SQL we perform the following queries:

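A sketch of this join, assuming a customer_id column on both the data rows and the customers_regions mapping (db_region is the column referenced again in the note further below):

    -- Tag each new row with the region of the customer it belongs to; only
    -- customers present in customers_regions (i.e., with sharing enabled) match.
    SELECT r.db_region, s.*
    FROM mystream s
    JOIN customers_regions r
      ON s.customer_id = r.customer_id;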

Now, for each region, we have a subset of rows that we need to push into a database stored on that region. It’s simple enough.

The final missing piece is exporting data: we selected the rows, now we need to export them in each region. We do this via database replication.

As mentioned, Snowflake uses the concept of primary and secondary accounts. A primary account is set up during the onboarding process and is tied to a specific region. We cannot change the region associated with this account, but we can create secondary accounts linked to other regions. Therefore, we need to create a secondary account in the region where we want to replicate the data, create a new database in the primary account that will hold the rows being exported, and finally configure the secondary account to hold a replica of the database just created in the primary account.

Creating the secondary account can be done by following the procedure in the Snowflake User Guide. From there, we create the new database and set up the secondary account to replicate it.

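A sketch of this setup, assuming an organization named myorg with a primary account primary_acct and a secondary account eu_west_1_acct (all names are illustrative):

    -- On the primary account: create the regional database (with a schema for
    -- the data tables) and allow it to be replicated to the secondary account.
    CREATE DATABASE export_eu_west_1;
    CREATE SCHEMA export_eu_west_1.private;
    ALTER DATABASE export_eu_west_1
      ENABLE REPLICATION TO ACCOUNTS myorg.eu_west_1_acct;

    -- On the secondary account: create the local replica and refresh it.
    CREATE DATABASE export_eu_west_1
      AS REPLICA OF myorg.primary_acct.export_eu_west_1;
    ALTER DATABASE export_eu_west_1 REFRESH;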

That’s it! We now have:

  • a stream that tells us the rows that have been added to the primary table
  • a table that tells us which region each customer belongs to
  • a series of databases that replicate data for a specific subset of customers in the appropriate regions

Data replication step: Complete!

Bonus! Automate All the Things

Of course, manually inserting data and refreshing databases isn’t scalable. We need to automate these steps by fetching data from the stream and refreshing the databases automatically, at regular intervals.

We do this by using Tasks. Tasks are Snowflake objects that run a SQL statement (for example, a stored procedure call) on a defined schedule. We use two tasks:

  • one Task is in charge of reading the stream and inserting a subset of data in the appropriate regional database
  • one Task (in the secondary region) is in charge of refreshing the regional (replicated) database

However, this is not enough. The problem is that we do not start with a fixed number of regions: we want to add one region at a time, according to the customers we are sharing data with. We therefore need a way to inject into our SQL statements a database name taken from the customer-to-region table. Enter Stored Procedures.

A Stored Procedure is an object that can dynamically create and execute SQL statements using JavaScript. In a stored procedure we can easily read the list of regional databases and build the SQL statement that inserts the data of a specific subset of customers into each regional database. Furthermore, thanks to the ability of Streams to return the same set of data on multiple reads performed in a single transaction, we can repeat this operation for every database.

An example of a Stored Procedure is:

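Below is a sketch of what such a procedure could look like. The procedure name (insert_data_to_all_replica_dbs), the stream (mystream), and the customers_regions table come from this post; the regional database naming (export_<region>) and the column names are illustrative, and for simplicity the sketch opens and commits the transaction inside the procedure itself:

    CREATE OR REPLACE PROCEDURE insert_data_to_all_replica_dbs()
      RETURNS STRING
      LANGUAGE JAVASCRIPT
      AS
    $$
      // Open a single transaction so that every read of the stream below
      // returns the same set of rows.
      snowflake.execute({ sqlText: "BEGIN" });

      // Read the list of regional databases from the customer-to-region mapping.
      var regions = snowflake.execute({
        sqlText: "SELECT DISTINCT db_region FROM customers_regions"
      });

      while (regions.next()) {
        var dbRegion = regions.getColumnValue(1);
        // db_region values come from a fixed ENUM, so concatenating them into
        // the statement is safe (see the note below). customer_id and payload
        // stand in for the real table columns.
        var stmt = "INSERT INTO export_" + dbRegion + ".private.mytable (customer_id, payload) " +
                   "SELECT s.customer_id, s.payload FROM mystream s " +
                   "JOIN customers_regions r ON s.customer_id = r.customer_id " +
                   "WHERE r.db_region = '" + dbRegion + "'";
        snowflake.execute({ sqlText: stmt });
      }

      // Committing advances the stream offset, marking the rows as consumed.
      snowflake.execute({ sqlText: "COMMIT" });
      return "done";
    $$;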

And finally, let’s call this procedure in a Task that opens and closes a single transaction:

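A sketch of the two tasks (warehouse, task, and database names are illustrative; the 5-minute schedule and the SYSTEM$STREAM_HAS_DATA check correspond to what is described below):

    -- On the primary account: every 5 minutes, if the stream holds new data,
    -- call the procedure that distributes it to the regional databases.
    CREATE OR REPLACE TASK replicate_customer_data
      WAREHOUSE = my_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('mystream')
    AS
      CALL insert_data_to_all_replica_dbs();
    ALTER TASK replicate_customer_data RESUME;   -- tasks are created suspended

    -- On the secondary account: refresh the replica on a similar schedule.
    CREATE OR REPLACE TASK refresh_regional_replica
      WAREHOUSE = my_wh
      SCHEDULE = '5 MINUTE'
    AS
      ALTER DATABASE export_eu_west_1 REFRESH;
    ALTER TASK refresh_regional_replica RESUME;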

Our Task checks every 5 minutes whether the Stream named “mystream” holds new data and, if so, calls the insert_data_to_all_replica_dbs procedure to insert the data into the various regional databases, from which it is replicated to the customer regions.

NOTE: As shown, the Stored Procedure uses string concatenation to specify the target database. This is necessary because variable binding cannot be used for object names. It is therefore important to ensure that the table customers_regions (the source of the db_region parameter) is never populated with arbitrary input. In our case, we use a fixed ENUM to enumerate the possible values of the db_region parameter, and only values from that enum can be inserted.

Step 2: Data Sharing

So far so good. Every region holds a replica of a database containing the data of the customers whose primary account resides in that specific region. Now we want to allow customers to query their own data without disclosing rows that belong to other customers in the same region. For this we need a couple more arrows from Snowflake’s quiver: Secure Views and Shares.

Secure Views are a special kind of view designed to protect the privacy of the underlying data. While slightly slower than normal views, they guarantee that the query optimizer will not apply user-supplied filters before the authorization predicate is evaluated. This is fundamental for our purposes.

Our workflow can be summarized as follows:

  • create two schemas, a public and a private one
  • tables holding data are stored in the private schema. No customer has direct access to this schema
  • an additional table (customers_accounts) in the private schema links the customer IDs with Snowflake account names
  • configure a set of Secure Views in the public schema; each Secure View joins the data table with the customers_accounts table and returns only the data belonging to the specific customer
  • share the Secure Views via a share

Or, as a picture:

[Diagram: the data tables and customers_accounts live in the private schema; the public schema exposes Secure Views, which are shared with customers via a Share]

Note that this entire database is replicated on the secondary account, but the Share is not. That means we can fully control data from our primary account, and only access the secondary account when we need to set up the share.

From the first step, we have a replicated database holding the two schemas, private and public. Now, in this order, we want to:

  • create the customers_accounts table
  • configure the views
  • configure the share

The table customers_accounts is a simple table that links the customer ID with the corresponding Snowflake account name. The Snowflake account name is the primary account name set during the onboarding procedure, and it’s unique per Snowflake account (i.e., it is independent of the user performing the query). It can be retrieved via the current_account() function.

The table has a simple format:

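A sketch of the table definition (column types are assumptions; the table lives in the private schema of the regional database):

    CREATE TABLE IF NOT EXISTS private.customers_accounts (
      customer_id   NUMBER,   -- White Ops customer ID
      account_name  STRING    -- Snowflake account name, as returned by current_account()
    );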

This table will be used as a pivot to allow customers to access only their data. Removing an entry from this table will effectively deny a specific customer access to their own data.

The secure view is where the most interesting operation is performed. The idea is to join customers_accounts with the data table (mytable) and use the current_account() function to obtain the Snowflake account name under which the query against the view is running. This dynamically selects only the rows in mytable that correspond to that specific account name, ensuring only a customer’s own data is disclosed to them.

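A sketch of such a view, reusing the mytable and customers_accounts names from this post (the private/public schema names are illustrative); it is created on the primary database and replicated to the secondary along with everything else:

    CREATE OR REPLACE SECURE VIEW public.mytable AS
    SELECT t.*
    FROM private.mytable t
    JOIN private.customers_accounts a
      ON t.customer_id = a.customer_id
    WHERE a.account_name = current_account();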

Victory is near! As a final step, we need to create the share and add the public secure view to it, at which point the share is ready to be used by our customers.

This is as simple as performing a CREATE SHARE operation and granting the necessary privileges on the appropriate views.

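A sketch, executed on the secondary account against the local replica (the database name is illustrative; the share name is one of the real shares listed further below):

    CREATE SHARE whiteops_share_advertising_integrity;
    GRANT USAGE ON DATABASE export_eu_west_1
      TO SHARE whiteops_share_advertising_integrity;
    GRANT USAGE ON SCHEMA export_eu_west_1.public
      TO SHARE whiteops_share_advertising_integrity;
    GRANT SELECT ON VIEW export_eu_west_1.public.mytable
      TO SHARE whiteops_share_advertising_integrity;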

And now our share is ready to be imported by our customers!

Authorizing a Customer’s Access to Their Data

With the previous pieces in place, it is now simple to authorize a customer’s access to their data. The procedure is as follows:

  • add the customer to the customers_regions table with the primary region of the customer
  • add the customer to the customers_accounts table with the Snowflake account name of the customer
  • alter the share, adding the Snowflake account to the accounts of the share

Let’s take, for example, the case where customer 1234 holds a Snowflake account “ACME” with primary region eu-west-1. The sequence of queries to perform is:

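A sketch of these three steps, using the example values above and the names from the earlier sketches:

    -- 1. Map the customer to their primary region (in the central database).
    INSERT INTO customers_regions (customer_id, db_region)
      VALUES (1234, 'eu_west_1');

    -- 2. Map the customer to their Snowflake account name (on the primary copy,
    --    so that the change replicates to the region).
    INSERT INTO private.customers_accounts (customer_id, account_name)
      VALUES (1234, 'ACME');

    -- 3. Add the account to the share (on the secondary account in eu-west-1).
    ALTER SHARE whiteops_share_advertising_integrity ADD ACCOUNTS = ACME;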

Revoking a Customer’s Access to Their Data

Revoking a customer’s access to their data is performed by simply reversing the queries performed during the authorization procedure.

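A sketch, mirroring the previous steps in reverse:

    -- 1. Remove the account from the share.
    ALTER SHARE whiteops_share_advertising_integrity REMOVE ACCOUNTS = ACME;

    -- 2. Remove the customer-to-account mapping (the secure views stop returning data).
    DELETE FROM private.customers_accounts WHERE customer_id = 1234;

    -- 3. Stop replicating the customer's data to the regional database.
    DELETE FROM customers_regions WHERE customer_id = 1234;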

Importing a Share as a Customer

After configuring the share and adding the account, the customer can simply consume the data as if it were a database in their Snowflake account. 

Creating a database out of a share that the Snowflake account has access to is as simple as performing the following query:

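A sketch, assuming the provider account is named WHITEOPS (illustrative) and using one of the shares listed below:

    CREATE DATABASE whiteops_data
      FROM SHARE whiteops.whiteops_share_advertising_integrity;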

NOTE: Importing a share requires the IMPORT SHARE privilege.

Working with Shared Data

White Ops offers three shares:

  • WHITEOPS_SHARE_MARKETING_INTEGRITY: for our Marketing Integrity customers
  • WHITEOPS_SHARE_APPLICATION_INTEGRITY: for our Application Integrity customers
  • WHITEOPS_SHARE_ADVERTISING_INTEGRITY: for our Advertising Integrity customers (both FraudSensor and MediaGuard)

Data is provided in transactional form; no aggregation is performed, so customers can analyze their data however they prefer.

Let’s see some examples of queries to get a taste of what can be done.

Marketing Integrity

Find user sessions of campaigns that are being targeted by bots:

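For example, a sketch with hypothetical view and column names:

    SELECT campaign_id, COUNT(DISTINCT session_id) AS bot_sessions
    FROM marketing_integrity_events
    WHERE is_bot = TRUE
    GROUP BY campaign_id
    ORDER BY bot_sessions DESC;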

Find the pages that bots enter or traverse through, per pageview:

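Using the same hypothetical view:

    SELECT page_url, COUNT(*) AS bot_pageviews
    FROM marketing_integrity_events
    WHERE is_bot = TRUE
    GROUP BY page_url
    ORDER BY bot_pageviews DESC;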

Application Integrity

Find the event types that are most abused by bots:

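A sketch, again with hypothetical view and column names:

    SELECT event_type, COUNT(*) AS bot_events
    FROM application_integrity_events
    WHERE is_bot = TRUE
    GROUP BY event_type
    ORDER BY bot_events DESC;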

Find the threat categories of bots:

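Using the same hypothetical view:

    SELECT threat_category, COUNT(*) AS detections
    FROM application_integrity_events
    WHERE is_bot = TRUE
    GROUP BY threat_category
    ORDER BY detections DESC;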

Advertising Integrity 

Break down fraudulent transactions by platform and environment type (Mobile, Desktop, CTV):

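A sketch, with hypothetical transaction-level view and column names:

    SELECT platform, environment_type, COUNT(*) AS fraudulent_transactions
    FROM advertising_integrity_transactions
    WHERE is_fraudulent = TRUE
    GROUP BY platform, environment_type
    ORDER BY fraudulent_transactions DESC;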

Find sophisticated IVT hotspots (domains that have a high percentage of bot activity):

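Using the same hypothetical view (is_sivt flags sophisticated invalid traffic):

    SELECT domain,
           SUM(IFF(is_sivt, 1, 0)) / COUNT(*) AS sivt_rate
    FROM advertising_integrity_transactions
    GROUP BY domain
    HAVING COUNT(*) > 1000   -- ignore low-traffic domains
    ORDER BY sivt_rate DESC;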

Identify fraud rates by supplier:

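Using the same hypothetical view, with supplier as an assumed column:

    SELECT supplier,
           SUM(IFF(is_fraudulent, 1, 0)) / COUNT(*) AS fraud_rate
    FROM advertising_integrity_transactions
    GROUP BY supplier
    ORDER BY fraud_rate DESC;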

Identify fraud rates by publisher:
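And similarly, with a hypothetical publisher column:

    SELECT publisher,
           SUM(IFF(is_fraudulent, 1, 0)) / COUNT(*) AS fraud_rate
    FROM advertising_integrity_transactions
    GROUP BY publisher
    ORDER BY fraud_rate DESC;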

Conclusions

In this post we walked through how we leverage our data platform partner Snowflake to share data with customers while guaranteeing data security and privacy and minimizing customer costs and effort. Our solution was split into two parts: first, we replicated the data of a subset of customers into the cloud platform region where their primary Snowflake account resides; and second, we shared that data with these customers. To implement this workflow we used Snowflake Streams, replicated databases, and Secure Views, and we used Tasks and Stored Procedures to automate it. Finally, we showed how to import a Share as a customer and provided some examples of queries customers can use to explore their White Ops data.

At White Ops, simply protecting customers is not enough: we want to provide them with tools that make it easy to actively join us in the fight against digital fraud and abuse. Data is the most important weapon in our inventory, so securely and seamlessly sharing it with customers in their preferred environment, like Snowflake, is a big win for team Human in the war against bot-wielding cybercriminals and bad actors.