At Monetate, our data warehousing system is in a period of transition. We have a legacy data warehouse based in Hive and Elastic MapReduce, with backing data stores in S3. Over the past year, we’ve been migrating services over from the legacy data warehouse to Redshift, with the eventual goal of shutting down our Hive infrastructure.

Amazon bills Redshift as “a fast, fully managed, petabyte-scale data warehouse service.” Compared with our implementation of Hive, we found Redshift to be much more performant. We weren’t the only ones to come to this conclusion.

One of the services we had to migrate was product recommendations. We aren’t a retail company, but many of our clients are. We base our recommendations on customer behavior observed through Monetate’s integration with our clients’ websites, which allows us to generate data-backed recommendations.

When users express interest in an item (for instance, by viewing its product detail page), we can show them other products that might be of interest to them. The algorithm we use to produce the recommendations boils down to “people who bought this item also bought these other items.”

Implementing product recommendations in Redshift

In our configuration, our old relational data warehousing solution, Hive, was not performant enough to generate product recommendations in SQL. Instead, we implemented the algorithm in MapReduce (MR) using mrjob. I presented that approach to a Philadelphia meetup group in a talk about product recommendation engines in MR.

To implement our “people who bought this item also bought these items” recommendations in Redshift, the behavior that we needed to duplicate was:

For each product, yield the top 50 items purchased alongside it, in descending order of popularity of that product pair.

We had a hunch that Redshift was going to be performant enough to duplicate the functionality of the MR job in pure SQL. To compute the most popular co-purchased products, we first tested the idea with the query below, which turned out to be very performant.
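In sketch form, with simplified table and column names:

    -- Pair up products purchased by the same user and count how often
    -- each pair occurs.
    SELECT p_a.product_id AS pid1,
           p_b.product_id AS pid2,
           count(1) AS frequency
    FROM purchase_line p_a
    JOIN purchase_line p_b
      ON p_a.user_id = p_b.user_id          -- pairs must come from the same user
     AND p_a.product_id != p_b.product_id   -- and must be two different products
    GROUP BY pid1, pid2;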

Example output:

pid1  pid2  frequency
0     1     10
0     2     7
0     3     1
1     2     1
2     3     1

Our purchase line table has one row per product in an order placed on a client’s website. A self-join against that table allows us to generate the desired product pair information. From p_a we are getting the product id that the visitor would be viewing, and from p_b we are getting the other products that people often purchased alongside the viewed product.

The JOIN statement in this query is where the real magic happens:

  • p_a.user_id = p_b.user_id: generated product pairs need to come from the same user
  • p_a.product_id != p_b.product_id: the two products in a pair must be different

Then, by grouping on pid1 and pid2, count(1) computes how often those two product ids were purchased together by the same user.

After we demonstrated that this query was performant, we knew we were on the right track to transition this process from MR to pure SQL.

Limiting results to the top 50

To achieve parity with our MR solution, we had to implement the top 50 filtering. We wanted to preserve this behavior to limit both the output of the process and the number of rows we would need to modify in our data store after each run of the algorithm.

This is where I got stuck implementing the solution in SQL. What we wanted was to yield only the top N items per GROUP BY group (in this case, per product).

I didn’t know of a way to do this, and I was preparing myself for the pain of limiting the result set in code rather than pure SQL. That is when I discovered window functions.

We use MySQL extensively at Monetate. While it’s normally a great database for our needs, it does not have window functions. In MySQL, the functionality can be duplicated with user variables, but window functions are a much more elegant solution to this problem.
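For illustration, here is a rough sketch of the user-variable idiom in pre-8.0 MySQL. The product_pairs table stands in for the output of the pair query above; note that the evaluation order of user variables within a SELECT is technically undefined, which is part of what makes this approach fragile:

    -- Emulate rank() over (partition by pid1 order by frequency desc, pid2)
    -- by scanning pre-sorted rows and bumping a counter whenever pid1 repeats.
    SELECT pid1, pid2, frequency, pair_rank
    FROM (
        SELECT pid1, pid2, frequency,
               @rank := IF(@prev = pid1, @rank + 1, 1) AS pair_rank,
               @prev := pid1 AS prev_pid
        FROM (SELECT @rank := 0, @prev := NULL) init,
             (SELECT pid1, pid2, frequency
                FROM product_pairs
               ORDER BY pid1, frequency DESC, pid2) sorted
    ) ranked
    WHERE pair_rank <= 50;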

Redshift is based on PostgreSQL: you connect with standard Postgres drivers, use psql as a command-line interface, and so on. While not all features of PostgreSQL are exposed by Redshift, window functions are available.
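For example, connecting with psql works just as it would for any Postgres database (the cluster endpoint below is a placeholder; Redshift listens on port 5439 by default):

    psql -h examplecluster.abc123.us-east-1.redshift.amazonaws.com -p 5439 -U awsuser -d dev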

This example from the PostgreSQL documentation, using the window function rank, shows exactly how to solve the problem we had.
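The query, reproduced (modulo formatting) from the window functions section of the PostgreSQL tutorial:

    -- Rank employees by salary within each department,
    -- then keep only the two highest-paid per department.
    SELECT depname, empno, salary, enroll_date
    FROM (
        SELECT depname, empno, salary, enroll_date,
               rank() OVER (PARTITION BY depname ORDER BY salary DESC) AS pos
        FROM empsalary
    ) AS ss
    WHERE pos < 3;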

In English, that query returns the two highest-paid employees in each department.

Awesome! We can apply that directly to our query to achieve the top 50 filtering.
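Putting it together, again with the simplified purchase_line schema from above:

    SELECT pid1, pid2, frequency, rank
    FROM (
        SELECT pid1, pid2, frequency,
               rank() OVER (PARTITION BY pid1 ORDER BY frequency DESC, pid2) rank
        FROM (
            -- the pair-counting query from earlier
            SELECT p_a.product_id AS pid1,
                   p_b.product_id AS pid2,
                   count(1) AS frequency
            FROM purchase_line p_a
            JOIN purchase_line p_b
              ON p_a.user_id = p_b.user_id
             AND p_a.product_id != p_b.product_id
            GROUP BY pid1, pid2
        ) pairs
    ) ranked
    WHERE rank <= 50;  -- keep only the top 50 pairs per viewed product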

Here, we add rank as a column and then wrap it in a SELECT statement to filter on the value in the rank column.

Example output:

pid1  pid2  frequency  rank
0     1     10         1
0     2     7          2
0     3     1          3
1     2     1          1
2     3     1          1

Let’s break down the rank line.

rank() OVER (PARTITION BY pid1 ORDER BY frequency DESC, pid2) rank

First, all product pairs are grouped by their pid1, as described by the PARTITION BY clause. Within each group, the rows are then ordered by frequency in descending order and then by pid2. With rank, rows that are tied under the ORDER BY receive the same rank value. Different product pairs could have the same number of coincident purchases, so adding pid2 to the ORDER BY guarantees a well-defined ordering among all rows in a group.

Then, after the rows have been grouped and ordered, the rows receive a rank denoting their place in the order within each group.

As shown in the example output, the first three rows belong to the same group (pid1 = 0). Within that group they are ordered by frequency in descending order, and each row then receives a rank based on its place in that ordering.

Other window functions

In addition to rank, there are many other window functions available in Redshift. In PostgreSQL, any aggregate function can be used as a window function, and, like rank, some functions are available only as window functions. Two in particular that I’m looking for ways to apply to my craft are lag and lead. Those functions return the value from the row a given offset before or after the current row, in the order defined by the window clause. Like rank, those two functions should enable me to do things in pure SQL that I previously thought were impossible.
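As a quick illustration (my own hypothetical example, assuming the purchase_line table from earlier also has a purchase_time column), lag and lead make it easy to look at each purchase alongside the same user’s previous and next purchase:

    -- For each purchase, show the product the same user bought
    -- immediately before and immediately after it.
    SELECT user_id,
           product_id,
           lag(product_id, 1)  OVER (PARTITION BY user_id ORDER BY purchase_time) AS prev_product,
           lead(product_id, 1) OVER (PARTITION BY user_id ORDER BY purchase_time) AS next_product
    FROM purchase_line;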

Not only speed but savings

We’ve demonstrated the ease with which we can transition one of our legacy MR jobs to pure SQL running against Redshift. Not only was this a neat problem to solve, but it will end up lowering the cost of our backend services.

Currently, the nighttime use of our Redshift cluster is low, so adding work onto the cluster during non-peak hours will not increase the cost of the cluster for us and should not noticeably affect performance.

Meanwhile, we are renting Elastic MapReduce clusters just to generate the product recommendation data. Once the transition is complete, that will be one fewer job we need to rent an EMR cluster for, saving us money and bringing us a step closer to a unified data warehousing solution once all our products are off the legacy system.