One challenge: Given a catalog of merchant products, auto-magically determine similar products to produce “recommended product” groupings … without any input from the merchant.
The Breadcrumb Analogy
It is easy enough to find products with similiar attributes, but the difficult part is figuring out how to rank the relevance of attributes in the absence of a merchant-supplied hierarchy.
One way to assess the position of an attribute in a subjective hierarchy is to see how many times each attribute occurs in a given merchant’s catalog — the greater the frequency of occurrence of an attribute, the higher hierarchical rank it can be assumed to have.
For example, in a clothing catalog, a logical ranking of product attributes might be represented by the following (analogous to a breadcrumb trail):
Here it is reasonable to assume that there will be more occurrences of “men’s” than of “shoes,” and so on.
The breadcrumb analogy can, however, be misleading:
The problem here is that “black” probably occurs as an attribute far more frequently than “wingtips,” or even “shoes.”
The trick, then, is to assess relevancy, rather than to try to impose a hierarchy.
At this point the problem begins to look like a graphing problem, where products can be represented as nodes and attributes (i.e. categories) can be represented as edges. The weight of the edges is determined by the frequency of their occurrence (although inversely, as it turns out).
Visualizing the Problem
If it’s starting to look like a graphing problem, then of course a visualization might help us determine how interconnected our merchants’ products are, and give us a clue as to how successful grouping by attribute frequency might be.
Let’s look at two merchants’ catalogs visualized as networks:
The first merchant sells a wide range of clothing and accessories. We can see right away that most of the items in their catalog show varying degrees of interconnectivity, but also that there are isolated groups. An investigation of the node values reveals that the isolated groups are bound by attribute edges such as “new” and “on sale,” while the great mass of interconnected items share attributes such as “men’s” and “pants,” as we expected.
For the bulk of this catalog, then, it appears that our attribute similarity grouping idea shows promise.
The second merchant has a small catalog consisting mostly of jeans and T-shirts. From the network diagram, we can see right away that all products are tightly bound by a small number of attribute edges — mostly “men’s”, “women’s,” “jeans,” and “t-shirt.” In this case, trying to find a similarity hierarchy based on product attributes may not work as well.
Time to test our strategy. Let’s build a simple database that will map items to each other by the combined weights of their shared edges.
We could create a table like this:
But this presents a scaling problem, since the size of our similarity database will be the determined by the total number of interconnections between products — potentially huge!
A look at our visualization gives us a hint at a more compact storage strategy. If we describe each product as the sum of its attributes (expressed as a hash), then we discover that there are (in a well-classified catalog) far fewer unique attribute hashes than unique product IDs. We can simply store product IDs related to each unique hash of attributes, as opposed to storing each individual product relationship.
So our new table looks like this:
And then, to simplify hash lookups, we’ll store a map of product IDs to attribute hashes:
Now all we need to do is some real-world testing — a simple script that will take a product ID and return a list of similar products should do the trick.
A Snag and a Solution
In testing, we discover a problem with our strategy. The most heavily weighted attributes are actually the least significant (see the “wingtips -> black” example above).
The solution? Invert the significance of attribute frequency by defining the weight as 1/(number of occurrences). Now “wingtips” is the most significant, and our recommended-item list looks reasonable.
Where This Approach Won’t Work
This approach will work for merchant catalogs that are well-categorized, meaning that the ratio of products to attributes is not skewed too far in either direction.
Our T-shirt and jeans example does not have enough attributes defined, so the recommended product list returned by our test script is not especially meaningful. Connections between products are too general.
On the other extreme, a merchant with an attribute-to-product ratio approaching 1/1 (i.e. too many too-specific attributes) will produce recommended-item lists that are too small and too similar.
In these edge cases, we’d recommend an alternate approach, which will typically involve collecting more information from the merchant.