Compressing URLs in your Webapp, for size and speed

Last year I had a chance to talk about the internals of our service: Pathtraq at Percona Performance Conference (slides), in which I described the methods we use to compress the URLs in our database to below 40% of the original size, however had not released the source code since then.  I am sorry for the delay, but have finally uploaded the code to github.com/kazuho/url_compress.

It is generally considered difficult to achieve high ratio for compressing short texts.  This is due to the fact that most compression algorithms are adaptive, i.e., short texts reach their end before the compressors learn how to encode them efficiently.

Our approach uses an algorithm known as "static PPM (prediction by partial matching)", that uses a pre-built table for compressing tables.

By using static PPM it is possible to achieve high compression ratio against hostnames or domains (e.g. "www" or "com") or English words that occasionally appear in paths or query parameters (e.g. "search").  And even in case of compressing unknown words in URLs the method works fairly well by utilizing the lower-order probability prediction tables for syllable-level compression.

Th

C / C++ | in English | MySQL
Oct 21, 2010 14:46



Comments

View Comments (0)

Post a comment