Highlights

Reddit: Scaling Reporting

/images/system-design-weekly/001/0010.png

Reddit had an ad analytics system that aggregated data per ad ID and per day and stored each aggregate in Redis as a Thrift object. This works well when an advertiser wants analytics for a single day. However, looking up stats for a date range meant that the application had to fetch the value for every date in the range and deserialize the Thrift dictionaries, which is a CPU-intensive operation. The aggregates were then calculated in the application layer. As these aggregations grew, the Reddit team had to over-provision servers to handle them.
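A minimal sketch of why range queries were expensive in the per-day layout, assuming a plain dictionary in place of Redis and plain dicts in place of Thrift objects. The names `DAILY_STATS`, `days_in_range`, and `aggregate_range` are hypothetical and not taken from Reddit's code; the point is only that the application does one lookup per day and sums the results itself.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical stand-in for the per-day store: (ad_id, day) -> metric counters.
# In the system described above, each value would be a serialized Thrift object in Redis.
DAILY_STATS = {
    ("ad-1", date(2021, 1, 1)): {"impressions": 100, "clicks": 4},
    ("ad-1", date(2021, 1, 2)): {"impressions": 150, "clicks": 6},
    ("ad-1", date(2021, 1, 3)): {"impressions": 50, "clicks": 1},
}


def days_in_range(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def aggregate_range(ad_id, start, end):
    """Sum per-day metrics in the application: one lookup per day in the range."""
    totals = Counter()
    for day in days_in_range(start, end):
        stats = DAILY_STATS.get((ad_id, day))
        if stats is not None:
            totals.update(stats)  # adds the day's counts to the running totals
    return dict(totals)


if __name__ == "__main__":
    print(aggregate_range("ad-1", date(2021, 1, 1), date(2021, 1, 3)))
```

The work grows linearly with the length of the date range, and in the real system each lookup would also pay for a network round trip and a deserialization.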

As a solution, the team decided to store events in Apache Druid, with a column for every dimension to query. Druid is a column-oriented analytics database that supports SQL queries. The application can now run sub-second queries against the database instead of doing the heavy lifting of aggregation itself. Reddit also collaborated with Imply to build analytics dashboards.
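For contrast, here is a hedged sketch of how the same range aggregate might be pushed down to Druid as a single SQL query. The table name `ad_events` and the columns `ad_id`, `impressions`, and `clicks` are assumptions for illustration, not Reddit's actual schema. The function only builds the JSON body that would be posted to Druid's SQL endpoint (`/druid/v2/sql`); it does not send anything.

```python
import json


def build_druid_sql_request(ad_id, start_iso, end_iso_exclusive):
    """Build the JSON body for a Druid SQL query over a date range.

    Table and column names are hypothetical; only `__time` is Druid's
    built-in timestamp column.
    """
    query = (
        "SELECT SUM(impressions) AS impressions, SUM(clicks) AS clicks "
        "FROM ad_events "
        "WHERE ad_id = ? "
        "AND __time >= TIME_PARSE(?) AND __time < TIME_PARSE(?)"
    )
    payload = {
        "query": query,
        # Values are bound as parameters instead of being spliced into the SQL text.
        "parameters": [
            {"type": "VARCHAR", "value": ad_id},
            {"type": "VARCHAR", "value": start_iso},
            {"type": "VARCHAR", "value": end_iso_exclusive},
        ],
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_druid_sql_request("ad-1", "2021-01-01T00:00:00Z", "2021-01-04T00:00:00Z"))
```

With this shape, the database scans only the columns it needs and returns one row, so the application no longer iterates over days.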

Netflix: Orchestrated Functions as a Microservice (The Netflix Cosmos Platform)

/images/system-design-weekly/001/0011.png

Netflix receives videos that should be converted to various formats and resolutions to be played on different devices. The infrastructure evolved over time into an API that feels like a microservice but orchestrates different serverless functions under the hood.

The application and infrastructure parts of the system are independent of one another. All subsystems communicate via a high-scale, low-latency priority queue called Timestone.

The system is modular, so each part of it can be developed and tested independently: for example, encoding at different resolutions, video codecs, audio codecs, and subtitle generation. The entire process can be monitored visually in the Nirvana portal.

The platform is built on top of Apache Karaf. The entire system is low latency and ready for spikes in demand for computational resources. In this case, low latency also covers the time needed to allocate resources for an operation.

SLOs are measured in tasks per day and cost per task rather than tasks per second. Overall, the system can be described as “microservices that trigger workflows that orchestrate serverless functions”.
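The layering in that description can be illustrated with a toy, in-memory sketch. Everything here is hypothetical: `PriorityQueue` is a stand-in for a queue such as Timestone, and `handle_request`, `ingest_workflow`, `encode_video`, and `generate_subtitles` are invented names, not part of Cosmos. The sketch only shows the shape: an API call triggers a workflow, and the workflow enqueues small stateless functions that are executed in priority order.

```python
import heapq
import itertools


# Hypothetical "serverless functions": each does one stateless unit of work.
def encode_video(title, resolution):
    return f"{title}@{resolution}"


def generate_subtitles(title):
    return f"{title}.subtitles"


class PriorityQueue:
    """In-memory stand-in for a priority queue (lower number is served first)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def put(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._order), task))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def ingest_workflow(title, queue):
    """Workflow layer: break one request into function invocations."""
    for resolution in ("1080p", "480p"):
        queue.put(1, (encode_video, (title, resolution)))
    queue.put(2, (generate_subtitles, (title,)))


def handle_request(title):
    """API layer: to the caller this looks like a single microservice call."""
    queue = PriorityQueue()
    ingest_workflow(title, queue)
    results = []
    while len(queue):
        func, args = queue.get()
        results.append(func(*args))
    return results


if __name__ == "__main__":
    print(handle_request("movie"))
```

Because the layers only meet at the queue, each function can be developed and tested on its own, which matches the modularity described above.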

Dropbox: Our journey from a Python monolith to a managed platform

/images/system-design-weekly/001/0012.png

Dropbox has been struggling with its Python monolith for a long time. This core legacy system, called Metaserver, also uses an outdated framework, and it is hard to break down into smaller microservices.

A hybrid approach was chosen: the monolith keeps its business logic, while everything that can be extracted from it, such as authentication, is moved out into separate services.

This set of microservices evolved into a kind of serverless platform, similar to AWS Fargate, called Atlas. Dropbox is switching to gRPC where possible and relies on the Envoy proxy for HTTP-to-gRPC transcoding.

Atlas currently serves 25% of the monolith's traffic, and Metaserver can be deprecated in the near future.

Levven: How Levven Keeps Your Smart Home Appliances On When The Internet Goes Out

/images/system-design-weekly/001/0013.png

IoT is slowly taking over the world. There have been cases where an AWS region failure caused unexpected outages of smart home devices, such as Roomba vacuums.

In this case study, Levven, a smart home provider, describes how it leveraged CockroachDB to keep smart home devices operating even if some cloud regions go down.

eBay: How eBay’s Distributed Architecture Surfaces More Item Listings for Buyers

/images/system-design-weekly/001/0014.png

eBay built a sophisticated system to understand user queries like “Apple Macbook Pro 2019”, in which brand, model, and year are recognized. This data is then combined with user filters like “Electronics → Computers, Tablets & More → Laptops & Netbooks → Apple Laptops → Release Year”. Such data can provide insights on popular searches.
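As a rough illustration of what recognizing brand, model, and year in a query means, here is a toy sketch based on dictionary lookups and a regular expression. The vocabularies `KNOWN_BRANDS` and `KNOWN_MODELS` and the function `parse_query` are hypothetical; eBay's actual system is far more sophisticated and is not described at this level in the article.

```python
import re

# Hypothetical vocabularies; a real system would learn these from listing data.
KNOWN_BRANDS = ["apple", "dell", "lenovo"]
KNOWN_MODELS = ["macbook pro", "macbook air", "thinkpad"]


def parse_query(query):
    """Extract brand, model, and year from a free-text search query."""
    text = query.lower()
    result = {"brand": None, "model": None, "year": None}

    # A four-digit number starting with 19 or 20 is treated as a release year.
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    if year:
        result["year"] = int(year.group(0))

    # Brands must match as whole words.
    for brand in KNOWN_BRANDS:
        if re.search(rf"\b{re.escape(brand)}\b", text):
            result["brand"] = brand
            break

    # Try longer model names first so the most specific match wins.
    for model in sorted(KNOWN_MODELS, key=len, reverse=True):
        if model in text:
            result["model"] = model
            break

    return result


if __name__ == "__main__":
    print(parse_query("Apple Macbook Pro 2019"))
```

Structured output like this is what makes it possible to line a query up with category filters such as release year.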

Furthermore, eBay can automatically suggest fields for a seller to fill in, improving the chances that the item shows up in searches.

Stack Overflow: Best practices can slow your application down

Stack Overflow is famous for running its service on a minimal number of servers. Historically, its tech stack has been .NET and Microsoft SQL Server. This means horizontal scaling was best avoided, as each new server needs an additional license. The other option Stack Overflow had was to scale up.

This led to various performance optimizations, such as memoization and caching. As a side effect, the software became hard to mock and hard to fit with dependency injection for unit tests. Consequently, there are not many unit tests in the code base. And it’s not the end of the world.
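A small sketch of the trade-off, using Python's `functools.lru_cache` rather than Stack Overflow's actual .NET code. The functions `slow_lookup` and `get_display_name` and the `CALLS` counter are hypothetical. Memoizing at module level is fast and simple, but the cache is global state with a hard-wired dependency, which is the kind of design that is awkward to mock.

```python
from functools import lru_cache

# Counts how often the expensive path actually runs (for demonstration only).
CALLS = {"count": 0}


def slow_lookup(user_id):
    """Hypothetical expensive operation, e.g. a database round trip."""
    CALLS["count"] += 1
    return f"user-{user_id}"


@lru_cache(maxsize=1024)
def get_display_name(user_id):
    """Memoized wrapper: repeated calls with the same id skip slow_lookup."""
    return slow_lookup(user_id)


if __name__ == "__main__":
    get_display_name(42)
    get_display_name(42)
    print(CALLS["count"])  # the second call was served from the cache
    # Tests that share this module must reset the cache between cases.
    get_display_name.cache_clear()
```

The dependency on `slow_lookup` is baked in rather than injected, so a unit test has to patch the module and clear the cache instead of simply passing in a fake.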

System Design

Building an End to End load test automation system on top of Kubernetes

Mongo Change Streams in Production

How ERGO implemented an event-driven security remediation architecture on AWS

How Post Content is Stored on Tumblr

Software Architecture

Modernising A Legacy Android App Architecture, Part One: The Single Activity

Modernising A Legacy Android App Architecture, Part Two: MVVM-ish

Modernising A Legacy Android App Architecture, Part Three: Applying The Refactor

Building Multiple Distinctly Branded iOS Apps from a Single Codebase

How Pinterest fights misinformation, hate speech, and self-harm content with machine learning

Culture

How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020

How async and all-remote make Agile simpler

How we acquired HEY.com

Embedding Security into Software Development Life Cycle

Products

Cloudflare: Flow-based monitoring for Magic Transit - Mitigating DDoS attacks on-demand or always.

Red Hat: Packaging APIs for consumers with Red Hat 3scale API Management - rate limits and pricing feature management for developers and customers accessing an API.

Redis Labs: RediSearch Secondary Index Responds Faster, Streamlines Indexing - version 2.0 of the in-memory index, which is about 2.4 times faster than previous versions.

Okta Signs Definitive Agreement to Acquire Auth0