Uber's engineering team wrote about how their big data platform evolved from traditional ETL jobs with relational databases to one based on Hadoop and Spark. A scalable ingestion model, standard transfer format and a custom library for incremental updates are the key components of the platform.
Various teams at Uber use big data for things like forecasting rider demand, fraud detection, geospatial computation and addressing bottlenecks in the rider-partner sign-up process. Their initial solution, developed before 2014, was based on MySQL and PostgreSQL. The relatively small amount of data they had then - a few TB - could fit into these RDBMSs, and users had to figure out themselves how to query across databases. The city operations teams, data scientists and analysts, and engineering teams used this data. An effort at standardization led to the adoption of Vertica - a column oriented analytics platform - supported by ad-hoc Extract-Transform-Load (ETL) jobs. A custom query service provided access to the data using SQL. The amount of data grew to 10s of TBs, accompanied by a growth in the number of teams and services that used this data. The key problems Uber faced at this stage were a lack of horizontal scalability, rising expenses and data loss arising from the lack of a formal schema between data producers and consumers.
The engineering team adopted Hadoop in the next phase to ingest data from multiple stores without transforming it. Apache Spark, Apache Hive and Presto as the query engine were part of the stack. Vertica was fast, but could not scale cheaply, while Hive had the opposite problem(PDF). Storing the schema and the data together using a custom schema service solved the problems faced in the previous phase. The amount of data grew to 10s of PBs, and the data infrastructure ran 100k jobs per day across 10000 virtual CPU cores.
In spite of horizontal scalability being in place, they ran into HDFS bottlenecks. In an HDFS cluster, a NameNode keeps track of where each file in the cluster is kept, and maintains the directory tree. HDFS is optimized for streaming access of large files, and a lot of small files makes access inefficient. Uber ran into this problem when their data volume grew beyond 10 PB. They got around the HDFS bottlenecks by tuning NameNode garbage collection, limiting the number of small files and an HDFS load management service. In addition, the data was not available fast enough to end users. Reza Shiftehfar, Engineering Manager at Uber, writes that
Uber's business operates in real time and as such, our services require access to data that is as fresh as possible. To speed up data delivery, we had to re-architect our pipeline to the incremental ingestion of only updated and new data.
Image courtesy - https://eng.uber.com/uber-big-data-platform/
The result was a custom Spark library called Hudi (Hadoop Upserts anD Incrementals). It forms a layer on top of HDFS and Parquet (a storage file format) that allows for updates and deletes, thus meeting the goal of ETL jobs becoming incremental. Hudi works by letting users query by their last checkpoint timestamp to fetch all the data that have been updated since the checkpoint, without the need to run a full table scan. This brought down the latency from 24 hours to less than an hour for modeled data and 30 minutes for raw data.
Along with Hudi, the other addition to the latest phase of Uber's big data platform is data ingestion through Apache Kafka with metadata headers attached. A component called Marmaray picks up the changes from Kafka and pushes them to Hadoop using the Hudi library. All this is orchestrated using Apache Mesos and YARN. While Mesos is good for long running services, YARN is better suited for batch/Hadoop jobs. Uber uses its custom scheduler framework Peloton, built on top of Mesos, to manage its compute workloads.