2. Lakehouse: a new kind of platform, Matei Zaharia, Databricks

👉 Slides
Historically, data warehouses weren’t designed for data science, and they are expensive for huge datasets
2010s: data lakes: store all raw data, including non-tabular data, in a single low-cost store
Open data formats like Parquet are accessible directly by data scientists
Problems with the two-tier architecture (lake + warehouse): cheap to store all the data, but complex to operate
Data reliability suffers from multiple storage systems with different ETL jobs
Timeliness suffers from the extra steps before data can be used in the warehouse
High cost from running parallel systems and duplicating data
Key technology enabling the lakehouse: a metadata layer for data lakes
track which files are part of each table version, so a table can be queried as a consistent set of files
file versioning (e.g. Delta Lake) avoids crashing jobs when the underlying data is updated
all table versions are retained, so you can time travel and stream changes
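The notes above can be made concrete with a toy sketch (not Delta Lake's actual implementation, and all file names are made up): a transaction log records which data files make up each table version, so readers always see a consistent snapshot and can "time travel" to any older version.

```python
# Toy transaction log: each commit produces a new table version, defined as
# the set of data files that belong to the table at that version.

class TransactionLog:
    def __init__(self):
        # versions[i] = frozenset of file names in the table at version i
        self._versions = []

    def commit(self, added, removed=()):
        """Record a new table version by adding/removing data files."""
        current = self._versions[-1] if self._versions else frozenset()
        snapshot = (current - frozenset(removed)) | frozenset(added)
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def snapshot(self, version=None):
        """Files visible at a given version (default: latest)."""
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]

log = TransactionLog()
v0 = log.commit(added=["part-0.parquet", "part-1.parquet"])
v1 = log.commit(added=["part-2.parquet"], removed=["part-0.parquet"])

print(sorted(log.snapshot(v0)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(log.snapshot()))    # ['part-1.parquet', 'part-2.parquet']
```

Because old snapshots are immutable, a long-running job reading version 0 is unaffected by the later commit, which is the "avoid crashing jobs" point above.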
Lakehouse engine designs: performant SQL
4 optimisation tricks:
auxiliary data structures (statistics, always consistent with the data)
for each Parquet file, per-column statistics such as min/max values (e.g. year ranges, UUIDs)
when you read a snapshot of the table, you also read the statistics and use them to skip files during SQL query filtering
caching
vectorisation (Databricks Photon, operating on columnar Parquet data)
New query engines like Databricks’ use these techniques
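The statistics-based filtering trick above (often called data skipping) can be sketched in a few lines; the file names and min/max values here are invented for illustration, and a real engine would read these statistics from Parquet footers or the table metadata.

```python
# Per-file min/max statistics for the "year" column, stored in the
# metadata layer so a query can prune whole files without opening them.
file_stats = {
    "part-0.parquet": {"year": (2015, 2017)},
    "part-1.parquet": {"year": (2018, 2020)},
    "part-2.parquet": {"year": (2021, 2023)},
}

def files_to_scan(column, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range."""
    return [
        name
        for name, stats in file_stats.items()
        if stats[column][0] <= hi and stats[column][1] >= lo
    ]

# A query filtering on 2019 <= year <= 2023 only needs two of the three files:
print(files_to_scan("year", 2019, 2023))  # ['part-1.parquet', 'part-2.parquet']
```

On large tables this pruning happens before any data is read, which is why keeping the statistics consistent with the table snapshot matters.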
Declarative I/O format for ML
ML on a warehouse is painful because ML tools don’t consume data via SQL, so you add extra ETL jobs
ML over a lakehouse can read Parquet directly, and Spark can still do query optimisation
MLflow can also help with the ML lifecycle and data version tracking
Lakehouse combines the best of DWs and lakes
Q&A
Data quality?
Automatic tests and table versioning
End-to-end sanity checks on tables
Snowpark vs Databricks?
Snowpark is a proprietary API for running Java and Python workloads and can’t reuse existing Spark code; Databricks supports open-source APIs (distributed PyTorch, Keras, XGBoost)
Data mesh vs a platform data architecture?
Different teams can manage their own storage, so decentralised ownership of data is easier