IDEAS Amundsen Presentation

Saturday, October 26th 2019
Alagappan Sethuraman | Engineering Manager, Lyft
Daniel Won | Software Engineer, Lyft
Disrupting Data Discovery

Agenda
• What is Data Discovery?
• Challenges in Data Discovery
• Introducing Amundsen
• Amundsen Architecture
• Impact and Future Work

Data is used to make informed decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualization
4. Share insights and/or make a decision
Make data the heart of every decision

What is Data Discovery?
Consider a data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/create a visualization
4. Share insights and/or make a decision
Steps 1 and 2 make up Data Discovery

Challenges in Data Discovery

Hi! I’m a new Analyst!
• My first project is to predict attendance for the IDEAS conference
• Goal: Help the office team decide how many chairs to provide
• Idea: Let’s look at attendance from previous conferences… but where do I look?

Step 1: Search & find data
• Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the GitHub repos
We end up finding a table, hosted_events, that seems to be the right one

Step 2: Understand the data
• You find several columns that might be what you're looking for:
‒ booked, registered, and attendance
• But you still have many questions, such as:
‒ Does attendance include staff?
‒ What's the difference between booked and registered?
‒ How accurate are these figures?

Step 2: Understand the data
● Look for further documentation on these columns
○ Where does this documentation live?
● Ask an expert who knows this table
○ Who is an expert?
● Run some queries to try to figure it out, at the risk of being wrong
SELECT * FROM schema.hosted_events
LIMIT 100;
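
Exploratory queries of this kind can go a step beyond SELECT *. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory stand-in for the table. Only the column names booked, registered, and attendance come from the slides; the rows, the event_name column, and the function name are hypothetical.

```python
import sqlite3

# In-memory stand-in for the warehouse table. Every row is invented, and
# event_name is a hypothetical column added only to label the rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosted_events ("
    "event_name TEXT, booked INTEGER, registered INTEGER, attendance INTEGER)"
)
conn.executemany(
    "INSERT INTO hosted_events VALUES (?, ?, ?, ?)",
    [
        ("conf_a", 120, 100, 90),
        ("conf_b", 200, 180, None),  # attendance was never recorded
        ("conf_c", 80, 85, 70),      # more registrations than bookings
    ],
)


def profile_hosted_events(conn):
    """Count rows that contradict naive guesses about the three columns."""
    row = conn.execute(
        """
        SELECT
            COUNT(*),
            SUM(attendance IS NULL),
            SUM(registered > booked),
            SUM(attendance > registered)
        FROM hosted_events
        """
    ).fetchone()
    names = (
        "total_rows",
        "missing_attendance",
        "registered_over_booked",
        "attendance_over_registered",
    )
    return dict(zip(names, row))


print(profile_hosted_events(conn))
```

Checks like these can surface surprises, but they cannot say whether attendance includes staff. That still requires documentation or an owner to ask.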

Nearly 1/3 of Data Scientist time is spent in Data Discovery
• Data discovery is a problem because of a lack of understanding of what data exists, where it lives, who owns it, and how to use it.
• Data Discovery provides little to no intrinsic value
• Impactful work happens in Analysis

What is Amundsen?
• Built at Lyft, official launch in late 2018
• Inspired by Google Search, Airbnb Data Portal, and
Apache Gobblin
• Named after Norwegian explorer Roald Amundsen
‒ Led the first expedition to the South Pole
‒ Led the first expedition through the Northwest Passage

Computed Column Statistics
Disclaimer: these stats are arbitrary.
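
As a rough illustration of what computed column statistics might look like, here is a minimal sketch in plain Python. The slides do not say which statistics Amundsen computes or how, so the chosen statistics and the name column_stats are assumptions made for this example.

```python
from statistics import mean


def column_stats(values):
    """Compute basic summary statistics for one numeric column.

    `values` may contain None to represent missing entries.
    """
    present = [v for v in values if v is not None]
    stats = {
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
    }
    # min, max, and mean are only defined when at least one value is present.
    if present:
        stats.update(min=min(present), max=max(present), mean=mean(present))
    return stats


# Hypothetical attendance figures, one of them missing.
print(column_stats([10, 20, None, 30]))
```

Shown next to a column description, numbers like these help a reader judge how complete and plausible a column is before querying it.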

Why choose a graph database?
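
The slide text does not spell out the reasoning, so the following is only an illustrative assumption: metadata is largely relationships (a table has columns, a user owns a table), and those map naturally onto nodes and edges. The sketch below models that in plain Python. The node names and relation labels are hypothetical and are not Amundsen's actual schema.

```python
from collections import defaultdict


class MetadataGraph:
    """Tiny in-memory graph: (source node, relation) -> set of target nodes."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, source, relation, target):
        """Add a directed, labeled edge from source to target."""
        self._edges[(source, relation)].add(target)

    def neighbors(self, source, relation):
        """Return the targets reachable from source via relation, sorted."""
        return sorted(self._edges.get((source, relation), set()))


graph = MetadataGraph()
graph.relate("table:hosted_events", "HAS_COLUMN", "column:attendance")
graph.relate("table:hosted_events", "HAS_COLUMN", "column:booked")
graph.relate("table:hosted_events", "OWNED_BY", "user:alice")
graph.relate("user:alice", "OWNS", "table:hosted_events")

# "Who is an expert on this table?" becomes a one-hop traversal.
print(graph.neighbors("table:hosted_events", "OWNED_BY"))
```

In a graph database the same questions are expressed as traversals rather than joins, which is one plausible reason to prefer it for this kind of data.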

Neo4j is the source of truth for editable metadata

Why not propagate the edited metadata back to the source?


Amundsen’s Impact at Lyft
• Deployed at Lyft for over 1 year
• Over 700 Weekly Active Users
• 90% penetration among Data Scientists
• Reduced mean time to discovery by 75%
• Also used by Data Eng, Software Eng, PMs, Ops, Marketing Managers,
and more

Amundsen is Open Source!
• github.com/lyft/amundsen
• 200 GitHub stars, 10 companies contributing back
• Slack channel with 250 people from 30 companies
• Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow, LA, and NYC by Lyft employees and the community

Community Overview
Contributors
Active community

Alagappan Sethuraman | /in/alagappanut
Daniel Won | /in/danwon
Project Code @ github.com/lyft/amundsen
Icons under Creative Commons License from https://thenounproject.com/
