Pipeline Team

People

Team members

Mission

Provide the best events pipeline in the world.

Objectives: Q4 2023

Keep the lights on
- Increase isolation with Redis - Xavier Vello
- Roll out RDKafka - Brett Hoerner
- All events ingested through capture-rs - Xavier Vello
- Make GeoIP cheaper - Ted
- Move event properties follow-ups, cohorts, ... outside of the ingestion hot path - Ted
Batch exports migrations - Tomás Farías Santana
- Redshift and maybe replicator depending on demands
- Add logs/metrics/alerting
- Higher frequency
Webhooks v2 - Tiina Turban
- A new delivery system that supports:
  - retries
  - observability, for both users and us, into success rate, retry count, and latency
  - endpoint disabling
  - multiple webhook destinations
  - a UI
  - automation for other teams' needs
- Move webhooks, resthooks, and onEvent apps to the new system
PoE - Brett Hoerner
- KV store for person overrides
- Stretch: Squash person overrides
- Stretch: persons without ordering constraints

Objectives: Q3 2023

Define and enforce limits on what users can send us - Owner @xaviervelo
Deprecate and improve apps
- Finish batch exports so we can deprecate export apps - Owner @tomasfarias
- Remove all processing apps - Owner @tiina303
Figure out the pipeline architecture for the next 2 years - Owner @timgl

Objectives: Q2 2023

Objective: CDP - Owner @fuziontech
- Key Result: Rock solid webhooks - Owner @hazzadous
  - Edit: deprioritized as it is not a burning bridge. Instead:
  - Improve the quality of the existing offering by removing questionable apps: mark apps that store lots of data in Postgres, or are generally lower quality, as non-global, or disable them altogether.
  - Separate export and CDP apps within the interface so that the two products can be separated and billed separately.
  - Make batch processing exports rock solid, as below.
- Key Result: Rock solid batch processing (exports) - Owner @tomasfarias
Objective: Pipeline unblocks query performance and ensures data quality - Owner @tiina303
- Key Result: Persons on Events write path and squash shipped with monitoring and alerting
- Key Result: Backfills are complete for all teams
- Key Result: Guaranteed deduplication of UUIDs in events table within 7 days (with docs explaining it)
- Key Result: Query time deduplication of events to cover until ClickHouse deduplication is complete
Objective: Pipeline Robustness and Reliability - Owner @xvello
- Key Result: Errors cause Kafka lag instead of producing to the DLQ, e.g. person creation failures cause Kafka lag rather than going to the DLQ
- Key Result: Plugin server scales elastically and deploys do not create more than 2 minutes of lag

Objectives: Q1 2023

Objective: Performance
- Key Results: We have wrapped up the person-on-event project and have deprecated the old non-person-on-events queries
  - Why? Performance speed up
- Key Results: We have reduced the cost per event for capture by an order of magnitude
  - Why? Infra savings and improved performance
Objective: Reliability
- Key Results: We have converted all current US dashboards into IaC dashboards configured in Terraform and made all necessary migrations from StatsD to Prom to support this.
  - Why? Gets US and EU equivalent in terms of monitoring
- Key Results: All of our alerts have runbooks
  - Why? Improve incident recovery times and share knowledge with all engineers, so that most incidents can be resolved without escalating to the team
- Key Results: Backfills do not slow us down or take down the system. We have tests for this.
  - Why? Improves service quality and protects against bad actors
- Key Results: Erroring apps fail gracefully, do not take down anything else, and re-enable after temporary unavailability. We have tests to prove this.
  - Why? Improves service quality and tackles customer annoyance of apps turning off when there's an error

Responsibilities

Team Ingestion owns our ingestion pipeline end-to-end. That means we own:

- Django server ingestion API
- Ingestion app server
- Exports and webhooks
- Transformation apps
- Client libraries, where it pertains to event ingestion
- Kafka and ClickHouse setup, where it pertains to event ingestion
- Async migrations

Our work generally falls into one of three categories, in order of priority:

Ingestion robustness

On the road to providing the best events pipeline in the world, we need to build a system that is robust.

To do so, we must ensure, in order of priority:

- Data integrity: Events ingested should be correct
- Availability: We should not lose events
- Scalability: We should be able to scale to massive event volumes
- Maintainability: It should be easy to debug and contribute to our ingestion pipeline

Thus, it is our responsibility to consistently revise our past decisions and improve processes where we see fit, from client library behaviors to ClickHouse schemas.

Scaffolding to support core PostHog features

In order to achieve company goals or introduce new features (often owned by other teams), changes to our ingestion pipeline may be required.

An example of this is the work to remodel our events to store person and group data, which is essential to ensuring we can provide fast querying for users. While querying data is not owned by this team, the change to enable faster queries requires a large restructuring of our events pipeline, and thus we are owners of that component of the project.

In short, a core responsibility of our team is to enable other teams to be successful.

Pipelines

We're building the most extensible and integrated data pipeline, which we treat as a separate product, just like product analytics and feature flags. The PostHog data warehouse is the most important destination, but we care about getting data into every place where it is valuable.

How do we work?

We run a 30-minute standup on Mondays, Wednesdays, and Fridays, and extend the slot if we feel the need for a longer synchronous discussion about a specific topic. We look at the Pipeline Team GitHub Board, and we document every standup in this doc.

We are happy to sync anytime if we feel it is important to do so. This is generally coordinated on Slack where someone will start a Slack huddle. Some of the reasons we sync include: debugging outages, sharing context (including shadowing), making decisions when there's been a deadlock, and pairing sessions.

We have a single owner assigned to each priority and sprint goal, but we work together on goals as a team. We try to avoid lone-wolfing, so in general we'll have 2 sprint goals per sprint.

Secondary (a.k.a. Luigi)

We have a secondary schedule. The idea is to have a single person catch all of the interruptions and context switching. A secondary rotation is two weeks long and aligns with our sprints. We don't assign sprint goals to the secondary.

During this rotation, the secondary's first priority is the following responsibilities:

- Firefighting
- Questions in #support-pipeline
- Customer Success triage
- Adding or improving runbooks
- Follow-ups from outages
- Cleaning up Sentry

If there is a long-running customer issue, we'll assign an owner as a team.

It's acceptable for the secondary responsibilities to take up all of your time. Otherwise, you can help out the rest of the team with sprint goals. It's up to you to prioritize correctly.

Slack channel

#team-ingestion

What we're building

PostHog Customer Data Platform
We cover some of the functionality of existing CDP solutions, such as Segment. By creating a UI which specifically encompasses the ideas of "Sources" and "Destinations", along with building out more of the integrations, we can turn PostHog into a leading CDP solution.
Progress
- PostHog CDP