21. Feature engineering at scale with Dagger and Feast, Ravi Suhag, Gojek
https://www.youtube.com/watch?v=9B9qqqJVm4M&t=48s&ab_channel=Tecton
-
Feature engineering: transforming raw data into high-quality model inputs and feature vectors
Store them in a feature store
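To make this concrete, a minimal sketch of a feature transformation written as streaming SQL, in the style of the Dagger examples later in these notes; the column names and the 1-minute window are illustrative assumptions, not from the talk:

-- Hypothetical feature: bookings per customer over a 1-minute tumbling window
-- (table and column names are illustrative)
SELECT
  customer_id,
  COUNT(*) AS bookings_last_minute,
  TUMBLE_END(rowtime, INTERVAL '60' SECOND) AS window_timestamp
FROM booking_log_stream
GROUP BY customer_id, TUMBLE(rowtime, INTERVAL '60' SECOND)

The output rows are the feature vectors that get written to the feature store (Feast)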
-
Better features give more flexibility in modelling, allow simpler models, and yield better models
-
Challenges of feature engineering
-
Inconsistency between training and serving
Especially for real-time features: you want the same transformation to happen in streaming as in batch
-
DS don’t want to manage data pipelines; they want to focus more on modelling
-
Scaling data infra
Hard for DS to manage Flink, Spark and Kubernetes clusters
-
Real-time features need skilled data engineers to write more optimized jobs
-
Goals
- Unify real-time and batch processing: the logic needs to be the same. Self-service platform
- Elastic infra: every time a DS adds a job, the infra can scale up
- Standard and reusable transformations, with a standard way to define them
- No extra skills required from DS to create real-time features
-
Dagger is our solution: built on top of Flink for stateful processing of data
-
Pipeline with Dagger and Feast
-
Dagger architecture
-
Features:
- SQL first: writing and templating queries
- Flexible: various ways to extend it using UDFs and processors/post-processors
- Stream enrichment from DB sources or APIs to bring context data into real-time processing
-
Self-service:
- DS shouldn’t be managing data pipelines
- They only specify their query to get the job running, then monitor it with alerts and templates
-
Managing with GitOps
- Deployments can be specified and version-controlled with YAML (see the sketch below)
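As an illustration, a hedged sketch of what a version-controlled job spec could look like; the field names below are hypothetical, chosen to echo the config examples later in these notes, not Dagger's actual schema:

# Hypothetical Dagger job spec kept in Git; field names are illustrative
name: booking-geohash-features
team: pricing
flink:
  parallelism: 2
streams:
  - input_schema_table: booking_log_stream
    source_name: KAFKA_SOURCE
sql: |
  SELECT customer_id, COUNT(*) AS bookings_last_minute,
         TUMBLE_END(rowtime, INTERVAL '60' SECOND) AS window_timestamp
  FROM booking_log_stream
  GROUP BY customer_id, TUMBLE(rowtime, INTERVAL '60' SECOND)

With a spec like this, a merged PR is enough to deploy or update a job, and a rollback is just a revert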
-
SQL first
- SQL is not Turing complete, so a query always terminates; even if it fails, it won’t monopolize all the compute power in a data center
-
UDFs
User-defined functions can be used in SQL (like GeoHash)
SELECT
  data_location.longitude AS long,
  data_location.latitude AS lat,
  GeoHash(data_location.longitude, data_location.latitude, 6) AS geohashPickup,
  TUMBLE_END(rowtime, INTERVAL '60' SECOND) AS window_timestamp
FROM data_stream
GROUP BY
  TUMBLE(rowtime, INTERVAL '60' SECOND),
  data_location.longitude,
  data_location.latitude
-
Data masking
Encryption or hashing of sensitive data
SELECT event_timestamp, test_data FROM data_stream

Post-processor config:

{
  "internal_source": [{ "output_field": "*", "value": "*", "type": "sql" }],
  "transformers": [{
    "transformation_class": "io.odpf.dagger.functions.transformers.HashTransformer",
    "transformation_arguments": { "maskColumns": ["test_data.data1"] }
  }]
}
-
Hybrid data sources
Managing logic in two places is an issue if you’re doing both real-time streaming and batch processing of static and historical data
Dagger allows using data simultaneously from different sources
[{ "INPUT_SCHEMA_TABLE": "booking_log_stream", ... "SOURCE_DETAILS": [ {"SOURCE_TYPE": "BOUNDED", "SOURCE_NAME": "PARQUET_SOURCE"}, {"SOURCE_TYPE": "UNBOUNDED", "SOURCE_NAME": "KAFLA_SOURCE"} ] }]
With this, you can backfill from historical data using the same streaming logic, and join data across multiple streams
-
Stream enrichment
Use external data sources (service endpoints, object stores, caches) to enrich real-time events with context data
[{"internal_source": [{ "output_field": "booking_log", "type": "sql", "value": "*" }]}, **{"external_source": { "es": [{ "host": "127.0.0.1", "port": "9200",** "endpoint_pattern": "/customers/customer/%s", "endpoint_variables": "customer_id", "stream_timeout": "5000", "output_mapping": { "customer_profile": {"path": "$._source"} } }] }]
-
Stream inference
Attach predictions to the same input event, within the streaming infra (see the sketch after this list)
In detail:
- Ex: attach a pricing prediction that multiple consumers can use
- All these prediction logs now land in the warehouse, where everyone can use them
- Streaming becomes very self-service from a DS point of view
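A hedged sketch of what stream inference could look like in Dagger's SQL-first style; PredictPrice is a hypothetical UDF wrapping a model call, and the column names are illustrative, neither is named in the talk:

-- Hypothetical: attach a pricing prediction to every booking event
-- PredictPrice is an illustrative UDF, not an actual Dagger function
SELECT
  booking_id,
  PredictPrice(distance_km, pickup_geohash) AS predicted_price,
  event_timestamp
FROM booking_log_stream

Each consumer reads the enriched event, and the prediction logs land in the warehouse alongside other events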
-
Dagger adoption at Gojek
- 300+ Dagger jobs for feature engineering
- 50+ data scientists creating dagger jobs
- 10TB+ data processed every day
-
Dagger is part of our DataOps foundation, with an experience-first approach
Dagger has been open source for about a year
- Q&A
-
How does feature engineering relate to AutoML solutions?
Not tackling this right now, but it can be explored
-
Dagger vs Spark?
Under the hood, Dagger uses Flink, which can also run in batch mode
Dagger adds an abstraction layer for data scientists, giving them a more user-friendly experience
-
When an ML platform proposes something new, there’s always some friction during user adoption. How did you approach DS adoption of Dagger?
Make sure DS don’t need to learn a new framework
They don’t need to manage anything else
The onboarding process for DS to play around with their features is shorter
-