Performance on Data Trenches

PostgreSQL 17 Beta: B-Tree just got promoted to Index CEO

leandrojlfernandes@gmail.com (Leandro Fernandes) — Mon, 10 Jun 2024 00:00:00 +0000

The PostgreSQL 17 beta release brought a number of interesting new features: shorter vacuum execution time, lower memory consumption, faster ANALYZE, and more. The one that caught my eye right away, and that most database administrators and developers will appreciate, is the improvement to the B-Tree index when using IN or ANY clauses. I’ve read about improvements ranging from 10% to 30% without any change to your database or table structure, so I wanted to test it on one of my production use cases.
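
One way to check a claim like this against your own workload is to run the same IN-list query under EXPLAIN on both versions and compare the reported execution times. The sketch below only builds the SQL text and computes the relative gain. The table `orders`, the column `customer_id`, the helper names, and the timings are hypothetical placeholders, not figures from the post.

```python
def build_explain_in_query(table, column, values):
    """Return an EXPLAIN (ANALYZE, BUFFERS) statement for an IN-list lookup.

    Values are coerced to integers so they can be inlined without quoting.
    """
    if not values:
        raise ValueError("values must not be empty")
    in_list = ", ".join(str(int(v)) for v in values)
    return (
        f"EXPLAIN (ANALYZE, BUFFERS) "
        f"SELECT * FROM {table} WHERE {column} IN ({in_list});"
    )


def percent_improvement(before_ms, after_ms):
    """Relative reduction in execution time, as a percentage."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100.0


# Hypothetical table, column, and timings, for illustration only.
query = build_explain_in_query("orders", "customer_id", [3, 17, 42])
gain = percent_improvement(before_ms=120.0, after_ms=96.0)
```

Running the generated statement on PostgreSQL 16 and on the 17 beta, then feeding both execution times to `percent_improvement`, gives a figure that can be compared with the 10% to 30% range mentioned above.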

PostgreSQL: A deep dive of Prepared Statements

leandrojlfernandes@gmail.com (Leandro Fernandes) — Mon, 27 May 2024 00:00:00 +0000

In a PostgreSQL database, preparing a statement involves parsing, analyzing, and rewriting that specific statement. The result is stored in memory (commonly called statement caching). When a previously prepared statement is executed, PostgreSQL can skip the parsing, analysis, and rewriting steps and use the precompiled version instead. This can significantly improve database performance, especially for queries that are executed frequently or have complex plans. The more often you execute a query, the more likely you are to see significant improvements.
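
At the SQL level, this flow corresponds to PostgreSQL's PREPARE and EXECUTE commands. As a minimal sketch, the helpers below just assemble those statements as text. The statement name `get_user`, the `users` table, and the helper functions themselves are hypothetical, and in practice a database driver would usually manage prepared statements for you.

```python
def prepare_sql(name, param_types, query):
    """Build a PREPARE statement; parameters appear in `query` as $1, $2, ..."""
    types = f" ({', '.join(param_types)})" if param_types else ""
    return f"PREPARE {name}{types} AS {query};"


def execute_sql(name, args):
    """Build an EXECUTE statement (integer arguments only, to avoid quoting)."""
    if not args:
        return f"EXECUTE {name};"
    rendered = ", ".join(str(int(a)) for a in args)
    return f"EXECUTE {name}({rendered});"


# Parsed, analyzed, and rewritten once...
prepare_stmt = prepare_sql(
    "get_user", ["integer"], "SELECT * FROM users WHERE id = $1"
)
# ...then executed repeatedly with different parameter values.
execute_stmts = [execute_sql("get_user", [user_id]) for user_id in (1, 2, 3)]
```

The single PREPARE followed by many EXECUTE calls is where the savings come from: the parsing work is paid once and reused on every later execution.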

PySpark Infrastructure Optimization

leandrojlfernandes@gmail.com (Leandro Fernandes) — Sat, 17 Feb 2024 00:00:00 +0000

The Challenge

Handling massive-scale data processing while maintaining reasonable query latency and managing compute resource costs in a distributed environment.

The Solution

Architected distributed processing jobs using PySpark with multiple optimization strategies:

Algorithmic improvements to reduce computational complexity
Storage optimization using Trino and Hive
Query execution plan optimization
Resource allocation tuning
Data partitioning strategies
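
To make the resource allocation and partitioning items a little more concrete, here is a minimal sketch of how such settings are often passed to spark-submit. The configuration keys are standard Spark properties, but the values and the `to_submit_args` helper are illustrative assumptions, not the settings used in this project.

```python
# Hypothetical tuning values; suitable numbers depend on cluster and data size.
spark_conf = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "400",
    "spark.sql.adaptive.enabled": "true",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit arguments (--conf key=value)."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


submit_args = to_submit_args(spark_conf)
```

Keeping the settings in one dictionary makes it easier to change a single value, rerun the job, and compare latency and resource usage between runs.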

Technologies Used

PySpark
Apache Hadoop
Trino
Hive
Distributed Systems

Impact

25% reduction in query latency
25% decrease in resource consumption
Improved processing efficiency for massive datasets
Significant cost savings on compute resources

This optimization effort required a deep understanding of distributed systems, Spark internals, and data storage patterns to achieve measurable performance gains.