<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Performance on Data Trenches</title><link>https://data-trenches.leandrof.space/tags/performance/</link><description>Recent content in Performance on Data Trenches</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>leandrojlfernandes@gmail.com (Leandro Fernandes)</managingEditor><webMaster>leandrojlfernandes@gmail.com (Leandro Fernandes)</webMaster><lastBuildDate>Mon, 10 Jun 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://data-trenches.leandrof.space/tags/performance/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL 17 Beta: B-Tree just got promoted to Index CEO</title><link>https://data-trenches.leandrof.space/posts/postgresql-17-btree/</link><pubDate>Mon, 10 Jun 2024 00:00:00 +0000</pubDate><author>leandrojlfernandes@gmail.com (Leandro Fernandes)</author><guid>https://data-trenches.leandrof.space/posts/postgresql-17-btree/</guid><description>&lt;p>The release of &lt;a href="https://www.postgresql.org/about/news/postgresql-17-beta-1-released-2865/">PostgreSQL 17 beta&lt;/a> brought a bunch of interesting new features: improvements to vacuum execution time and memory consumption, a faster ANALYZE, and so on. But the one that most databases and developers will appreciate, and that caught my eye right off the bat, is the set of improvements to the B-Tree index when using IN or ANY clauses.
I&amp;rsquo;ve read about improvements ranging from 10% to 30% without any change to your database or table structure, so I wanted to test it on one of my production use cases.&lt;/p></description></item><item><title>PostgreSQL: A deep dive of Prepared Statements</title><link>https://data-trenches.leandrof.space/posts/postgresql-prepared-statements/</link><pubDate>Mon, 27 May 2024 00:00:00 +0000</pubDate><author>leandrojlfernandes@gmail.com (Leandro Fernandes)</author><guid>https://data-trenches.leandrof.space/posts/postgresql-prepared-statements/</guid><description>&lt;p>In a PostgreSQL database, preparing a statement involves parsing, analyzing and rewriting that specific statement. The result is compiled and stored in memory (usually called statement caching). When a previously prepared statement is executed, PostgreSQL can skip the parsing, analyzing and rewriting steps and use the precompiled version instead. This can significantly improve database performance, especially for queries that are executed frequently or have complex plans. The more often you execute a query, the more likely you are to see significant improvements.&lt;/p></description></item><item><title>PySpark Infrastructure Optimization</title><link>https://data-trenches.leandrof.space/projects/pyspark-optimization/</link><pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate><author>leandrojlfernandes@gmail.com (Leandro Fernandes)</author><guid>https://data-trenches.leandrof.space/projects/pyspark-optimization/</guid><description>&lt;h2 id="the-challenge">The Challenge&lt;/h2>
&lt;p>Handling massive-scale data processing while maintaining reasonable query latency and managing compute resource costs in a distributed environment.&lt;/p>
&lt;h2 id="the-solution">The Solution&lt;/h2>
&lt;p>Architected distributed processing jobs using PySpark with multiple optimization strategies:&lt;/p>
&lt;ul>
&lt;li>Algorithmic improvements to reduce computational complexity&lt;/li>
&lt;li>Storage optimization using Trino and Hive&lt;/li>
&lt;li>Query execution plan optimization&lt;/li>
&lt;li>Resource allocation tuning&lt;/li>
&lt;li>Data partitioning strategies&lt;/li>
&lt;/ul>
&lt;h3 id="technologies-used">Technologies Used&lt;/h3>
&lt;ul>
&lt;li>PySpark&lt;/li>
&lt;li>Apache Hadoop&lt;/li>
&lt;li>Trino&lt;/li>
&lt;li>Hive&lt;/li>
&lt;li>Distributed Systems&lt;/li>
&lt;/ul>
&lt;h2 id="impact">Impact&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>25% reduction&lt;/strong> in query latency&lt;/li>
&lt;li>&lt;strong>25% decrease&lt;/strong> in resource consumption&lt;/li>
&lt;li>Improved processing efficiency for massive datasets&lt;/li>
&lt;li>Significant cost savings on compute resources&lt;/li>
&lt;/ul>
&lt;p>This optimization effort required a deep understanding of distributed systems, Spark internals, and data storage patterns to achieve measurable performance gains.&lt;/p></description></item></channel></rss>