<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Big Data on Data Trenches</title><link>https://data-trenches.leandrof.space/tags/big-data/</link><description>Recent content in Big Data on Data Trenches</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>leandrojlfernandes@gmail.com (Leandro Fernandes)</managingEditor><webMaster>leandrojlfernandes@gmail.com (Leandro Fernandes)</webMaster><lastBuildDate>Sat, 17 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://data-trenches.leandrof.space/tags/big-data/index.xml" rel="self" type="application/rss+xml"/><item><title>PySpark Infrastructure Optimization</title><link>https://data-trenches.leandrof.space/projects/pyspark-optimization/</link><pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate><author>leandrojlfernandes@gmail.com (Leandro Fernandes)</author><guid>https://data-trenches.leandrof.space/projects/pyspark-optimization/</guid><description>&lt;h2 id="the-challenge">The Challenge&lt;/h2>
&lt;p>The jobs had to process massive datasets in a distributed environment while keeping query latency reasonable and compute costs under control.&lt;/p>
&lt;h2 id="the-solution">The Solution&lt;/h2>
&lt;p>Architected distributed processing jobs in PySpark, combining several optimization strategies:&lt;/p>
&lt;ul>
&lt;li>Algorithmic improvements to reduce computational complexity&lt;/li>
&lt;li>Storage optimization using Trino and Hive&lt;/li>
&lt;li>Query execution plan optimization&lt;/li>
&lt;li>Resource allocation tuning&lt;/li>
&lt;li>Data partitioning strategies&lt;/li>
&lt;/ul>
&lt;h3 id="technologies-used">Technologies Used&lt;/h3>
&lt;ul>
&lt;li>PySpark&lt;/li>
&lt;li>Apache Hadoop&lt;/li>
&lt;li>Trino&lt;/li>
&lt;li>Hive&lt;/li>
&lt;li>Distributed Systems&lt;/li>
&lt;/ul>
&lt;h2 id="impact">Impact&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>25% reduction&lt;/strong> in query latency&lt;/li>
&lt;li>&lt;strong>25% decrease&lt;/strong> in resource consumption&lt;/li>
&lt;li>Improved processing efficiency for massive datasets&lt;/li>
&lt;li>Significant cost savings on compute resources&lt;/li>
&lt;/ul>
&lt;p>Achieving these measurable gains required a deep understanding of distributed systems, Spark internals, and data storage patterns.&lt;/p></description></item></channel></rss>