Data Engineering - Portfolio

Jhon Lucas - Data Engineer

About me

Master's student in Computer Science, researching the application of NLP techniques to assist in translating sentences into Libras (Brazilian Sign Language), with a B.Sc. in Computer Science and a postgraduate specialization in Big Data and Machine Learning. I have worked in the technology field for over eight years and currently work as a Data Engineering Specialist at Spectral Finance (https://www.spectral.finance). I have experience as a data platform Chapter Lead, supporting data engineers with guidelines, templates/frameworks, and best practices, focusing on ETL processes, data integration between different sources, Data Lakes, Data Warehouses, Data Architecture, and Infrastructure. I have experience with Modern Data Stack technologies such as Apache Spark, Apache Hudi, Delta Lake, Airbyte, and Meltano, and with orchestration platforms like Apache Airflow, using them primarily on AWS. Enthusiastic about Machine Learning and Deep Learning. Co-founder of the Data Train Community and an AWS Community Builder.

To present my portfolio, I use the STAR method (Situation, Task, Action, Result), which I think is great for covering all the details of each project.

Projects

BI Platform - Retail

Situation: I was working on a project that involved defining business rules with the business team. We also needed to develop the data infrastructure necessary for the project's success. To accomplish these goals, we enlisted the help of a consulting company to guide us in defining the data modeling required for our Data Warehouse (DW).

Task: My primary responsibilities included cooperating with the business team to define the business rules and assisting with the deployment of services necessary for the project. Additionally, I needed to work with the consulting company to establish the data modeling for the DW, which included defining fact and dimension tables.

Action: To execute my tasks, I collaborated closely with the business team, translating their needs into concrete business rules. I assisted in deploying the services needed for the project's infrastructure, ensuring it would be capable of handling the requirements of our work. Furthermore, I actively participated in the discussions and design process with the consulting company to define the data modeling for our DW. This involved laying out our fact and dimension tables and applying dimensional modeling to the data warehouse.
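
To make the dimensional modeling concrete, here is a minimal, hypothetical PySpark sketch of deriving a dimension and a fact table from a raw sales extract; the table names, columns, and S3 paths are illustrative and not the actual retail schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-dw-modeling").getOrCreate()

# Raw sales extract landed in the staging area (illustrative path and schema).
raw_sales = spark.read.parquet("s3://example-staging/sales/")

# Dimension table: one row per customer, with a surrogate key.
dim_customer = (
    raw_sales
    .select("customer_id", "customer_name", "city", "state")
    .dropDuplicates(["customer_id"])
    .withColumn("customer_sk", F.monotonically_increasing_id())
)

# Fact table: one row per sale line, keeping only keys and measures.
fact_sales = (
    raw_sales
    .join(dim_customer.select("customer_id", "customer_sk"), "customer_id")
    .select("customer_sk", "product_id", "sale_date", "quantity", "net_amount")
)

dim_customer.write.mode("overwrite").parquet("s3://example-dw/dim_customer/")
fact_sales.write.mode("overwrite").parquet("s3://example-dw/fact_sales/")
```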

Result: Through our joint efforts, we successfully defined the business rules and established a robust data infrastructure for the project. By collaborating with the consulting company, we also effectively determined the data modeling necessary for our DW, which now includes well-structured fact and dimension tables. These achievements have not only led to the successful realization of the project's objectives but also enhanced our data management capabilities.

Tech Stack:

With this project, we were able to start making decisions based on data, initiating a data-driven culture within the company. This shift was significant; we no longer relied on gut feelings for decision-making. Additionally, it helped us understand how we could improve both our business strategies and marketing campaigns. This, in turn, led to increased revenue and, consequently, profit.

Algorithm to reduce Stockout - Retail

Situation: Our company was facing a high number of product stockouts across all branches.

Task: My task was to develop an algorithm that could suggest product transfers between branches with the aim of reducing stockouts. This task had to be completed based on business rules defined by the business area. Additionally, I was to create a consolidated data view on stockouts to be consumed in a Data Visualization tool for increased dynamism.

Action: To tackle the situation, I first carried out an exploratory analysis to understand the nuances of the stockouts across all branches. With a thorough understanding of the business rules, I developed an algorithm capable of suggesting product transfers between branches. The algorithm was designed to balance stocks and significantly reduce instances of stockouts. Simultaneously, I worked on the creation of a consolidated data view or a "Big Table", which incorporated all information regarding stockouts. This table was specifically structured for compatibility with our Data Visualization tool, which facilitated a dynamic representation of the data.
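
To illustrate the core idea (not the actual production rules), the sketch below balances stock with pandas: for each product, branches holding more than a target level donate units to branches below it. The column names and the target-level logic are hypothetical.

```python
import pandas as pd

def suggest_transfers(stock: pd.DataFrame) -> pd.DataFrame:
    """Suggest product transfers from surplus branches to branches in stockout.

    Expects columns: product_id, branch_id, on_hand, target_level (illustrative).
    """
    stock = stock.assign(gap=stock["on_hand"] - stock["target_level"])
    transfers = []
    for product_id, group in stock.groupby("product_id"):
        donors = group[group["gap"] > 0].sort_values("gap", ascending=False).to_dict("records")
        receivers = group[group["gap"] < 0].sort_values("gap").to_dict("records")
        for receiver in receivers:
            needed = -receiver["gap"]
            for donor in donors:
                if needed <= 0 or donor["gap"] <= 0:
                    continue
                qty = min(donor["gap"], needed)
                transfers.append({
                    "product_id": product_id,
                    "from_branch": donor["branch_id"],
                    "to_branch": receiver["branch_id"],
                    "quantity": qty,
                })
                donor["gap"] -= qty
                needed -= qty
    return pd.DataFrame(transfers)
```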

Result: As a result of implementing the algorithm, we experienced a noticeable decrease in product stockouts across all branches. This strategic reallocation of products based on data-driven decisions substantially improved our inventory management. Additionally, the creation of the consolidated data view facilitated easy consumption of stockout data in a more visually appealing and accessible format, enhancing our ability to quickly understand and react to stockout patterns.

Tech Stack:

RFM (recency, frequency, monetary) Analysis

Situation: Our company needed an effective approach to customer segmentation that would improve our marketing efficiency and sales outcomes.

Task: My task was to employ RFM (Recency, Frequency, Monetary Value) analysis for customer segmentation and to conduct cluster analysis on our customer database. This segmentation was intended to better target our marketing efforts and subsequently boost sales.

Action: I began by collecting and preparing all the necessary customer data for the RFM analysis. Based on the customers' most recent purchases (Recency), purchase frequency (Frequency), and the total money spent (Monetary Value), I scored each customer and assigned them to various segments. After completing the RFM segmentation, I further carried out cluster analysis to group customers with similar buying behaviors. These groups allowed us to tailor our marketing strategies to each segment's unique characteristics and needs.
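
A minimal sketch of the scoring and clustering steps, assuming a hypothetical orders table with customer_id, order_date, and amount columns; the quintile scoring and the number of clusters are illustrative choices, not the exact parameters used in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def rfm_segments(orders: pd.DataFrame, n_clusters: int = 4) -> pd.DataFrame:
    """Compute RFM scores per customer and group customers with K-Means."""
    snapshot = orders["order_date"].max() + pd.Timedelta(days=1)
    rfm = orders.groupby("customer_id").agg(
        recency=("order_date", lambda d: (snapshot - d.max()).days),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    # Quintile scores: 5 is best (recent, frequent, high spend).
    rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), 5, labels=[5, 4, 3, 2, 1]).astype(int)
    rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
    rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

    # Cluster on the standardized raw RFM values to group similar buying behaviors.
    features = StandardScaler().fit_transform(rfm[["recency", "frequency", "monetary"]])
    rfm["cluster"] = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(features)
    return rfm
```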

Result: The application of RFM segmentation and cluster analysis to our customer base led to notable improvements in our marketing efforts. Our marketing team was able to make more targeted advertisements, resulting in more effective campaigns and a significant increase in sales. The customer segmentation provided a clearer understanding of our customer base, enabling us to better serve their needs and boost our company's performance.

Tech Stack:

Data Lakehouse Platform - Marketplace

Situation: Our company sought to become a data-driven organization. We identified the need to build a data lakehouse platform on AWS, which would enable the business to access crucial data for decision-making and insights. The project involved handling more than 30 databases and over 300 tables, and storing more than 35 TB of data on S3. We also handled third-party data sources, such as REST APIs and GraphQL.

Task: My task was to contribute to the creation of this data lakehouse platform. This work involved a range of technologies, including Spark on EMR with Hudi as the table format, Athena, ECS to run Python scripts that retrieve data from REST APIs, Airbyte for additional data extraction from REST APIs, Airflow, S3, and Power BI on Azure for BI analysis.

Action: I began by setting up Spark on AWS EMR and choosing Hudi as our table format to handle our large and complex data ecosystem. I then configured Athena to enable interactive queries on the data. To gather data from REST APIs, I implemented Python scripts on ECS and used Airbyte. For orchestrating and scheduling our tasks, I incorporated Airflow into our tech stack. The collected data was stored in S3, and Power BI on Azure was used for BI analysis, providing a more digestible view of the data. I worked closely with my team to ensure that the process was seamless and that the data was accurately represented and easy to access.
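
As a simplified illustration of the orchestration piece, the sketch below shows how one ingestion job could be submitted as a Spark step to an existing EMR cluster from Airflow; the cluster ID, script path, and schedule are placeholders, and the real DAGs carry considerably more configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Illustrative Spark step: ingest one source table into the Hudi-based lakehouse.
SPARK_STEP = [{
    "Name": "ingest_orders_to_hudi",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-artifacts/jobs/ingest_orders.py"],
    },
}]

with DAG(
    dag_id="lakehouse_ingest_orders",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id="j-EXAMPLECLUSTER",  # placeholder EMR cluster ID
        steps=SPARK_STEP,
    )

    wait_for_step = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id="j-EXAMPLECLUSTER",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
    )

    add_step >> wait_for_step
```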

Result: The successful completion of the data lakehouse platform marked a significant milestone for our company. We were not only able to handle a vast amount of data effectively but also made it accessible for business insights and decision-making purposes. The platform has empowered our company to become a truly data-driven organization, as the readily available data now plays a critical role in our business strategies. As a result, we've seen improvements in our decision-making process, which has led to more targeted strategies and better business outcomes.

Tech Stack:

Data Ingestion Pipeline - From DynamoDB to Data Lakehouse

Situation: Our marketplace platform allows clients to rapidly publish their products via API and retrieve information about their products. However, our business team wanted to summarize the event data generated in DynamoDB to perform certain analyses.

Task: My task was to design and implement a pipeline that could extract data from DynamoDB and load it into our Data Lakehouse, making it accessible for analysis.

Action: To achieve this goal, I leveraged AWS Kinesis Streams to capture the real-time data from DynamoDB. I used Kinesis Analytics to process the data streams, extracting the useful insights that our business team needed. Then, I stored this data in S3, which acts as our Data Lakehouse. Finally, to make this data accessible and easily analyzable, I employed Athena, a service that enables users to perform SQL queries directly on data stored in S3.
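
The production pipeline relied on Kinesis Analytics for the processing step; purely as an illustration of the record flow, here is a hypothetical Lambda-style consumer that decodes the Kinesis records wrapping DynamoDB change events and lands the new item images as JSON in S3. The bucket name and key layout are placeholders.

```python
import base64
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-lakehouse-raw"  # placeholder bucket name

def handler(event, context):
    """Illustrative consumer for the DynamoDB -> Kinesis stream.

    Each Kinesis record wraps a DynamoDB change event; here we simply land the
    new item images as JSON lines in the raw zone of the lakehouse.
    """
    lines = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        new_image = payload.get("dynamodb", {}).get("NewImage")
        if new_image:
            lines.append(json.dumps(new_image))
    if lines:
        key = f"events/{event['Records'][0]['kinesis']['sequenceNumber']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))
```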

Result: The implementation of this pipeline significantly enhanced our data accessibility and usability. The business team can now summarize and analyze event data directly from DynamoDB, leading to more data-driven decisions and improved business strategies. By using real-time data processing tools, we ensured that the data is always up-to-date, providing an accurate representation of the current state of affairs. The application of Athena has also made data querying a lot simpler, allowing users to focus on deriving insights rather than data extraction.

 

Webhook to listen to CRM Events and insert them into Data Lake

Situation: Our Sales team needed real-time insights from CRM events to make informed decisions. However, we lacked a system that could listen to CRM events and provide these insights instantaneously.

Task: My task was to create a webhook to listen to events from the CRM, process these events, and store the data in our Data Lakehouse on S3. The tech stack for this project included the Serverless framework, AWS Lambda, Flask, and S3.

Action: I started by setting up a Flask application to create the webhook for listening to CRM events. The Serverless framework was used to enable the application's deployment and management without having to worry about the server infrastructure. When an event was detected, an AWS Lambda function was triggered to process the event data. The processed data was then stored in S3, our data lakehouse, making it accessible for further analysis and real-time insights.
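
A minimal sketch of the webhook itself, assuming a hypothetical endpoint path, bucket name, and key layout; in the real project the Flask app was deployed to AWS Lambda through the Serverless framework.

```python
import json
from datetime import datetime, timezone

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "example-crm-events"  # placeholder bucket name

@app.route("/crm/events", methods=["POST"])
def receive_crm_event():
    """Receive a CRM event and land the raw payload in the data lake."""
    event = request.get_json(force=True)
    now = datetime.now(timezone.utc)
    # Partition the raw zone by date so downstream jobs can read incrementally.
    key = f"crm/raw/{now:%Y/%m/%d}/{now.timestamp():.0f}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
    return jsonify({"status": "stored", "key": key}), 200
```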

Result: With the successful implementation of this project, the Sales team was able to get real-time insights from the CRM events. This has greatly improved their decision-making process, as they now have access to the latest data when they need it. The use of serverless technologies and AWS services has made this solution highly scalable and efficient, ensuring we can handle the growing needs of our Sales team.

Tech Stack:

Getting data from Twitter - parsing hashtags

Situation: As part of an academic project, I was tasked with gathering data from Twitter, focusing on specific hashtags. The goal was to gain insights about these hashtags and also to put my knowledge of Apache NiFi and StreamSets to the test.

Task: My task was to design and implement a data pipeline to extract tweets containing the target hashtags, process this data, and generate meaningful insights. The project would involve leveraging Apache NiFi and StreamSets.

Action: I initiated the project by setting up data extraction from Twitter using Apache NiFi. I carefully configured the tool to focus on tweets containing our specific hashtags of interest. Then, I used StreamSets to process the incoming data stream and prepare it for analysis. Given that this was an academic project, I spent considerable time exploring the functionalities and capabilities of both Apache NiFi and StreamSets, focusing on efficient data extraction as well as data processing.

Result: Despite being an academic exercise, the project offered significant learning experiences. I was able to successfully gather data from Twitter regarding specific hashtags and process it for insightful analysis. In addition, the project enabled me to gain practical experience and deepen my understanding of Apache NiFi and StreamSets. Ultimately, this project enhanced my skills in using these tools, setting a strong foundation for future data pipeline projects.

Apache NiFi

StreamSets

Word Cloud using Twitter Data

Design and Implement the Data Lakehouse

Situation: The company needed a more robust data platform. It relied solely on a NoSQL database for analytics, which was not enough to promote a data-driven culture or to facilitate data consumption by data scientists and analysts. The need to incorporate more data sources, including data from blockchains, lending protocols, and cryptocurrency platforms, while reducing costs, underscored the need for a comprehensive solution.

Task: My role was to design and implement a Data Lakehouse Platform to meet these requirements. This involved deciding on the open table format to be used, designing the architecture, and justifying the need for a data lakehouse to the team.

Action: To start, I conducted several proof-of-concept tests to decide on the open table format that best suited our needs. Following that, I designed the architecture of the Data Lakehouse platform, ensuring it could accommodate our current data sources and scale for future additions. I presented the design to the team, explaining the advantages of having a Data Lakehouse and how it would facilitate data access for data scientists and analysts. I then proceeded to implement the platform, ensuring the smooth integration of data from various sources, including blockchain, lending protocols, and cryptocurrency platforms.

Result: As a result of this project, we were able to establish a functional Data Lakehouse platform, enabling a data-driven culture within the company. The platform facilitated easy data access for data scientists and analysts, promoting more efficient and informed decision-making processes. Furthermore, we successfully incorporated data from various new sources while keeping costs low. The project significantly upgraded our data infrastructure, allowing for scalable and cost-effective data operations.

Data Ingestion Wallet Transactions

Situation: Our organization required comprehensive transaction data from the Ethereum blockchain for various analyses. The available API from Etherscan was not providing data at the speed we required, hence we needed an in-house solution. Additionally, we wanted to apply data modeling to identify different types of transactions and create a balance summary for different wallets and coins.

Task: My task was to design and implement a solution for extracting and processing Ethereum blockchain data using Python, AWS ECS, Athena, EMR Serverless, S3, and Airflow. The project also involved data modeling and the creation of several aggregated tables.

Action: I started by setting up a Python-based data extraction system on AWS ECS to gather Ethereum blockchain data. For data processing and analysis, I used Athena and EMR Serverless. The data was then stored in S3, which served as our data lake. I applied data modeling to categorize the transactions into normal, internal, and ERC-20 types. Furthermore, I created aggregated tables representing coin balances and wallet balances based on the processed data. I used Airflow to orchestrate and schedule the various data processing and ETL tasks.
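
As a simplified example of the aggregation step, the PySpark sketch below computes a wallet-by-token balance table from the categorized transactions; the column names and S3 paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wallet-balances").getOrCreate()

# Categorized transactions (normal, internal, ERC-20) already landed in the lakehouse.
tx = spark.read.parquet("s3://example-lakehouse/silver/transactions/")

# Outgoing amounts are debits, incoming amounts are credits, per wallet and token.
debits = tx.select(
    F.col("from_address").alias("wallet"),
    "token_symbol",
    (-F.col("amount")).alias("delta"),
)
credits = tx.select(
    F.col("to_address").alias("wallet"),
    "token_symbol",
    F.col("amount").alias("delta"),
)

wallet_balances = (
    debits.unionByName(credits)
    .groupBy("wallet", "token_symbol")
    .agg(F.sum("delta").alias("balance"))
)

wallet_balances.write.mode("overwrite").parquet("s3://example-lakehouse/gold/wallet_balances/")
```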

Result: This project led to significant improvements in the speed and accuracy of our data collection and analysis related to Ethereum blockchain transactions. By bypassing the limitations of the Etherscan API, we could get real-time data at the pace we required. The data modeling provided us with a detailed view of different transaction types, and the aggregated tables ensured we had quick access to wallet and coin balances. This greatly enhanced our ability to analyze and understand blockchain transactions, leading to more informed decision-making processes within our organization.

Complete Data Architecture

This project was the one in which we utilized our data lakehouse platform the most. As such, we had to adhere to numerous best practices to ensure low costs, high performance, and the availability of data for the data science team.

Best practices applied in the data lakehouse architecture:

This project also presented a great opportunity to learn more about blockchain data, because I had to deploy an Ethereum node. To do this, I tested different clients to determine the best fit for our needs; we chose Erigon because we needed an archive node, as all the trace data was necessary for us. Currently, we use it just to get internal transactions, but the future goal is also to decode and identify the events of each transaction (e.g., a Transfer event from AAVE or Compound).
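
To give a sense of why the archive node matters, here is a minimal sketch of pulling internal transactions (call traces) from an Erigon node over JSON-RPC using the trace_block method; the node URL is a placeholder.

```python
import requests

NODE_URL = "http://localhost:8545"  # placeholder Erigon RPC endpoint

def internal_transactions(block_number: int) -> list[dict]:
    """Return call traces (internal transactions) for a block via Erigon's trace API."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "trace_block",
        "params": [hex(block_number)],
    }
    response = requests.post(NODE_URL, json=payload, timeout=30)
    response.raise_for_status()
    traces = response.json().get("result", [])
    # Keep only call traces, which represent internal (contract-level) value transfers.
    return [t for t in traces if t.get("type") == "call"]

if __name__ == "__main__":
    print(len(internal_transactions(17_000_000)))
```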

Data Ingestion Polygon Subgraph (The Graph) - DeFi Events

Situation: We needed to fetch data from The Graph, specifically DeFi data from the Polygon subgraph. The data of interest were events generated by lending protocols such as Aave, Compound, and Avalanche.

Task: My task was to set up a reliable data extraction and processing pipeline to get this specific DeFi data from the Polygon subgraph on The Graph.

Action: I started the project by configuring the extraction parameters to focus on the DeFi events of interest from lending protocols like Aave, Compound, and Avalanche. I utilized appropriate tools and scripts to ensure the extraction was accurate and efficient. The data was then processed and transformed as per our analytical needs.
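
A minimal sketch of the extraction step, assuming a hypothetical subgraph endpoint and entity name; the real queries target the specific event entities exposed by each lending protocol's subgraph.

```python
import requests

SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/example/defi-events"  # placeholder
QUERY = """
query ($first: Int!, $skip: Int!) {
  borrows(first: $first, skip: $skip, orderBy: timestamp, orderDirection: asc) {
    id
    amount
    timestamp
    user { id }
  }
}
"""

def fetch_borrow_events(page_size: int = 1000) -> list[dict]:
    """Page through borrow events exposed by a (hypothetical) lending-protocol subgraph."""
    events, skip = [], 0
    while True:
        response = requests.post(
            SUBGRAPH_URL,
            json={"query": QUERY, "variables": {"first": page_size, "skip": skip}},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json()["data"]["borrows"]
        events.extend(batch)
        if len(batch) < page_size:
            return events
        skip += page_size
```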

Result: Through the successful implementation of this project, we were able to effectively fetch and process relevant DeFi data from the Polygon subgraph. This greatly enriched our data resources and enabled us to conduct more detailed and specific analyses on the events of the various lending protocols. The insights gathered from these analyses significantly contributed to our understanding of the DeFi landscape and facilitated more informed decision-making.

Getting Unique EOA Address from DeFi Events

Situation: We had a project that required obtaining unique EOA (Externally Owned Account) addresses based on all DeFi events from AAVE V1, V2, and Compound. The aim was to construct a table containing the wallet address, the timestamps of the first and last transactions, and the last protocol the wallet interacted with, restricted to one row per wallet. This task was particularly challenging because it required upserts in our data lakehouse, a functionality that is typically not available.

Task: My task was to design and implement a data extraction, processing, and storage pipeline to fulfill this unique requirement. The solution had to utilize Apache Hudi as an open table format, given its ability to handle upserts effectively in a data lakehouse context.

Action: I began by setting up the data extraction mechanism to fetch the specific DeFi events from AAVE V1, V2, and Compound. The extracted data were processed to derive unique EOA addresses, the timestamp of the first and last transactions, and the last protocol each wallet interacted with. I then used Apache Hudi to store this data in the data lakehouse, leveraging its upsert functionality to manage data updates and insertions effectively.
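
A condensed sketch of the aggregation and the Hudi upsert that keeps one row per wallet; the paths and column names are hypothetical, the Spark session is assumed to be configured with the Hudi bundle (as on EMR), and the option set is trimmed to the essentials.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("unique-eoa-upsert").getOrCreate()

# DeFi events already extracted from AAVE V1/V2 and Compound (illustrative path/columns).
events = spark.read.parquet("s3://example-lakehouse/silver/defi_events/")

# Last protocol each wallet interacted with, taken from its most recent event.
w = Window.partitionBy("wallet_address").orderBy(F.col("block_timestamp").desc())
last_protocol = (
    events.withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .select("wallet_address", F.col("protocol").alias("last_protocol"))
)

wallets = (
    events.groupBy("wallet_address")
    .agg(
        F.min("block_timestamp").alias("first_tx_at"),
        F.max("block_timestamp").alias("last_tx_at"),
    )
    .join(last_protocol, "wallet_address")
)

# Hudi upsert: the record key guarantees one row per wallet across incremental loads.
hudi_options = {
    "hoodie.table.name": "unique_eoa_wallets",
    "hoodie.datasource.write.recordkey.field": "wallet_address",
    "hoodie.datasource.write.precombine.field": "last_tx_at",
    "hoodie.datasource.write.operation": "upsert",
}

(wallets.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lakehouse/gold/unique_eoa_wallets/"))
```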

Result: The successful completion of this project led to the creation of a table that served our specific needs, enabling us to track unique EOA addresses based on DeFi events efficiently. Using Apache Hudi proved vital for handling upserts in our data lakehouse, thereby making it possible to update and insert data in an efficient and practical manner. This project enriched our data resources and boosted our ability to conduct detailed analyses of DeFi events and wallet interactions.

Data Pipeline:

Solutions Architecture:

Deploy MWAA in a private network

Situation: We had an issue with the current configuration of our Managed Workflows for Apache Airflow (MWAA) provided by AWS. It was set up using less-than-optimal networking practices, leading to high data traffic costs and security concerns. It was accessible outside a private network, which was not in line with our security guidelines.

Task: My task was to reconfigure the MWAA to operate within a private network accessible only through a VPN, aiming to increase security and reduce costs associated with data traffic.

Action: I started the project by assessing the current MWAA setup and identifying the key areas causing high data traffic costs and potential security vulnerabilities. The main issue was the use of a NAT gateway which was contributing to the high costs and wasn't necessary for our use case. I then reconfigured the MWAA to operate within a private network and replaced the NAT gateway with an Internet Gateway, which was adequate for our needs.

Result: This project led to a significant reduction in data traffic costs by nearly 90%. We increased the security of our MWAA by ensuring it was only accessible through a VPN, reducing the risk of unauthorized access. The implementation of the Internet Gateway instead of the NAT gateway served our needs adequately, proving that more complex and expensive solutions are not always better. This project demonstrated the importance of optimal configuration in managing costs and maintaining high security levels.

MWAA Architecture

Observability Alerts for Lambdas

Situation: We had more than 20 AWS Lambda functions operating as part of our infrastructure. However, we lacked a system to notify us on Slack whenever any of these Lambda functions failed, creating a gap in our monitoring and observability practices.

Task: My task was to create an observability stack that would alert us on Slack whenever a Lambda function failed. This new system would increase our visibility into system operations and allow us to respond more quickly to failures.

Action: I started the project by understanding the existing Lambda functions and identifying the data necessary for useful alerts. Then, I utilized Python to script a solution that would track Lambda function execution and trigger an alert message to be sent to Slack if any function failed. This message included details about the function and the nature of the failure to assist with troubleshooting.
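
One common way to wire this up, shown here as a sketch rather than the exact implementation: a CloudWatch alarm on each function's error metric publishes to SNS, and a small Python Lambda forwards the notification to a Slack incoming webhook. The webhook URL is a placeholder, and the real alert message includes more context about the failure.

```python
import json
import os
import urllib.request

# Placeholder: the real value comes from an environment variable or secret store.
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "https://hooks.slack.com/services/XXX")

def handler(event, context):
    """Forward CloudWatch alarm notifications (delivered via SNS) to Slack."""
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: Lambda alarm *{alarm.get('AlarmName')}* is "
            f"{alarm.get('NewStateValue')}: {alarm.get('NewStateReason')}"
        )
        body = json.dumps({"text": text}).encode("utf-8")
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)
```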

Result: The successful implementation of this project resulted in significantly improved monitoring of our Lambda functions. The real-time Slack alerts allowed our team to immediately respond to any function failures, reducing potential downtime and improving system reliability. This observability stack has become a crucial part of our infrastructure management, improving our ability to maintain consistent service quality.

Workflow

More info

More about the technologies I use as a Data Engineer:

Conclusion

Here's a bit about the projects I've worked on throughout my career as a Data Engineer. I'm always striving to generate business value using various technologies. Ultimately, what matters is helping the business grow and achieve its goals, regardless of the technology we use. If we are achieving that goal, that's what should be prioritized in every project, especially in data projects, because data can significantly impact the business.

My LinkedIn profile: https://www.linkedin.com/in/jhon-lucas/

My GitHub: https://github.com/l-jhon

Resources

https://docs.aws.amazon.com/athena/latest/ug/what-is.html

https://airflow.apache.org/

https://www.reddit.com/r/ethstaker/comments/pdpj62/for_anyone_who_hasnt_checked_out_erigon_yet_i/

https://airbyte.com/blog/data-lake-lakehouse-guide-powered-by-table-formats-delta-lake-iceberg-hudi