Alvin Endratno for Alvin Endratno's Blog (alvinend.hashnode.net) · Sep 9, 2022 · Setup Jupyter in EC2 and Apache Spark with Delta Lake connection to S3 · Delta lake has been booming for the last two years after Databricks announce it as "New Generation Data Lakehouse," but behind the boom, there are not enough examples and posts of it. I want to change it by adding one article about it. This time we w... · 2 likes · 203 reads · Delta Lake, lakehouse
Satish Sutar for Cloud, DevOps and Open Source (satishsutar-cloud-and-devops.hashnode.net) · Sep 17, 2022 · Storages types · Choose the best storage type based on your requirements between object storage and block storage. Object storage: Object storage is a data storage architecture for storing unstructured data that sections data into units—objects—and stores them in a s... · 1 like · 56 reads · storage
Alvin Endratno for Alvin Endratno's Blog (alvinend.hashnode.net) · Sep 22, 2022 · Using SQL to Query Data with Delta Lake · Last time, we set up Jupyter in EC2 and Apache Spark with Delta Lake connection to S3. We will import data from the dataset and query it with SQL this time. About Dataset: For this experiment, we will use a dataset about courses, students, and their i... · 68 reads · Delta Lake, big data
PWC for Failures Are Inevitable (pwc.hashnode.net) · Mar 19, 2023 · Data Engineering Cake: Layers of a Data Management Platform · Stephanie Astono Salim, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons. Introduction: I write this article from a backend software developer (APIs & microservices). I found that in many domains, some sort of data pl... · data-engineering
Tobias Müller for tobilg.com (serverlessly.hashnode.net) · Feb 26, 2023 · Using DuckDB to repartition parquet data in S3 · Since release v0.7.1, DuckDB has the ability to repartition data stored in S3 as parquet files by a simple SQL query, which enables some interesting use cases. Why not use existing AWS services? If your data lake lives in AWS, a natural choice for ET... · 1.3K reads · duckDB
Sneh Bhatt for mytwocents.hashnode.net · Feb 24, 2023 · AWS concepts and ideas - ENABLING CONCURRENT WRITES ON S3 DATA LAKE · Abstract: Amazon S3 is an object store that provides scalability to store any amount of data, and customers leverage S3 to build a data lake. Being an object store, S3 has limitations when it comes to managing concurrent writes on the same data (think... · 36 reads · AWS concepts and ideas, Amazon S3
Jonathan Reis for Jonathan Reis's blog (jreissup.hashnode.net) · Feb 23, 2023 · Implementing a Data Lakehouse Architecture in AWS — Part 3 of 4 · Introduction: In our previous article, part 2 of the series, we walked through the extraction, processing, and creation of some data mart, using the New York City taxi trip data which is publicly available to do consumption. We used some of the princi... · Exploring the Data Lakehouse and Its Implementation in AWS, Data-lake
Jonathan Reis for Jonathan Reis's blog (jreissup.hashnode.net) · Feb 23, 2023 · Implementing a Data Lakehouse Architecture in AWS — Part 2 of 4 · Introduction: In part 1 of this article series, we walked through how to feed a Data Lake built on top of Amazon S3, based on streaming data, using Amazon Kinesis. In part 2, we will cover all of the steps needed to build a Data Lakehouse, using trip ... · Exploring the Data Lakehouse and Its Implementation in AWS, Data-lake
Sujal Maiti for Gyaan (sujal.hashnode.net) · Feb 7, 2023 · "Art of Managing & Working around Data: DataLake" · What is Data Lake? A centralised storage system called a "Data Lake" is used to store all the unprocessed data that is ingested from various sources. It can scale up to accommodate storing all of the enterprise's data. It can keep data of different t... · data-engineering
Mike Kenneth Houngbadji for Mike's Blog (mikekenneth.hashnode.net) · Feb 4, 2023 · Building a Data Lakehouse for Analyzing Elon Musk Tweets using MinIO, Apache Airflow, Apache Drill and Apache Superset · "Every act of conscious learning requires the willingness to suffer an injury to one's self-esteem. That is why young children, before they are aware of their own self-importance, learn so easily." (Thomas Szasz) Motivation: A Data Lakehouse is a modern d... · apache-airflow
Anuj Syal for Anuj Syal's Blog (anujsyal.hashnode.net) · Dec 24, 2022 · Data Engineering Explained · When we scroll through these sites in hopes to find something we need to buy (say, a shirt), we add it to the cart, or we just let it be saved for later. Within a few moments, you begin to see advertisements of the same or similar-looking shirts whil... · 76 reads · Data Science
Harsh Daiya for Harsh Daiya's Blog (hd.hashnode.net) · Dec 22, 2022 · Data Lake on AWS · A data lake is a central repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and v... · Data-lake