3 Comments
samdiago

Great article! I really enjoyed the visual approach used to explain on-disk storage concepts. The breakdown of storage formats, blocks, and file organization makes a complex topic much easier to understand, especially for data engineers and database professionals. The discussion around Parquet, ORC, and other storage mechanisms provides valuable insight into how modern analytics platforms optimize performance and storage efficiency. A highly informative read for anyone looking to deepen their understanding of data storage fundamentals and big data architectures.

Jason

While this is a nice explanation of row vs. columnar file formats and the concept of row groups, it doesn't really say much about partitioning. Nor does it give any recommendations on how choices like row group size and file size impact cost and efficiency, or how to optimize them. I really appreciate that you are willing to tackle these complex topics, but I am also hopeful this can go deeper than simple concept explanations (we have a lot of those already).

Vinoo Ganesh

Thanks for the feedback, Jason! I've been debating how shallow or deep to go on these topics and am still trying to figure out my target audience, so I really appreciate your thoughts.

Here's what I'll do. I'm going to make this Part 1 in a series on partitioning. In subsequent posts, I'm going to dive into these topics at a deeper, more hands-on level.

Thanks again and please keep the feedback coming!