Learn Python the Hard Way, 5th Edition (2023-2024): 51: What is Data Munging?

At this point in the course you know Python. You may not feel confident in Python, and you don't know all of Python, but neither do most of the people using Python. There are supposed professionals who don't even know you can use dis() to study Python bytecode. These "professionals" also have no idea that Python has bytecode at all. Given that you know how to analyze the bytes Python uses to process your code, I'd say you could be more knowledgeable than many Python programmers working today.
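As a quick reminder of what that looks like, here's a minimal sketch of studying bytecode with the standard library's dis module; the add() function is just a throwaway example:

```python
import dis

def add(a, b):
    return a + b

# dis.dis prints the bytecode instructions Python actually executes
# when it runs this function.
dis.dis(add)
```

Running it shows the load, add, and return instructions that make up the function, though the exact opcodes vary between Python versions.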

Does that mean you're good at Python? No, not at all. Memorizing arbitrary facts about programming languages does not make you capable with that language. To become a capable programmer you have to combine your understanding of how Python works with actually using it to build software. Programming is a creative practice similar to Music, Writing, and Painting. You can memorize every note on the fretboard but if you can't actually play those notes you don't know how to play guitar. You can memorize every rule of English Grammar but if you can't actually write a compelling story or essay then you can't write. You can memorize every quality of every pigment but if you can't use those pigments to paint a portrait then you can't paint.

The goal of the final module is to take you from "I know about Python" to "I can create software with Python." I'm going to teach you how to convert the ideas in your head into working software, but I must warn you, this process is very frustrating. Many beginners find it difficult to even express their ideas well, let alone well enough to create software. The way you become better at expressing your ideas in code is through experience. You simply have to do it over and over again until it's easy. That's why learning is so frustrating: it feels like you're making no progress until suddenly you are.

To accomplish this goal I'm going to present three things to you in the next 6 exercises:

  1. An abstract or poorly defined challenge to solve. Don't take these challenges as attempts to trick you like a bad job interview. I'll tell you any secrets I think you need. Take the challenges as being "loose" so you have freedom to find your own solution. Since I don't give you an exact problem, I don't expect any specific solution.
  2. A new advanced Python concept to incorporate into your solution. I suggest creating a first version of your solution any way you can, and then doing a new version that uses the new Python concept.
  3. More technologies to explore that might make the problem easier, or are related to the topic. Being able to explore new technologies is important as a programmer, but it's also half the fun sometimes.

In this first exercise I'm going to also describe a process for taking your ideas and turning them into code. It's important you read this process carefully and use it until you feel confident in your own skills. After you're comfortable with the process you can modify it to suit how you work, or experiment with new ways to turn your ideas into code.


Page 2

The two most popular uses of Python are Data Science and Web Scraping, because web scraping typically feeds data to your data science pipeline. If you have an application that needs beer sales data, then scraping it off the TTB (ttb.gov) website is probably your only solution. If you need to train a GPT model on text, then scraping it off various forum websites is a good option. The web has so much data available; it's just in unfriendly visual formats.

Web scraping is also a great beginner topic for many of the same reasons as data munging:

  1. It's something everyone understands because they use browsers all day long. Most people have some concept of what a web page is.
  2. Web scraping doesn't require a ton of theory or computer science knowledge. You just need a way to get a web page and parse the raw HTML for what you want.
  3. It's easy to manually download a page you want to study and then work on it for as long as you want.
  4. Just like data munging, the code is almost never "elegant." You're free to create the worst hacks possible to get it working and then refine later.
  5. It's also a very important part of many data science projects. Data science needs data. The web has a ton of data.
  6. Web scraping also leads you to automated testing of web applications, so learning it is a double education.
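
To give you a taste of point #2, here's a minimal sketch of parsing raw HTML with nothing but the standard library's html.parser; the page content is invented for the demo:

```python
from html.parser import HTMLParser

# A tiny page like one you might save to disk for study.
PAGE = """
<html><body>
  <h1>Beer Production</h1>
  <table>
    <tr><td>2021</td><td>1000</td></tr>
    <tr><td>2022</td><td>1200</td></tr>
  </table>
</body></html>
"""

class CellGrabber(HTMLParser):
    """Collects the text inside every <td> tag."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data.strip())

grabber = CellGrabber()
grabber.feed(PAGE)
print(grabber.cells)  # the four table cells, in document order
```

Real pages are messier than this, which is exactly why point #4 applies: hack it until it works, then refine.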

Page 3

In this exercise you'll access the Application Programmer Interface (API) I use for my learncodethehardway.com website. In web development an API is usually a combination of the following:

  1. A web server that you can access with the HTTP protocol. You used HTTP when you used urllib to get the beer production PDFs from the ttb.gov website. HTTP is also what your browser uses to display the web application to you.
  2. This web server responds in some data format that's easily parsed. This is what differentiates a PDF from an API. Sure, you're getting data on beer production from ttb.gov but you have to parse that data out of a PDF. An API gives you the data ready to go in a format that loads directly into your application with no manual parsing.
  3. A higher level API will provide features to discover how the API works automatically. This is a more advanced feature of APIs but many of them will have an initial URL that describes the API, and then each piece of data will describe what's allowed and link to related elements. There is no official standard on how this is done, but if it's available it's nice to have.

I use an API in my web application that conforms to #1 and #2, but only partially to #3 since I don't actually care if other people can dynamically figure out how to use my API. #3 on this list is not common practice for private APIs, since those are made for a specific application written by the API owners, while public APIs are intended for anyone to use and discover. I chose my private API because private APIs are often the most useful, since other people are too lazy to reverse engineer them.

Please do not download the raw video or HTML files for the course, since that will most definitely crush my little web server. It also violates the TOS. You are only allowed to access the public (unauthenticated) JSON API described here.
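
To avoid hammering anyone's server while illustrating #2, here's a sketch that parses a canned JSON response exactly the way you'd parse one fetched over HTTP; the response body and field names are invented for the demo:

```python
import json

# A canned response standing in for what a JSON API might return.
# A real fetch would use urllib, the same tool you used for the
# ttb.gov PDFs, e.g.:
#   data = json.load(urllib.request.urlopen(some_api_url))
response_body = '{"courses": [{"id": 1, "title": "Python"}]}'

# No manual parsing: json.loads turns the response directly into
# Python dicts and lists, which is what makes an API an API.
data = json.loads(response_body)
for course in data["courses"]:
    print(course["id"], course["title"])
```

Compare this to scraping a PDF for the same information and you'll see why APIs are so much nicer to work with.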


Page 4

We'll now explore Pandas which is the main way data scientists work with data. It's also a useful project outside of data science so it's worth using no matter what your future holds. The main thing Pandas provides is data conversion and a DataFrame structure which is used by many statistics and mathematics applications. We'll explore the concept of a DataFrame in the later exercises.

In this exercise you'll use Pandas to take the CSV file you created and output various formats for your bosses. You'll also use a tool called pandoc to generate this report. You don't have to use pandoc, but it is an insanely useful tool for producing reports in various formats.
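
As a rough sketch of the conversion side (the column names here are invented, not from your actual CSV), Pandas can read a CSV and emit other formats in a couple of lines:

```python
import io
import pandas as pd

# Stand-in data for the CSV file you created earlier.
csv_text = """year,barrels
2021,1000
2022,1200
"""

# read_csv accepts any file-like object, so io.StringIO works
# for a quick demo; normally you'd pass a filename.
df = pd.read_csv(io.StringIO(csv_text))

# Pandas can emit many of the formats your bosses might ask for.
as_json = df.to_json(orient="records")
as_html = df.to_html(index=False)

print(as_json)
```

From there a tool like pandoc can take the text output the rest of the way to PDF or ePub.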

Introducing Pandoc

Pandoc's job is to take one text format and convert it to another format. It can take markdown files and convert them to HTML, PDF, ePub, and many other formats. This means you can write reports in nice easy to write markdown and then convert them to any format that's required for the job. Need to submit a LaTeX file to a journal? Pandoc. Need to submit HTML to the web server team? Pandoc.


Page 5

This exercise is going to teach two very important skills. First, you'll learn about Pandas and its DataFrame construct. This is the most common way to work with data in the Python Data Science world. Second, you're going to learn how to read typical programmer documentation. This is a far more useful skill as it applies to every single programming topic you will ever encounter. In fact, you should think of this exercise as using Pandas to teach you how to read documentation.

For this exercise you are free to switch back to Jupyter to make exploration and documenting what you learn easier. If you then want to make a project using Pandas you can take what you learn with Jupyter to create it.

There's a concept in painting called "the gestalt." The gestalt of a painting is how all of the parts of a painting fit together to create a single cohesive experience. Imagine I paint a portrait of you and create the most perfect mouth, eyes, nose, ears, and hair you've ever seen. You see each part is perfect and then you pull back, and when placed together...they're all wrong. The eyes are too close together, the nose is too dark compared to everything else, and the ears are different sizes. On their own, they're perfect, but when combined into a finished work of art they're awful because I didn't also pay attention to the gestalt of the painting.

For something to have high quality you have to pay attention to the qualities of each individual piece, and how those pieces fit together. Programmer documentation is frequently like this awful portrait with perfect features that don't fit together. Programmers will very clearly and accurately describe every single function, the nuances of every option to those functions, and every class they made. Then completely ignore any documentation that describes how those pieces fit together or how to use them to do anything.

This kind of documentation is everywhere. Look at Python's original sqlite3 documentation, then compare it to the latest version, which finally explains how to use placeholders. That's a fairly important topic you need for good security, and it was...just casually ignored for about a decade?
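
Since placeholders keep coming up, here's a minimal sketch of why they matter; the table and data are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")

# The ? placeholder lets sqlite3 quote the value safely, which is
# what prevents SQL injection. Never build queries with f-strings.
name = "Robert'); DROP TABLE user;--"
con.execute("INSERT INTO user (name) VALUES (?)", (name,))

row = con.execute("SELECT name FROM user WHERE name = ?", (name,)).fetchone()
print(row[0])
```

The hostile-looking name goes in and comes back out as plain data, and the table survives, because the placeholder never lets it be interpreted as SQL.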

Learning from this documentation requires a particular style of reading that's more active. That's what you will learn in this exercise.


Page 6

In this exercise you'll take the mess of random scripts you've made and create one clean tool that uses only Pandas for the entire process. You'll do this for both the TTB beer statistics, and the video watch time for my website.

Make a Project

It's time to get cleaned up and make a project for this exercise. You don't need to install this project, but it should have all the required project files, automated tests, a README.md, and the scripts necessary to run your tools.


Page 7

You can't do science without data, and the most widely used language for storing and managing data is SQL. Many "no-SQL" databases have some language that looks quite a lot like SQL. That's because--for all its faults--SQL is a fairly well thought out language for specifying the storage, querying, and transformation of data. Learning SQL basics can only help you in data science, but there's another important reason why I feel SQL is a great way to end the course:

I don't want this course to only be about Data Science. I use Data Science and Python as a theme to teach the basics of programming. They are simply tools that help me with my goal of teaching you how to use a computer to express your thoughts and ideas.

SQL shows its face in every part of the technology industry and in many personal projects. Your phone has a 100% chance of having numerous SQLite3 databases on it. Your computers all have SQLite3 databases on them. You find SQL in web applications, desktop applications, phone applications, and even in video games. If it's not in an application you install there's most likely a SQL database somewhere between you and some other computer on the internet. Even if something doesn't use a SQL database, it is most likely using something that is very similar to one.

That means learning SQL will not only benefit you as a Data Scientist, but it'll also benefit nearly every aspiring programmer no matter what journey they take in the medium.


Page 8

In the previous exercise we explored SQL basics using the European Central Bank's historic Euro data set. In this exercise I'm going to teach you about data modeling by reshaping this data into multiple tables to "normalize" it.

What is Normalization?

Normalization is about reducing redundancy in your data set. You see some form of redundancy, move it into a separate table, and then link the two tables via an id column. It gets far more complex and theoretical, but this is the general idea. Doing this has a few advantages:

  1. It reduces the size of your data, and reduced size generally improves performance (but not always).
  2. It helps you understand the structure of the data, possibly giving you insights that lead to better analysis.
  3. It makes many queries faster because you can narrow searches to specific data you want, rather than always searching all of it (but not always).
  4. It makes it easier to augment the data later since you can change the contents of a small isolated table rather than trying to change a giant table.
  5. It helps find errors in analysis since you're forced to explain how two pieces of data should be related. Does a User have one purchase or many purchases? Does that mean a purchase has many users or only one user? Normalization highlights these kinds of mistakes and forces you to formalize an answer.
  6. It makes you look like a real professional because you know what the word "normalization" means.

When you normalize a database you follow a process that goes through different "levels" or "normal forms" of quality:

  1. First Normal Form (1NF) has the goal of making 1 row and 1 column for every type and piece of data.
  2. Second Normal Form (2NF) has the goal of moving redundant discrete data into separate tables based on their relationship to keys in the table.
  3. Third Normal Form (3NF) requires that every piece of information in a row is only about the key of that row. This is where most people stop with normalization as further normalization can make the data more complicated than it needs to be for your application.

Let's take the ECB table and walk through normalizing it to second normal form (2NF). Going to third normal form (3NF) is not too useful in this data set.
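To make the idea concrete before we start, here's a sketch of a 2NF split using Python's sqlite3; the table and column names are my guesses at a schema, not the exact ECB one:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 1NF-style flat data: the currency symbol repeats on every row.
flat = [("2023-01-02", "USD", 1.06),
        ("2023-01-03", "USD", 1.05),
        ("2023-01-02", "JPY", 140.6)]

# 2NF: move the repeated currency into its own table...
con.execute("CREATE TABLE currency (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE)")
# ...and link each rate to it with a currency_id column.
con.execute("""CREATE TABLE rate (
    id INTEGER PRIMARY KEY, date TEXT, value REAL,
    currency_id INTEGER REFERENCES currency(id))""")

for date, symbol, value in flat:
    con.execute("INSERT OR IGNORE INTO currency (symbol) VALUES (?)", (symbol,))
    cur_id = con.execute("SELECT id FROM currency WHERE symbol = ?",
                         (symbol,)).fetchone()[0]
    con.execute("INSERT INTO rate (date, value, currency_id) VALUES (?, ?, ?)",
                (date, value, cur_id))

count = con.execute("SELECT count(*) FROM currency").fetchone()[0]
print(count)  # each currency is now stored exactly once
```

Each symbol appears once in currency no matter how many daily rates reference it, which is the redundancy reduction the definition above describes.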


Page 9

Our final exercise is going to cover the concept of relations in SQL. In technical terms every table is a relation, but we're going to be more specific and talk about tables that are connected to other tables in various ways.

One-to-Many (1:M)

A "relation" in SQL is a method of using id columns in tables to associate one table to another through a "one to many" or "many to many" relationship. In our ECB data we have a rate for each country, and a currency that rate applies to. We can say the following about this relationship between rate and currency:

"A Rate has one Currency, and a Currency has many Rates."

In our 2NF version of the ECB data, the first part is modeled by placing a currency_id in the rate table, so that each rate row has only one currency.id. This also implements the "Currency has many Rates" side, since a query for a currency.id in rate.currency_id will pull up all the daily rates for that one currency.
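
You can see both sides of the relationship in a single join query; this sketch uses sqlite3 with assumed table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE currency (id INTEGER PRIMARY KEY, symbol TEXT)")
con.execute("""CREATE TABLE rate (id INTEGER PRIMARY KEY,
    date TEXT, value REAL, currency_id INTEGER)""")
con.execute("INSERT INTO currency VALUES (1, 'USD')")
con.executemany("INSERT INTO rate VALUES (?, ?, ?, ?)",
                [(1, "2023-01-02", 1.06, 1),
                 (2, "2023-01-03", 1.05, 1)])

# "a Currency has many Rates": one currency.id matches many rate rows.
rows = con.execute("""SELECT rate.date, rate.value FROM rate
    JOIN currency ON rate.currency_id = currency.id
    WHERE currency.symbol = ?""", ("USD",)).fetchall()
print(rows)
```

One currency row, two rate rows, one join: that's the whole 1:M relationship in action.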

One-to-Many in Python

I find it helps to understand these concepts if you see how they're typically implemented in Python. If I wanted to say, "Rate has one Currency" in Python I'd do this:
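A minimal sketch of the idea, with class and attribute names that are my own assumption:

```python
class Currency:
    def __init__(self, symbol):
        self.symbol = symbol
        # "a Currency has many Rates": a list on the one side.
        self.rates = []

class Rate:
    def __init__(self, date, value, currency):
        self.date = date
        self.value = value
        # "a Rate has one Currency": a single attribute on the many side.
        self.currency = currency
        currency.rates.append(self)

usd = Currency("USD")
monday = Rate("2023-01-02", 1.06, usd)
tuesday = Rate("2023-01-03", 1.05, usd)
print(len(usd.rates), monday.currency.symbol)
```

The single attribute on Rate and the list on Currency mirror the currency_id column and the "many rows match one id" query in the SQL version.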


Page 10

Imagine it's 1820 and you want a nice portrait of your mother. You hear that paintings in pastel are all the rage and can be done quickly while still looking beautiful, especially in the candlelight you use to light your home at night. You contact an artist, they come to your home, do some initial sketches of your mother, and then schedule return visits to complete the painting. Since the artist uses pastel they can finish a very nice portrait in a record 6 hours of sitting, and make your mother look younger too. It also only costs you a week's salary which is a bargain compared to an oil painting. Those are very expensive and can take months.

Decades pass and your children want to have a nice portrait of you. It's 1840 and your children sign you up to sit for a photograph! It's so exciting because they look so real and they're so easy. You go to the photographer's studio, sit in a chair wearing your finest clothes, and the photographer takes the photo. The whole process takes maybe 30 minutes, with the photo taken in an instant. Within a few years even more ways to take photos are invented, and within a few decades photography begins to completely change the world, for better and worse. Eventually the pastel of your mother is long forgotten.

Today you (not the 1820s you) live in a world that Photography made possible. You are looking at this course either on a computer screen that is a direct descendant of the early cameras, or in a book that was printed using cameras. Your computer is also a direct descendant of photography, with the original process to create a CPU utilizing a process similar to developing film. Not only that, but your computer would not exist without the ability to use photography to exchange schematics, designs, documents, and many other artifacts necessary to construct all the equipment to make it. You are also most likely alive because of photography and painting, which helped pioneer modern chemical manufacturing by companies like Bayer. Without the industrialized chemistry perfected on pigments you would not have aspirin, antibiotics, x-rays, and photographs of DNA.

I firmly believe that Photography created the modern world, and I believe that you are currently standing on the edge of a similar revolution in computing with the recent invention of Generative AI. It's very early, but technology such as Large Language Models and Stable Diffusion are already useful technologies and only getting better. Eventually these technologies will feed into even better and more efficient technologies, in much the same way Photography created the silicon wafers that now power the sensors in modern cameras. If these technologies continue to advance then what happens to programmers?

Probably the same thing that happened to painters when Photography sufficiently advanced. Before photography you had to hire an artist if you wanted a memory of your mother, and all of those artists were out of work within one generation. Now it's odd to find an artist who can accurately paint a portrait. I believe programming will be very similar, where it will be odd to find a programmer who can actually code something from scratch without help.

If that's the case, then why bother learning to code? For the same reason I learned to paint a realistic portrait:

There is more to programming than just getting paid to turn buttons cornflower blue for some billionaire.

I learned to paint because I felt like I would enjoy it, and I do immensely. I can easily take a photo, but painting gives me a unique experience that I can't get from taking a photo. I learned to code because I really enjoyed making a computer do things, and programming gives me an experience I can't get if I let a Large Language Model do it for me. I code because I feel like I have to, not just because it pays the bills.

What does this mean for you as a new programmer? The story of photography and painting continues in the 1900s when painters realized they didn't have to do realistic paintings anymore. They could paint whatever they wanted, so they made paintings that reflected who they were and what they saw. Painting changed from a thing you did to pay the bills into a vehicle of human expression, which is what we consider art today.

I believe this will happen to programming soon as well. You'll see programmers being liberated from having to do mundane boring tasks like "make this button 20% larger." They'll instead be able to use computation to express their thoughts and feelings. Sure, people will still obviously do the boring work when they need money (and there's no shame in that). Many artists have painted a few cat portraits to pay the rent. But the vast majority of programming will change into a new art form for expressing yourself rather than just a boring job.

There's also the potential for AI to only wipe out jobs for junior developers, since every company will want to use AI to speed up development, but no company will trust the code AI writes. You can think of this possibility as companies keeping professional developers, and simply firing all the juniors to replace them with AI. Learning how to code well without assistance will be how you become good enough to ironically get a job using AI to help you code. You will also have this same problem in your own projects. If you are not very good at programming on your own then how would you know the code that an AI tool generates is good?

These events are not something that will happen soon, but I hope this book prepares you for the change. Learning about Data Science is the first step to understanding how Generative AI Models work. Understanding how this technology works will give you some control over the future of programming. Learning to code now is also the first step to creating the software you want to create. Maybe the future is everyone becomes some kind of indie game developer? Who knows, but you now have an amazing future ahead of you if you're willing to be flexible and embrace the new things that come along.

Until then, I'll be happy if you take what I taught you and get a job or create a small business. I don't want you to think I'm against programming as a job. After all, the artistic future of programming can't happen if you can't eat.

