
Graphs, Gradients, and Groupby

What my first major Data Science project has taught me.

Diane Tunnicliffe
12 min read · Oct 15, 2020


As I wrap up my first project for Phase 1 in Flatiron School’s Data Science Program, I am overwhelmed by the number of things I now know how to do that I did not even know existed just two short months ago. The pace of this program is intense, to say the least. I find myself constantly juggling the many responsibilities of mom life while trying to fit in as much time as I can to devote to my learning. Essentially, what this means is that my two-year-old is frequently mad at me… and I haven’t been sleeping. (So if anyone wants to buy me an October gift, I’m a size XXXL coffee, thank you.)

However, this project has taught me many things. Before I get into the nitty-gritty code stuff, I need to touch on the fact that working on this project has taught me how to do better in the future. The issue I struggled with, and continue to struggle with, is that I’m a perfectionist by nature when it comes to academics. People who know me are sometimes surprised to hear this, because the standard mess level in my house is not one you’d associate with perfection. I write this staring at a room littered with toys and a table decorated with yesterday’s craft supplies. But when it comes to anything I’m learning or being evaluated on, I hold myself to a very high standard. And that has gotten me into trouble here.

The problem? We had two weeks to complete a major project as a group. I did it on my own, started it early, and gave myself about three weeks for completion. But I ended up doing three different versions of the project because I kept trying to switch directions and make it better. I kept telling myself I still had time, so I’d use that time to explore other paths. Instead of staying on track and focusing on getting through the necessary tasks, I kept abandoning my data in favor of something new, something that seemed like it would be better than what I had. I imported, I web-scraped, I made API calls, I cleaned and prepared the data — and then I started all over again with something different. In the end, despite starting early, I was rushing to complete the project because I had switched directions so many times. This taught me, or reminded me of, a very important lesson: stay on track, and don’t push so hard for perfection in the beginning stages. I’ve also learned that I am far too old to be pulling all-nighters and eating cereal at midnight while I count how many hours I have until the kids wake up!

The other problem is that, again with this perfectionist thing, I kept creating more and more visualizations for my analysis. We were required to have four meaningful visualizations; at the time of writing this, I had 27. The problem? I could have focused on making fewer visualizations that were more amazing, instead of an abundance of visualizations of varying greatness. My only explanation for this graphing and plotting overkill is that I was in the mindset of completing an extremely thorough analysis. I kept thinking, “now if I plot it like this, then my graph will show the data the way that I want it to be shown.” To be honest, I enjoyed the plotting and was excited every time I found a new way to display the data. (Brief digression: I was really amped about a particular seaborn violin plot I had made, until an anonymous source informed me that this particular plot looked like a rainbow of vaginas. So then I had to switch directions once again, only because I couldn’t unsee it, and sadly vaginas were irrelevant to my project.) Again, not a problem if you aren’t on a very strict time limit, but unfortunately, that’s exactly what I was on.

Now that I’ve had some time to reflect on my process, and have gained some very important insights for myself going forward, I can allow myself a moment to be proud of what I have learned and accomplished so far:

  • Web-scraping. Web-scraping was a topic I was highly uninterested in when we first learned about it in the course material. I disliked the ethical gray-area-ness of it; it felt inelegant compared to other data acquisition methods. But I think my lack of initial interest was only due to a lack of necessity. Once I found myself working on a project that required specific data, it became clear pretty quickly that web-scraping can be a fast and helpful way to import information. To be able to generate a financial report on a website and then write a quick Python function to grab that report and import it as a Pandas DataFrame was something I certainly didn’t know how to do until recently. So that felt pretty awesome once I realized the speed and simplicity of it. I may or may not have jumped up from my desk and danced around every time I was able to execute successful functions like this one:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def number_scraper(year):
    """
    Scrapes the 100 top-grossing movies from The-Numbers.com.

    Adds the year input to the url and scrapes the resulting table.

    Parameters:
    year (int): user input 4-digit year for movie gross to be scraped.

    Returns:
    numbers_df (Pandas DataFrame): A dataframe populated with movie gross table values.

    """
    # url for the full customized report of the top 100 movies for the given release year
    url = f"https://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/{year}/{year}/None/None/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&show-release-year=On&view-order-direction=desc&show-production-budget=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&show-genre=On&show-production-method=On&show-creative-type=On"
    response = requests.get(url)
    # creating soup from the page html
    soup = BeautifulSoup(response.text, 'lxml')
    # finding the report table
    table = soup.find('table')
    # converting the table's html into a string
    table_string = str(table)
    # reading the html string into pandas (returns a list of dataframes)
    table_read = pd.read_html(table_string)
    # grabbing the first (and only) table as a DataFrame
    numbers_df = table_read[0]
    return numbers_df
  • Making API calls. I very much enjoyed the course material on JSON and API calls. So it was no surprise that I had to incorporate this into my project. How amazing to be able to write a function that takes a movie title and returns literally anything you could possibly want to know about said movie! When I wrote a function that takes a movie’s title as an argument and retrieves its release date, director, and composer, then adds it to my DataFrame, I was incredibly proud. I also liked being able to keep my API key in a secret folder and just set it as a variable to be used in the function.
import json
import requests

def get_keys(path):
    """Retrieves the API key from a file kept outside the repo."""
    with open(path) as f:
        return json.load(f)

keys = get_keys("/Users/dtunnicliffe/.secret/TMDb_api.json")
api_key = keys['api_key']

def get_director(title):
    """
    Updates director information for a movie in the dataframe.

    Queries TMDb for a given movie title.
    Retrieves the TMDb movie_id for the title.
    Retrieves director information based on the movie_id.
    Converts the director information from a list to a string.
    Adds the new director value as a string to the movie's row in the dataframe.

    Parameters:
    title (str): user input movie title.

    Returns:
    None. Updates the movie's cells in the Pandas DataFrame in place.

    """
    title_r = title.replace(' ', '+')
    url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&query={title_r}"
    response = requests.get(url)
    if len(response.json()['results']) > 0:
        # taking the top search result as the match for this title
        movie_id = response.json()['results'][0]['id']
        url2 = f"https://api.themoviedb.org/3/movie/{movie_id}/credits?api_key={api_key}"
        response2 = requests.get(url2)
        crew = response2.json()['crew']
        directors = [member['name'] for member in crew if member['job'] == 'Director']
        # joining multiple directors into one clean comma-separated string
        d = ', '.join(directors)
        merged_df.loc[merged_df['title'] == title, 'director'] = d

# creating a list of all the super hero movie titles
superhero_titles = list(superhero['title'])
# getting director info for each movie in the list and updating the dataframe accordingly
for title in superhero_titles:
    get_director(title)
  • Plotting! Yes, we’ve already touched on my obsession with plots and graphs. But this really was a fantastic opportunity to expand my plotting knowledge. For this project, I utilized matplotlib and seaborn for my plots, and I genuinely enjoyed doing so. As a visual learner, I can see firsthand how plots tell a story and how important they are for a data scientist conveying findings to another person or organization. One thing that was both incredibly tedious and incredibly rewarding was annotating, or labeling, the points on some of my plots. For instance, I chose to annotate the points that represented the top-grossing movies for the subcategories my project focused on: super hero movies and animated children’s musical movies. Getting the movie titles into the right positions was time-consuming and sometimes frustrating. But seeing the end result made it all worth it for me.
import matplotlib.pyplot as plt
import seaborn as sns

# plotting world gross by budget for animated movies
sns.set_style('darkgrid')
sns.lmplot(x='budget_in_mil', y='world_gross_mil',
           data=merged_df.loc[merged_df['prod_method']=='Digital Animation'], aspect=3)
plt.title('World Gross by Budget for Animated Movies', fontsize=20)
plt.xlabel('Budget in Millions', fontsize=15)
plt.ylabel('World Gross in Millions', fontsize=15)
plt.xlim(0, None)
# dashed line marking the billion-dollar mark
plt.axhline(y=1000, ls='--', c='green')
# annotating selected points with their movie titles
for w, x, y, z in zip(merged_df['prod_method'], merged_df['budget_in_mil'],
                      merged_df['world_gross_mil'], merged_df['title']):
    if z == 'Frozen':
        plt.text(x=x-15, y=y-15, s=z.upper(), fontsize=12, color='black')
    elif z == 'Toy Story 3':
        plt.text(x=x-23, y=y-10, s=z.upper(), fontsize=12, color='black')
    elif z == 'Zootopia':
        plt.text(x=x+2, y=y, s=z.upper(), fontsize=12, color='black')
    elif z == 'Toy Story 4':
        plt.text(x=x+2, y=y, s=z.upper(), fontsize=12, color='black')
    elif w == 'Digital Animation' and y > 1000:
        plt.text(x=x+2, y=y-15, s=z.upper(), fontsize=12, color='black')
#saved in images as fig22
#plt.tight_layout()
#plt.savefig('./images/fig22.png')
# plotting world gross by budget and release month for super hero movies
sns.set_style('darkgrid')
g = sns.relplot(x='budget_in_mil', y='world_gross_mil',
                data=merged_df.loc[merged_df['creative_type']=='Super Hero'],
                hue='release_month',
                hue_order=['January', 'February', 'March', 'April', 'May', 'June', 'July',
                           'August', 'September', 'October', 'November', 'December'],
                s=130, aspect=3, palette='Paired')
plt.title('World Gross by Budget and Release Month for Super Hero Movies', fontsize=20)
plt.xlabel('Budget in Millions', fontsize=18)
plt.ylabel('World Gross in Millions', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlim(0, None)
g.fig.set_size_inches(16, 10)
# annotating selected points with their movie titles
for w, x, y, z in zip(merged_df['animated_or_super'], merged_df['budget_in_mil'],
                      merged_df['world_gross_mil'], merged_df['title']):
    if z == 'Avengers: Endgame':
        plt.text(x=x-60, y=y-15, s=z.upper(), fontsize=12, color='black')
    elif z == 'Avengers: Infinity War':
        plt.text(x=x+2, y=y-15, s=z.upper(), fontsize=12, color='black')
    elif z == 'Avengers: Age of Ultron':
        plt.text(x=x-76, y=y-15, s=z.upper(), fontsize=12, color='black')
    elif z == 'The Avengers':
        plt.text(x=x+2, y=y-15, s=z.upper(), fontsize=12, color='black')
    elif z == 'Spider-Man: Far From Home':
        plt.text(x=x-80, y=y-15, s=z.upper(), fontsize=12, color='black')
    elif z == 'Aquaman':
        plt.text(x=x, y=y+10, s=z.upper(), fontsize=12, color='black')
    elif z == 'Captain Marvel':
        plt.text(x=x+2, y=y-35, s=z.upper(), fontsize=12, color='black')
    elif (w == 'super') and (y > 1000):
        plt.text(x=x+2, y=y-15, s=z.upper(), fontsize=12, color='black')
# dashed line marking the billion-dollar mark
plt.axhline(y=1000, ls='--', c='green')
#saved in images as fig25
#plt.tight_layout()
#plt.savefig('./images/fig25.png')
  • Hues and color palettes. This piggybacks on the previous point, but using hues and color palettes with seaborn was both useful and fun. I opted to set most of my plots to the same color scheme so that they would match my presentation slides. But I loved the ability to set different gradients and color palettes in so many aesthetically pleasing combinations. I also consistently utilized the hue parameter in my seaborn plots to differentiate between values like movie release month or movie type, as in the sketch below.
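To illustrate the idea, here is a minimal sketch in the same spirit (not one of my actual project plots, though it reuses the merged_df columns from earlier):

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
# hue splits the bars by movie type; palette sets the color scheme for the whole plot
sns.barplot(x='release_month', y='world_gross_mil', hue='animated_or_super',
            data=merged_df, palette='Paired',
            order=['January', 'February', 'March', 'April', 'May', 'June', 'July',
                   'August', 'September', 'October', 'November', 'December'])
plt.title('World Gross by Release Month and Movie Type', fontsize=20)
plt.xticks(rotation=45)
plt.show()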
  • Mathematical/statistical techniques. Of course, a major component of any data science project is math. I was glad to be approaching this project already having knowledge of most of the statistical tools I would need. The simple .describe() method went a long way in grabbing important info like the mean, median, and percentiles for my numerical columns, such as domestic, international, and worldwide box office gross or production budget. It was also easy enough to calculate the interquartile range (IQR) and any other descriptive statistics I happened to require.
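A quick sketch of what that looked like, assuming the money columns produced by the cleaning function shown further down:

# summary stats (count, mean, std, percentiles) for the money columns
merged_df[['dom_gross', 'int_gross', 'world_gross', 'budget']].describe()

# interquartile range (IQR) for worldwide gross
q1 = merged_df['world_gross'].quantile(0.25)
q3 = merged_df['world_gross'].quantile(0.75)
iqr = q3 - q1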
  • Groupby. Using pandas.groupby was something I was a bit shaky with before this project. I understood it, but only in an abstract way. Having used it with the real movie data I acquired, I now have a much firmer grasp on how to use this in the future. In this project, I used it frequently to find mean and median world gross for categories like release month or movie type.
# getting mean and median world gross amounts by genre
genre_stats = merged_df.groupby('genre')['world_gross_mil'].agg(['median', 'mean'])
genre_stats.sort_values(by='mean', ascending=False)
  • Data cleaning. I’m putting this towards the bottom of this list (even though it arguably belongs at the top in terms of importance) because this part of the project, for all three versions of the project that I began, was the most daunting and the least enjoyable. By my third and final version, I had consolidated all my cleaning actions into a single function to speed up the cleaning and make it easier. But removing duplicates, figuring out what to replace each null value with (and then actually replacing said null values), modifying the column names and the columns themselves, and figuring out when dropping data was truly necessary… not so fun. Not to mention dealing with data that was in string format instead of integer format, or with years that were floats instead of datetime…
def clean(df):
    """
    Cleans and modifies a given dataframe according to the criteria set for this particular project.

    Drops the column called 'Unnamed: 0'.
    Converts column names to lowercase.
    Renames certain columns to make them shorter/more comprehensible.
    Removes dollar signs and commas from dollar amounts.
    Converts dollar amounts from strings into integers.

    Parameters:
    df (Pandas DataFrame): user input dataframe based on previously scraped table values from The-Numbers.com.

    Returns:
    df (Pandas DataFrame): A dataframe cleaned and adjusted as per the criteria listed above.

    """
    # drop the 'Unnamed' index column left over from the scrape
    df = df.drop(columns='Unnamed: 0')
    # make column names lowercase
    df.columns = [x.lower() for x in df.columns]
    # rename certain columns
    df = df.rename(columns={'released': 'release_date', 'released.1': 'release_year',
                            'productionmethod': 'prod_method', 'domesticbox office': 'dom_gross',
                            'internationalbox office': 'int_gross', 'worldwidebox office': 'world_gross',
                            'creativetype': 'creative_type', 'productionbudget': 'budget'})
    # remove dollar signs and commas from the dollar amounts,
    # then convert them from strings into integers
    for col in ['dom_gross', 'int_gross', 'world_gross', 'budget']:
        df[col] = df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
    return df
  • Datetime. I went from knowing absolutely nothing about datetime to knowing at least enough to convert some of my columns into datetime values and engineer new features (i.e. create new columns) based on those values. So that was cool. I still have a lot to learn on this topic, and as a result, I had to do certain things manually, like setting my release month order by name after many attempts to get my graphs to display in month order using datetime. But I am happy that I learned enough to convert some of my data to datetime values and to filter my DataFrame on the basis of datetime values as well.
# converting release_date column to datetime object
merged_df['release_date'] = pd.to_datetime(merged_df['release_date'])
# generating release_month value based on datetime release_date
merged_df['release_month'] = merged_df['release_date'].dt.strftime('%B')
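And a minimal sketch of that datetime-based filtering (the cutoff date and months here are just for illustration):

# filtering to movies released on or after a given date
recent = merged_df[merged_df['release_date'] >= '2010-01-01']
# filtering to summer releases via the .dt accessor
summer = merged_df[merged_df['release_date'].dt.month.isin([6, 7, 8])]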
  • Presenting. This is something I have struggled with since long before my data science path began. When I was in high school, “sophomore speeches” were a hallmark of our tenth-grade year, and I still remember the anxiety of standing in front of my entire grade of almost 500 students to deliver my speech on vegetarianism because I (unfortunately) made it to the final round. This is slightly easier because I’m presenting my nontechnical findings to one instructor over Zoom, but the anxiety is still very real. This is a skill I know I’ll need to improve so that I can be comfortable communicating my findings in my career as a data scientist. (Shoutout to my cohort-mate, Fen, for volunteering their time to help me practice this and get more comfortable with the idea of presenting! Connections like this make the journey so much better.)

Now that I can somewhat breathe again, we are entering Phase 2 on probability, statistical distributions, central limit theorem, hypothesis testing, and ANOVA. So at this point I’m just hoping that statistical knowledge is like riding a bike, and that everything I learned from my undergraduate stats days starts re-revealing itself in my brain. If you’ve made it this far, please feel free to reach out with any feedback! Thank you for reading.

For my entire Phase 1 project, including full Jupyter notebook and pdfs of my slideshow presentation, please visit my Project 1 GitHub Repo here.
