Web Scraping for Data Pipelines in Python

EBISYS Tech

Welcome to the Web Scraping for Data Pipelines in Python tutorial. In this tutorial we will scrape the web for data and build a table from it. This technique is useful for gathering data from unstructured data sources. The data we will work with is stock prices, which falls under the finance industry.

Let's begin!

We need to import a couple of Python libraries for data exploration, processing, visualization and web scraping.

Pandas - We will be using pandas DataFrames for their functions and for a tabular representation of our dataset.
BeautifulSoup - For parsing HTML and XML documents. We will be using this library for web scraping.
requests - To download the web page in HTML format.

We will be working in a Jupyter notebook for this tutorial, but you are welcome to use a Python IDE of your choice.

Let's import our libraries:

import pandas as pd
import requests
from bs4 import BeautifulSoup

Next, we make a GET request to the website that we will be scraping and store the page object in a variable. We will then use this object to access the HTML code of the website.

page = requests.get("https://www.fool.com/investing/top-stocks-to-buy.aspx")
page

Output:
<Response [200]>

Fig 1: Website to scrape
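A 200 response means the request succeeded, but requests.get does not raise an error on HTTP failure statuses, so a 404 page could be parsed silently. As a sketch of a safer fetch for a pipeline (the helper name fetch_page is my own, not part of this tutorial):

```python
import requests

def fetch_page(url, timeout=10):
    """Download a page, raising requests.HTTPError for 4xx/5xx responses."""
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()  # no-op on success, raises HTTPError otherwise
    return page
```

The timeout also keeps a scheduled pipeline from hanging forever on a slow server.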

We will be working with the html of the page so let's view it:

page.content

Output:
b'\n\n<!DOCTYPE html>\n\n<html lang="en"  prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# article: http://ogp.me/ns/article#">\n\n  <head>\n  <script>\n    // usmf-django\nwindow.top!=window&&(window.analytics=window.top.analytics,window.inIframe=!0);var segmentKey="16mdwrvy5p",segmentSnippetVersion="4.1.0",getSegmentUrl=function(t){var t=t||window.segmentKey;return("https:"===document.location.protocol?"https://":"http://")+"cdn.segment.com/analytics.js/v1/"+t+"/analytics.min.js"},trackerMaker=function(t){var n=[];n.invoked=!1,n.methods=["trackSubmit","trackClick","trackLi

Next, we are going to create a BeautifulSoup object and pass the page content to it:

soup = BeautifulSoup(page.content, 'html.parser')

Now let's view the page content after parsing it with our BeautifulSoup object:

print(soup.prettify())

Output:
<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# article: http://ogp.me/ns/article#">
 <head>
  <script>
   // usmf-django
window.top!=window&&(window.analytics=window.top.analytics,window.inIframe=!0);var segmentKey="16mdwrvy5p",segmentSnippetVersion="4.1.0",getSegmentUrl=function(t){var t=t||window.segmentKey;return("https:"===document.location.protocol?"https://":"http://")+"cdn.segment.com/analytics.js/v1/"+t+"/analytics.min.js"},trackerMaker=function(t){var n=[];n.invoked=!1,n.methods=["trackSubmit","trackClick","trackLink","trackForm","pageview","identify","reset","group","track","

As we can see, the HTML text is now readable, which will help us identify the HTML tags that contain the data we want.

The output is readable, but it is not easy to interact with, so we will use Google Chrome's developer tools to identify the tags that contain the data we want to scrape. Let's go to the website we are going to scrape:

https://www.fool.com/investing/top-stocks-to-buy.aspx


Next, locate the Stocks section of the website:


Our main goal is to scrape the list of stocks and build a dataset from it. So now let's right click the Stocks section and select 'inspect'.



In the above image, you will see that the 'Stocks' heading has been highlighted. You can now interact with the tags; on the left-hand side, the equivalent section will be highlighted as well, which lets us know exactly which tag belongs to which section on the web page.

Each stock has its own div element, as can be seen below. We will loop through each div element to extract the data on each stock.



Next, we need to isolate the Stocks section of the web page, create a subset of our page content, and store that in a variable called stocks. To achieve this, we use the find method of the BeautifulSoup object, passing the class name ('related-tickers') of the tag that contains our Stocks section.

stocks = soup.find(class_='related-tickers')
stocks

Output:
<section class="related-tickers">
<div class="wayfinder with-rule">
<hr class="wayfinder-rule"/>
<h2>Stocks</h2>
</div>
<div class="ticker-row" data-instrument-id="202816">
<span class="image-wrap">
<a class="quote-image" href="https://www.fool.com/quote/nasdaq/amazon/amzn/">
<h5>AMZN</h5>
<img alt="Amazon Stock Quote" data-img-src="https://g.foolcdn.com/image/?url=https%3A%2F%2Fg.foolcdn.com%2Fart%2Fcompanylogos%2Fmark%2FAMZN.png&amp;w=64&amp;h=64&amp;op=resize" src=""/>
</a>
</span>
<div class="ticker-text-wrap">
<h3>Amazon</h3>
<h4 class="h-margin-b">
<span class="ticker">
<a href="https://www.fool.com/quote/nasdaq/amazon/amzn/" title="Amazon Stock Quote">
                    NASDAQ:<span class="symbol">AMZN</span>
</a>
</span>
</h4>
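To see how find(class_=...) behaves in isolation before applying it to the live page, here is a minimal, self-contained sketch on a toy HTML snippet (the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<section class="related-tickers">
  <h2>Stocks</h2>
  <div class="ticker-text-wrap"><h3>Amazon</h3></div>
</section>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first tag whose class matches, or None if nothing matches
stocks = soup.find(class_='related-tickers')
print(stocks.h2.get_text())  # Stocks
```

Because find() returns None when nothing matches, a pipeline should check for that before chaining further lookups, in case the site changes its class names.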

Now the stocks object contains all of our data, so we can focus on it and ignore the rest of the page content.

In our stocks object, we will search for all the tags with the class name 'ticker-text-wrap'. This will create a list of the stocks, including information on each one such as the stock price. We will store the list in a variable called stock_picks.

Below is a preview of the ticker-text-wrap div section:


stock_picks = stocks.find_all(class_='ticker-text-wrap')
stock_picks

Output:
[<div class="ticker-text-wrap">
 <h3>Amazon</h3>
 <h4 class="h-margin-b">
 <span class="ticker">
 <a href="https://www.fool.com/quote/nasdaq/amazon/amzn/" title="Amazon Stock Quote">
                     NASDAQ:<span class="symbol">AMZN</span>
 </a>
 </span>
 </h4>
 <aside class="price-quote-container smaller">
 <h4 class="current-price">
                   $1,900.10
               </h4>
 <h4 class="price-change-arrow price-neg">
 <span style="position:absolute;left:-999em;">down</span>
 <i class="fool-icon-arrow-down"></i>
 </h4>
 <h4 class="price-change-amount price-neg">
                   $-55.39
               </h4>
 <h4 class="price-change-percent price-neg">
                   (-2.83%)
               </h4>
 </aside>
 </div>, <div class="ticker-text-wrap">
 <h3>Constellation Brands</h3>
 <h4 class="h-margin-b">
 <span class="ticker">
 <a href="https://www.fool.com/quote/nyse/constellation-brands/stz/" title="Constellation Brands Stock Quote">
                     NYSE:<span class="symbol">STZ</span>
 </a>
 </span>
 </h4>
 <aside class="price-quote-container smaller">
 <h4 class="current-price">
                   $144.88
               </h4>
 <h4 class="price-change-arrow price-pos">
 <span style="position:absolute;left:-999em;">up</span>
 <i class="fool-icon-arrow-up"></i>
 </h4>
 <h4 class="price-change-amount price-pos">
                   $4.19
               </h4>
 <h4 class="price-change-percent price-pos">
                   (2.98%)
               </h4>

 </aside>
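As an aside, find_all(class_=...) can also be written as a CSS selector via select(). A quick sketch on a toy snippet (made up for illustration) showing the two are equivalent:

```python
from bs4 import BeautifulSoup

html = '''
<section class="related-tickers">
  <div class="ticker-text-wrap"><h3>Amazon</h3></div>
  <div class="ticker-text-wrap"><h3>Ford</h3></div>
</section>
'''
soup = BeautifulSoup(html, 'html.parser')

# These two calls return the same tags:
by_class = soup.find_all(class_='ticker-text-wrap')
by_css = soup.select('section.related-tickers div.ticker-text-wrap')

print([div.h3.get_text() for div in by_class])  # ['Amazon', 'Ford']
```

CSS selectors are handy when the tag you want is only identifiable by its position under a parent, rather than by a class of its own.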

Now we have all the stocks and the information on each stock stored in a list. This will allow us to loop through the list and extract information on each stock.

We will now start to build our dataset. The first thing we need to do is get the company name of each stock and store the names in a list:

# Get stock names
stock_names = []
for stock in stock_picks:
    stock_names.append(stock.h3.get_text())

stock_names

Output:
['Amazon',
 'Constellation Brands',
 'Ford',
 'AT&T',
 'Vanguard Total Stock Market ETF',
 'Axon Enterprise',
 'Alphabet (C shares)',
 'Vanguard Total International Stock ETF',
 'Netflix',

The next thing that we will do is extract all the stock symbols:

# Get stock symbol
stock_symbol = []
for stock in stock_picks:
    stock_symbol.append(stock.a.span.get_text())

stock_symbol

Output:
['AMZN',
 'STZ',
 'F',
 'T',
 'VTI',
 'AAXN',
 'GOOG',
 'VXUS',
 'NFLX',

The next thing that we will do is extract the current price for each stock:

# Get Current Price
current_price = []
for stock in stock_picks:
    price = stock.aside.h4.get_text()
    current_price.append(price.strip())

current_price

Output:
['$1,900.10',
 '$144.88',
 '$5.19',
 '$29.84',
 '$127.00',
 '$74.41',
 '$1,110.71',
 '$41.41',

The next thing that we will do is extract the price change amount for each stock. We will be using the stocks object instead of the stock_picks object, searching for the 'price-change-amount' class to get the price change values.

price_change = stocks.find_all(class_='price-change-amount')
# Get Change Price
change_price = []
for change in price_change:
    price = change.get_text()
    change_price.append(price.strip())

change_price

Output:
['$-55.39',
 '$4.19',
 '$0.06',
 '$0.76',
 '$-4.12',
 '$-2.11',
 '$-51.04',
 '$-1.48',

The next thing that we will do is extract the price change percentage of each stock, using the same method as above: again we search the stocks object instead of stock_picks.

percent_change = stocks.find_all(class_='price-change-percent')
# Get Change Percent
change_pct = []
for pct in percent_change:
    price = pct.get_text()
    change_pct.append(price.strip())

change_pct


Output:
['(-2.83%)',
 '(2.98%)',
 '(-1.14%)',
 '(-2.48%)',
 '(-3.14%)',
 '(-2.76%)',
 '(-4.39%)',
 '(-3.45%)',
 '(-1.62%)',

Now we have extracted all the data that we need from each stock and stored the values in lists. We will use these lists to create a pandas dataframe.

Each list represents a field in our dataframe. These are the field names we will create:

Symbol - Stock symbol
Company - The company name of the stock
Price - The current stock price
PriceChange - The current stock price-change amount
PercentChange - The current stock price-change percentage value

The first thing we need to do to build our dataframe, is to create a dictionary from all of our lists:

data = {'Symbol':stock_symbol, 'Company':stock_names, 'Price':current_price,
        'PriceChange':change_price, 'PercentChange':change_pct}

Now we can create our dataframe using the dictionary we have created:

df = pd.DataFrame(data)

Now let's preview our dataframe:

df.head()

Output:

  Symbol                          Company      Price PriceChange PercentChange
0   AMZN                           Amazon  $1,900.10     $-55.39      (-2.83%)
1    STZ             Constellation Brands    $144.88       $4.19       (2.98%)
2      F                             Ford      $5.19       $0.06      (-1.14%)
3      T                             AT&T     $29.84       $0.76      (-2.48%)
4    VTI  Vanguard Total Stock Market ETF    $127.00      $-4.12      (-3.14%)

As you can see, we have successfully created a table from the data derived from scraping the website. Right now the data is semi-structured; more transformations can be applied to clean it and remove the special characters from the numerical values.

I will leave that as an exercise.
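If you want to check your work on that exercise, one possible cleaning pass looks like this. It is a sketch, not the only approach; it assumes the column names created above and uses a couple of rows copied from the outputs earlier in the tutorial:

```python
import pandas as pd

# Two sample rows taken from the scraped values shown above
df = pd.DataFrame({
    'Symbol': ['AMZN', 'STZ'],
    'Company': ['Amazon', 'Constellation Brands'],
    'Price': ['$1,900.10', '$144.88'],
    'PriceChange': ['$-55.39', '$4.19'],
    'PercentChange': ['(-2.83%)', '(2.98%)'],
})

# Strip $, commas, parentheses and % signs, then convert to floats
for col in ['Price', 'PriceChange', 'PercentChange']:
    df[col] = (df[col]
               .str.replace(r'[$,()%]', '', regex=True)
               .astype(float))

print(df.dtypes)
```

With numeric dtypes in place, the dataframe is ready for downstream pipeline steps such as sorting, aggregation or plotting.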

Want to build complete web scraping applications? Try out our course on Web Scraping and Mapping Dam Levels in Python and Leaflet.

https://ebisysedulytica.teachable.com/p/web-scraping-and-mapping-dam-levels-in-python-and-leaflet
