
Web scraping economic indicators into Excel using Pandas

August 10, 2018

Previous entries on our blog have shown how easy it is to get data from APIs into Excel using Gridarrow. But what happens if the data you need isn’t easily accessible via an API, and is only available as HTML tables? That’s where web scraping comes in!

Excel does have some useful features for reading HTML tables from the web using Power Query. However, when the data you need is spread across multiple pages, or needs tidying up in other ways, Python is far more powerful.

What is Gridarrow?

Gridarrow is the missing server for Excel. It connects just about any data source to Excel with simple Python scripts.
Why is it awesome?
  • It's real-time, so as soon as the data is updated in the Gridarrow App it's updated in Excel, no need to poll or refresh.
  • It's multi-user so your colleagues can connect to the same data at the same time.
  • It simplifies your spreadsheets by moving data retrieval and pre-processing into Python.

Gridarrow ♥ Pandas

There are lots of tutorials on the internet for learning web scraping with Python, but in this blog post we’re going to use the Pandas library, simply because it’s the easiest way to go from web data to Excel with the minimum of fuss, and the maximum flexibility.
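
If you haven’t used pd.read_html before, here’s a minimal, standalone sketch of that path, with no Gridarrow involved (the output filename is just an example): read_html pulls the HTML tables on a page into a list of DataFrames, and to_excel writes a one-off static export.

import pandas as pd

# read_html downloads the page and returns a list of DataFrames,
# one per HTML table; match= keeps only tables containing the given text
tables = pd.read_html("https://tradingeconomics.com/", match="COUNTRIES", header=0)
df = tables[0]

# a one-off static export; with Gridarrow the same DataFrame is
# pushed into a live grid instead, as shown in the next example
df.to_excel("indicators.xlsx", index=False)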

A simple example

The code below will parse the big table of economic indicators from the Trading Economics website and make it available for all the users in your Gridarrow workspace to connect to.

The magic happens in the pd.read_html line, where Pandas loads the web page, parses the HTML table containing the text COUNTRIES, sets the first row as the header, and returns a DataFrame.

Simply assigning the DataFrame’s .values to our summary grid is enough to make the data flow to Excel.

import pandas as pd

from gridarrow import Arrow

arrow = Arrow()

@arrow.on_start
def on_start(_):
    # parse the table containing the text "COUNTRIES", first row as header
    df = pd.read_html("https://tradingeconomics.com/", header=0, match="COUNTRIES")[0]
    # size the grid to the DataFrame and push its values straight to Excel
    grid = arrow.grid('summary', rows=len(df.index), columns=list(df.columns))
    grid[:, grid.columns] = df.values

Building a dashboard of indicators

The previous example already got some useful data into Excel: the latest value of 11 economic indicators across the 50 biggest economies in the world. However, the Trading Economics website has more than 100 indicators that we can use to dig deeper in our analysis.

Here’s an example dashboard where we’ve used Gridarrow and Pandas to get these detailed indicators into Excel, showing the recent change in each indicator and where it sits in its historic range.

Economics dashboard in Excel using web scraping
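
The two derived figures in the dashboard are straightforward to compute from the columns the script below produces (Last, Previous, Min, Max). Whether you calculate them with Excel formulas or in Python is up to you; here’s a rough Pandas sketch with made-up numbers:

import pandas as pd

# one made-up row, using the column layout produced by the script below
df = pd.DataFrame({"Indicator": ["GDP Growth Rate"],
                   "Last": [2.1], "Previous": [2.0],
                   "Min": [-8.9], "Max": [16.7]})

# recent change: latest value minus the previous one
df["Change"] = df["Last"] - df["Previous"]
# position in the historic range: 0 = at the historic low, 1 = at the high
df["RangePosition"] = (df["Last"] - df["Min"]) / (df["Max"] - df["Min"])

print(df)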

You can download this spreadsheet and add the Trading Economics app to your workspace from the Gridarrow catalog, so you can check it out right away.

Don’t forget to update the name of the workspace in cell D2 to reflect your own workspace.

But before you jump in, here’s the full script so you can see how it all works:

import pandas as pd

from gridarrow import Arrow, DefaultValue

arrow = Arrow()

G20 = ["united-states", "euro-area", "china", "japan", "germany",
       "united-kingdom", "france", "india", "italy", "brazil",
       "canada", "south-korea", "russia", "spain", "australia",
       "mexico", "indonesia", "turkey", "netherlands", "switzerland",
       "saudi-arabia"]

@arrow.on_schedule("0 8 * * *", timezone='US/Eastern', initialize=True)
def on_schedule(_):
    for country in G20:
        df = pd.read_html(f"https://tradingeconomics.com/{country}/indicators")[1]

        # drop repeated header rows (their "Last" cell contains the literal text "Last")
        df = df[df['Last'] != 'Last']
        # rename column
        df = df.rename(index=str, columns={"Markets": "Indicator"})
        # expand the Range column into min and max columns
        df[['Min', 'Max']] = df['Range'].str.split(' : ', expand=True)
        # expand the Last column to separate the units
        df[['Last', 'Unit']] = df['Last'].str.split('  ', expand=True)
        # drop unnecessary columns
        df = df.drop(['Range', 'Unnamed: 6'], axis=1)
        # convert numeric columns to numeric type
        numeric_cols = ['Min', 'Max', 'Previous', 'Last']
        df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
        # reorder columns
        df = df[['Indicator', 'Reference', 'Frequency', 'Unit'] + numeric_cols]

        if country in arrow.grids:
            arrow.grids[country].close()

        grid = arrow.grid(country, rows=len(df.index),
                          columns=list(df.columns),
                          default_value=DefaultValue.EMPTY_STRING)

        grid[:, grid.columns] = df.values
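
Finally, if you want to sanity-check the tidy-up steps without hitting the website, here’s a small sketch on a single invented row. The values are made up, but the shape mirrors what read_html returns from the indicators table: value and unit fused in the Last column, and the Range column as "min : max".

import pandas as pd

# one invented raw row in the shape read_html returns for the indicators table
raw = pd.DataFrame({"Markets": ["GDP Growth Rate"],
                    "Last": ["2.1  percent"],   # value and unit, double-space separated
                    "Previous": ["2.0"],
                    "Reference": ["Jun/18"],
                    "Range": ["-8.9 : 16.7"],
                    "Frequency": ["Quarterly"]})

df = raw.rename(columns={"Markets": "Indicator"})
df[["Min", "Max"]] = df["Range"].str.split(" : ", expand=True)
df[["Last", "Unit"]] = df["Last"].str.split("  ", expand=True)
df = df.drop(["Range"], axis=1)
numeric_cols = ["Min", "Max", "Previous", "Last"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(df[["Indicator", "Reference", "Frequency", "Unit"] + numeric_cols])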