Web Scraping and PostgreSQL

PostgreSQL and Python scraping #

Been meaning to learn Python programming? Want to see what the hype around
PostgreSQL is all about? Well, the best way to do both is to dive right in
with a scraping project. Following along, you'll learn the basics of running a
web scraper against a website to gather information and then putting that
information into a database you can run useful queries against.

[PostgreSQL][2] has been touted as the "world's most advanced open source
database", and offers various features that make it stand out from other
SQL databases such as MySQL and MariaDB. Although you might not make use of
those features immediately, it is good to start with a database that has them
available should you need them in the future. With a database chosen, the next
step is to fill it with data. It doesn't really matter whether you choose the
database first or the data first; what is important is that you choose the
tool that gets the job done.

The data #

Scraping data from a website is a way to take information available online,
and then either manipulate it immediately, or store it in a database for
future functionality. In this example, the data will come from an audition and
jobs site, StarNow, which allows models and photographers to advertise their
talent. Exploring the website, you'll find that model profiles include a
plethora of data, ranging from acting and modeling credentials to the physical
details of the model. Depending on how much data you want and the goals of your project, you
can choose to scrape all the data available, or just a small subset to make
the job quicker. Here the goal will be to scrape all the physical attributes
of models, and then put those attributes into a PostgreSQL database, using
Python.

Creating the database #

Following the installation instructions from the PostgreSQL website for
your OS, the first step is to create the database, either locally or in the
cloud with Compose.io. In order to create a new database, you must first
connect to the default postgres database and then run the command CREATE DATABASE starnow. Now there will be a starnow database available locally
(or on the cloud), and with that you will be ready to start putting data into
it. Alternatively, all of this can be done from within a Python script using
the psycopg2 library, an adapter layer that allows Python to communicate with
the Postgres database.

    from psycopg2 import connect
    from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
    
    # connect to the default admin database
    con = connect(database='postgres')
    
    dbname = "starnow"
    
    # autocommit ends transaction after every query
    con.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    cur = con.cursor()
    cur.execute('CREATE DATABASE ' + dbname)
    cur.close()
    con.close()

Scraping the data #

Beautiful Soup is a Python library for scraping and parsing, and although
other Python scraper libraries exist, this one is very well documented and
easy to use. Along with the requests library for HTTP communication,
these two tools are the bread and butter of Python scraping for projects that
aren't too complex. Other options will be discussed later. The general
scraping process is then:

1. request the search page
2. scrape the search page for models
3. request each model's page
4. scrape the model page for physical attributes
5. repeat for all models on the current search page
6. move on to the next search page until there are no more search pages
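
To make that flow concrete, here is a rough sketch of how the six steps fit
together. It is only an outline: the BASE URL and the scrape_profile stub are
placeholders for the real parsing covered in the rest of the article, and
stopping when a search page returns no profiles is just one way to detect the
last page (below, the page count is simply hard-coded).

    import requests
    from bs4 import BeautifulSoup

    BASE = "http://www.starnow.com.au"

    def scrape_profile(url):
        # placeholder for the per-model parsing shown later in the article
        page = requests.get(url)
        profileSoup = BeautifulSoup(page.content)
        data = {'talenturl': url}
        # ... pull the physical attributes out of profileSoup here ...
        return data

    page_number = 1
    while True:
        r = requests.get("%s/talent/?p=%d" % (BASE, page_number))
        searchSoup = BeautifulSoup(r.content)
        profiles = searchSoup.find_all("div", {'itemtype': ['http://schema.org/Person']})
        if not profiles:
            break  # no more search pages
        for talent in profiles:
            data = scrape_profile(BASE + talent('a')[0]['href'])
            # ... insert data into the database here ...
        page_number += 1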

The data fields that are available on the profile pages on StarNow are limited
to:

    variables = ['name', 'gender', 'location', 'username', 'talenturl', 'roles',
                 'height', 'hips', 'weight', 'shoe', 'ethnicity', 'hair', 'skin',
                 'hairlength', 'eye', 'hairtype', 'chest', 'dress', 'waist']

These variables will then become the columns for the physical table within
the starnow database.

    CREATE TABLE physical (
        id serial PRIMARY KEY,
        name varchar, gender varchar, location varchar, username varchar,
        talenturl varchar, roles varchar, height varchar, hips varchar,
        weight varchar, shoe varchar, ethnicity varchar, hair varchar,
        skin varchar, hairlength varchar, eye varchar, hairtype varchar,
        chest varchar, dress varchar, waist varchar
    );

This will create a table called physical with all the column names in lower
case and of type varchar. If you want case-sensitive column names, enclose
them in double quotes, e.g. "Name" varchar. For the sake of getting data into
the database quickly, every column was given type varchar, which maps directly
to Python strings.
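
If you prefer to stay in Python, the table can also be created through
psycopg2 by reusing the variables list from above. This is just a sketch of
that approach, with every column typed as varchar to match the SQL statement
above.

    from psycopg2 import connect

    variables = ['name', 'gender', 'location', 'username', 'talenturl', 'roles',
                 'height', 'hips', 'weight', 'shoe', 'ethnicity', 'hair', 'skin',
                 'hairlength', 'eye', 'hairtype', 'chest', 'dress', 'waist']

    conn = connect(database='starnow')
    cur = conn.cursor()
    # build "name varchar, gender varchar, ..." from the variables list
    columns = ', '.join('%s varchar' % v for v in variables)
    cur.execute('CREATE TABLE physical (id serial PRIMARY KEY, %s)' % columns)
    conn.commit()
    cur.close()
    conn.close()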

The lazy method of scraping the results pages is to first look at how many
results pages there are; in this instance the last page is 3528. Knowing this,
iterating requests through all the results pages is then:

    import requests
    from bs4 import BeautifulSoup

    for x in range(1, 3529):  # range stops before its end value, so 3529 covers page 3528
        searchPage = "http://www.starnow.com.au/talent/?p=%d" % x
        # grab the html and then use bs4 to parse it
        try:
            r = requests.get(searchPage)
            soup = BeautifulSoup(r.content)
        except:
            print "Error getting page " + searchPage
            pass

Parsing with Beautiful Soup #

Each search results page returns 20 different profiles; the next step is then
to identify how each profile appears in the HTML source. This is often the
part where scraping gets messy, since websites don't always have logical
coding styles or naming patterns, and it is left to the scraper to dig through
it all. Luckily the StarNow markup is nicely tagged, so each of the 20 ads on
the results page can be obtained by

    directory = soup.find_all("div", {'itemtype': ['http://schema.org/Person']})

which gives a Python list of 20 results. The next step is to request each
model's page, which is easily done once the username is obtained:

    for talent in directory:
        # go through the 20 ads and parse within them
        # the username is in the <a href="">
        data = {}
        data['name'] = talent('a')[0]['href'][1:]
        data['talenturl'] = "http://www.starnow.com.au/" + data['name']
        try:
            personalPage = requests.get(data['talenturl'])
            personalSoup = BeautifulSoup(personalPage.content)
        except:
            print "Error getting model page " + data['talenturl']
            pass

Once the model's profile page is obtained, you can begin looking for the data
that fills the table. A very handy feature of BeautifulSoup is the find
function, which lets you search through the document for particular HTML
tags. Indexing the result with ['content'] gives the content attribute of the
tag that BeautifulSoup finds:

    data['gender'] = personalSoup.find('meta', {'property': 'gender'})['content']
    data['name'] = personalSoup.find('meta', {'property': 'name'})['content']
    data['location'] = personalSoup('span')[2].string.strip()
    data['username'] = personalSoup.find('meta', {'property': 'profile:username'})['content']
    data['roles'] = personalSoup.find_all('div', {'class': 'profile__roles'})[0].string.strip()

The physical details are located within a table, but BeautifulSoup makes it
easy to find:

    physical = personalSoup.find('table', id='ctl00_cphMain_physicalDetails_memberAttributes')

With the physical details table HTML, the rest of the data for the database
table can then be filled in using a simple iteration.

    import re  # needed for the regular expression text matching below

    # create a dict mapping the db column names to search terms
    searchword = {'height':'Height:', 'hips':'Hips:', 'weight':'Weight:', 'shoe':'Shoe size:', 'ethnicity':'Ethnicity:',
                  'hair':'Hair color:', 'skin':'Skin color:', 'hairlength':'Hair length:', 'eye':'Eye color', 'hairtype':'Hair type:',
                  'chest':'Chest:', 'dress':'Dress Size:', 'waist':'Waist:'}
    
    if 'Model' in data['roles'] or 'Actor' in data['roles']:
        for x in searchword.keys():
            try:
                data[x] = physical('div', text=re.compile(searchword[x]))[0].next_sibling.strip()
            except:
                pass

Putting the data into the database #

Once all the data is collected, you are ready to put it into your PostgreSQL
database. The psycopg2 library makes this easy once again

    import psycopg2

    try:
        conn = psycopg2.connect(database="starnow")
        cur = conn.cursor()
    except:
        print "Unable to connect"

    cur.execute("INSERT INTO physical (%s) VALUES (%s)" % (','.join(data), ','.join('%%(%s)s' % k for k in data)), data)
    
    # don't forget to commit to make changes persistent 
    conn.commit()

That last step had a bit of weirdness going on with the INSERT statement, but
all it does is build the column list from the keys of the data dict along with
a matching set of %(key)s placeholders for the values. The values themselves
are passed to psycopg2 separately, so they are properly escaped and the
command runs without risk of SQL injection. The longer way would be to write
out every column name and placeholder by hand. Now all the data is in your
starnow database!
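
To see what that formatting actually produces, here is a hypothetical
two-column example; cur is the cursor from the block above, and psycopg2's
cursor.mogrify returns the final statement it would send to the server, with
the values safely quoted.

    # hypothetical two-column dict, just to show the expansion
    data = {'name': "Jane O'Doe", 'gender': 'Female'}

    columns = ','.join(data)                              # e.g. "name,gender"
    placeholders = ','.join('%%(%s)s' % k for k in data)  # e.g. "%(name)s,%(gender)s"
    query = "INSERT INTO physical (%s) VALUES (%s)" % (columns, placeholders)

    # mogrify shows the bound statement without executing it
    print cur.mogrify(query, data)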

Querying #

Well, with all that data in your database, what do you want to do with it?
Fear not, querying it with psycopg2 is also very simple. Website searches
often let you filter by gender, and that looks like:

    cur.execute("SELECT * FROM physical WHERE gender = 'Male'")
    male = cur.fetchall()

And with that you have all the scraped male models ready for whatever purposes
you'd like.

Using numpy and matplotlib you can create simple histograms to look at
distributions.
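
As a quick sketch, assuming the scraped height strings start with a number of
centimetres (e.g. "183 cm", which is an assumption about how the site formats
heights), the male heights can be pulled out of the physical table and
plotted:

    import re
    import matplotlib.pyplot as plt
    import numpy as np
    import psycopg2

    conn = psycopg2.connect(database="starnow")
    cur = conn.cursor()
    cur.execute("SELECT height FROM physical WHERE gender = 'Male'")

    heights = []
    for (raw,) in cur.fetchall():
        # assumes the varchar height begins with a number of centimetres
        match = re.match(r"\s*(\d+)", raw or "")
        if match:
            heights.append(int(match.group(1)))

    plt.hist(np.array(heights), bins=20)
    plt.xlabel("Height (cm)")
    plt.ylabel("Number of male models")
    plt.show()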

[Figure: a simple histogram showing the distribution of male model heights on
StarNow]

What next? #

Well that sums up a very simple and quick intro to scraping with Python,
BeautifulSoup4 and PostgreSQL. As stated earlier, this is only one of the many
options available to you for scraping; the same results could easily have been
accomplished using Python with Scrapy and SQLite. With all this
physical details data, you can now do lots of fun stuff: create recommendation
engines based on physical details, look at statistics for the models in the
database, build your own search app on top of the database, and so on. Having
a database and querying it is the backbone of most online web applications
these days. If you want to do more, creating more
tables within the database for more data like the acting and modeling
credentials is good practice as well.
