Web Scraping with Python
The options #
With a popular programming language like Python, there are many available
tools for whatever problem you want to solve. When it comes to scraping data,
I find it's a good skill to have, whether in Python or any other programming
language. We live in an age where the internet holds an enormous amount of
information, but sometimes that information goes away, or isn't easily
accessible in one place. So when I began looking at the options for my
scraping project, I considered Scrapy and the BeautifulSoup library.
Scrapy #
Scrapy is a very popular web crawling framework, which means it offers far
more features than a typical HTML scrape-and-parse workflow. The general idea
with Scrapy is that you set spiders loose on a website, and the spiders crawl
around and grab whatever information you want them to. The framework has been
polished for efficiency, though to be honest I'm not sure it's always faster
in practice. The power of Scrapy is that you can write a spider in minutes and
set it loose with very general requirements. This is very useful if, say, you
wanted to grab all images from a certain domain and save them to a database,
or grab all links from a domain, as sketched below. The spiders start off at
the first page and crawl their way through the domain. There is also plenty of
documentation for Scrapy, making it easy to adapt the examples to your own use.
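As a rough sketch of what such a spider can look like, here is a minimal one
that collects every link on a domain. The domain, spider name, and output
field are all placeholders, not from any real project:

import scrapy

class LinkSpider(scrapy.Spider):
    # Hypothetical spider: crawls a domain and yields every link it finds
    name = "links"
    allowed_domains = ["example.com"]   # keep the crawl on this domain
    start_urls = ["https://example.com"]  # placeholder starting page

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # Record the absolute URL of each link on the page
            yield {"link": response.urljoin(href)}
            # Follow the link so the spider keeps crawling the domain
            yield response.follow(href, callback=self.parse)

Running it with something like scrapy runspider link_spider.py -o links.json
writes the links out as JSON, and Scrapy's scheduler deduplicates requests so
the spider doesn't revisit pages forever.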
Why not Scrapy #
Lots of Python users will tell you to learn and use Scrapy for all your
scraping/crawling needs, and sure, that might be a good plan, until Scrapy
doesn't work. Since it's a framework, the error messages you get can be very
cryptic, and if you look at the list of issues on their GitHub repo you'll see
that while some get solved, some don't. This means you might spend a few
minutes (or much longer) in frustration getting your web crawler up and
running, only to have it fail with some weird error once you start it with the
conditions you actually wanted. So it's always good to have another tool ready
to go, which for me is the Requests library together with BeautifulSoup.
Requests makes the HTTP interaction easy, and BeautifulSoup makes it easy to
parse the resulting HTML tree. For example, to find all the tables with a
certain id or class in the page:
tables = soup.find_all('table', id="tableid")
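Putting it together, a minimal sketch might look like the following. The URL,
the table id, and the class name are placeholders; note that BeautifulSoup
spells the class filter class_ with a trailing underscore, since class is a
reserved word in Python:

import requests
from bs4 import BeautifulSoup

# Placeholder URL: fetch the page you want to scrape
response = requests.get("https://example.com/data")
response.raise_for_status()  # fail loudly on a bad HTTP status

soup = BeautifulSoup(response.text, "html.parser")

# Find tables by id, or by class ("class_" because "class" is reserved)
tables = soup.find_all("table", id="tableid")
results = soup.find_all("table", class_="results")

for table in tables:
    for row in table.find_all("tr"):
        # Collect the text of every cell in the row
        print([cell.get_text(strip=True) for cell in row.find_all("td")])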
I usually prefer this way of scraping when the data I'm looking for sits in a
specific format (like a table) within a page and the HTML is messy. I'm not an
expert at scraping, but having BeautifulSoup and Requests in my arsenal of
tools has served me well. As with most Python problems, though, use whichever
tool you're more comfortable with.