Web Scraping

Getting data off the Internet

Friday, March 5, 2021 · 5 - 7 PM

Tonight's meeting will be about web scraping. Web scraping is basically any use of automated software ("bots") to get data off a website.

There are a few rough categories of web scraping, in order from more elegant and "legitimate" to more hacky and sneaky:

1) Hitting a website's public API, probably a JSON API. We'll use Python and the Requests library to do this.

2) Hitting a website's internal/undocumented API.

3) Making requests and extracting data from the HTML. This is probably web "scraping" in the strictest sense. We'll use Beautiful Soup to do this.

4) Simulating an actual user/browser interacting with the page. Selenium is the main tool for this; we won't cover it in the workshop, but we'll talk about it a bit.
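As a rough preview of categories 1 and 3, here is a minimal sketch. It is not the workshop material: the helper names (`fetch_json`, `LinkCollector`, `extract_links`) and the sample HTML are made up for illustration. The JSON helper uses the Requests library as mentioned above; for the HTML part, the sketch swaps in Python's built-in `html.parser` instead of Beautiful Soup so it runs without any extra installs. Beautiful Soup does the same job with much less code.

```python
from html.parser import HTMLParser


def fetch_json(url):
    """Category 1: call a JSON API and return the decoded response body."""
    # Imported here so the HTML helpers below still work without Requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


class LinkCollector(HTMLParser):
    """Category 3: collect (href, text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag we are currently inside, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical page content, standing in for HTML you would normally download first.
SAMPLE_HTML = """
<ul>
  <li><a href="/events/1">First event</a></li>
  <li><a href="/events/2">Second event</a></li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

In a real scrape you would fetch the page first (for example with `requests.get(url).text`) and pass the result to `extract_links`. `fetch_json` needs a URL that actually serves JSON, so it is left uncalled here.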

It's probably best to come with Python installed! VPN users, we'll have containers set up for you on Greenbank, but there's a limited number of them, so local Python is probably preferable.

Meeting link, as always: https://meet.jit.si/SADClubMeetingSpring2021

Event Info

posted	March 5, 2021
sponsor	System Administration and Software Development Club

