
Simple guide for HTML Web Scraping


Install the required packages:

# Debian-based
apt install python3-bs4
apt install python3-requests
 
# using pip
pip install beautifulsoup4
pip install requests


Sample code for parsing:

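# imports needed by the sample
import requests
from bs4 import BeautifulSoup

# obtain html using requests (the timeout avoids hanging on an unresponsive server)
response = requests.get('http://example.org', timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html = BeautifulSoup(response.text, 'html.parser')

# get page title: html.title is the <title> tag (None if the page has none),
# and .string is the text inside it
print(html.title)
print(html.title.string if html.title else None)

# select using a CSS selector (returns a list of elements, empty if nothing matches)
elements = html.select('#your-id .your-class a[href="value"]')

# examples of working with the results
if elements:
    # get "href" or "src"; .get() returns None if the attribute is missing
    print(elements[0].get('href'))
    print(elements[0].get('src'))

    # or use dictionary-style access; this raises KeyError if the attribute
    # is missing, so check with has_attr() first ('class' is returned as a list)
    if elements[0].has_attr('class'):
        print(elements[0]['class'])
    if elements[0].has_attr('style'):
        print(elements[0]['style'])

    # get the text of the element
    print(elements[0].get_text())  # all text, including text of nested tags
    print(elements[0].string)      # None if the element has more than one child node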
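
The same calls can be tried without network access by parsing an HTML string directly. The following is only a minimal sketch: SAMPLE_HTML and the ids, classes and links in it are made up for illustration and are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, so the example runs without fetching anything
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="menu">
      <ul class="links">
        <li><a href="/home" class="nav main">Home</a></li>
        <li><a href="/about" class="nav">About <b>us</b></a></li>
      </ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# text inside the <title> tag
title_text = soup.title.string

# all <a> elements below an element with class "links" inside #menu
links = soup.select('#menu .links a')
hrefs = [a.get('href') for a in links]

first, second = links

# 'class' can hold several values, so it comes back as a list
classes = first['class']

# .get() returns None for an attribute the element does not have
missing = first.get('src')

# get_text() joins the text of nested tags; .string is None here because
# the second link has two child nodes (a text node and a <b> tag)
text_full = second.get_text()
text_string = second.string

print(title_text)    # Sample page
print(hrefs)         # ['/home', '/about']
print(classes)       # ['nav', 'main']
print(missing)       # None
print(text_full)     # About us
print(text_string)   # None
```

The last two lines show why get_text() is usually the safer choice when an element may contain nested markup.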

Documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

docu/csheet/sysadm/script/python/html_scraping.txt · Last modified: 2022/01/16 02:13 by admin