


Simple guide for HTML Web Scraping


Install the required packages.

# Debian-based
apt install python3-bs4
apt install python3-requests
 
# using pip
pip install beautifulsoup4
pip install requests
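
To confirm that both packages are importable after installation, a small check along these lines may help. This is only a sketch: the helper name missing_packages is hypothetical and not part of either library; it relies solely on the standard-library importlib.util.find_spec.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be imported."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# "bs4" and "requests" are the import names of the packages installed above
missing = missing_packages(["bs4", "requests"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

An empty result means the sample code further down should at least get past its imports.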


Sample code for parsing:

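import requests
from bs4 import BeautifulSoup

# obtain html using requests (the timeout avoids hanging on a dead server)
response = requests.get('http://example.org', timeout=10)
response.raise_for_status()  # raise on HTTP 4xx/5xx responses
html = BeautifulSoup(response.text, 'html.parser')

# select using a CSS selector (returns a list of matching elements)
elements = html.select('your.css selector[attr="value"]')

# examples on findings
if elements:
    first = elements[0]

    # get an attribute such as "value", "href" or "src" (None if absent)
    print(first.get('value'))
    print(first.get('href'))

    # get classes as a list; first['class'] raises KeyError if absent
    print(first.get('class'))

    # get all text inside the element, including nested tags
    print(first.get_text())
    # .string is None when the element has more than one child
    print(first.string)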
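
The same calls can be tried without network access by parsing an inline string. The sketch below assumes made-up markup (SAMPLE_HTML and its class names are purely illustrative) and mainly shows how get_text() and .string differ on nested content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
SAMPLE_HTML = """
<div class="item featured"><a href="/first">First <b>link</b></a></div>
<div class="item"><a href="/second">Second</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every element matching the CSS selector
links = soup.select("div.item a[href]")
hrefs = [a.get("href") for a in links]

# "class" is a multi-valued attribute, so it comes back as a list
first_div = soup.select_one("div.item")
classes = first_div.get("class")

# get_text() joins all nested text; .string only works for a single child
first_text = links[0].get_text()
first_string = links[0].string    # None: the <a> holds text plus a <b> tag
second_string = links[1].string   # the <a> holds a single text node

# select_one() returns None instead of raising when nothing matches
absent = soup.select_one("div.absent")

print(hrefs, classes, first_text, first_string, second_string, absent)
```

In practice get_text() is usually the safer choice when the element may contain nested tags, while .string is convenient for leaf elements.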

Documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

docu/csheet/sysadm/script/python/html_scraping.1641391690.txt.gz · Last modified: 2022/01/05 14:08 by admin