docu:csheet:sysadm:script:python:html_scraping [2022/01/05 13:57] – created admin | docu:csheet:sysadm:script:python:html_scraping [2022/01/16 02:13] (current) – clearer examples on attributes admin | ||
---|---|---|---|
Line 2: | Line 2: | ||
\\ | \\ | ||
- | Install the package | + | Install the packages |
<code bash> | <code bash> | ||
+ | # Debian-based | ||
apt install python3-bs4 | apt install python3-bs4 | ||
+ | apt install python3-requests | ||
+ | |||
+ | # using pip | ||
+ | pip install beautifulsoup4 | ||
+ | pip install requests | ||
</ | </ | ||
+ | |||
+ | \\ | ||
+ | Sample code for parsing: | ||
+ | |||
+ | <code python> | ||
+ | import requests | ||
+ | from bs4 import BeautifulSoup | ||
+ | |||
+ | # obtain html using requests | ||
+ | response = requests.get(' | ||
+ | html = BeautifulSoup(response.text, | ||
+ | |||
+ | # get page title (prints the whole <title> tag; html.title.string gives only the text) | ||
+ | print(html.title) | ||
+ | |||
+ | # select using a CSS selector (returns a list of elements) | ||
+ | elements = html.select('# | ||
+ | |||
+ | # examples of working with the matches | ||
+ | if elements: | ||
+ | # get " | ||
+ | print(elements[0].get(' | ||
+ | print(elements[0].get(' | ||
+ | |||
+ | # or use dictionary-style access (raises KeyError if the attribute is missing): | ||
+ | print(elements[0][' | ||
+ | print(elements[0][' | ||
+ | |||
+ | # get the text content of the element | ||
+ | print(elements[0].get_text()) | ||
+ | # .string is None when the element has more than one child node | ||
+ | print(elements[0].string) | ||
+ | </ | ||
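The attribute and text lookups in the sample behave differently when something is missing or nested. Here is a minimal sketch of those differences, run against an inline HTML string instead of a live page. The `SAMPLE` markup, the `#home` selector and the variable names are made up for illustration and are not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to make the example self-contained.
SAMPLE = """
<html><head><title>Demo page</title></head>
<body>
  <a id="home" class="nav main" href="/index.html">Home <b>page</b></a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# select() takes a CSS selector and always returns a list (possibly empty).
elements = soup.select("#home")
link = elements[0]

# .get() returns None for a missing attribute, like dict.get().
href = link.get("href")
missing = link.get("title")

# Dictionary-style access works too, but raises KeyError when the attribute is absent.
# "class" is multi-valued, so BeautifulSoup returns it as a list.
classes = link["class"]

# get_text() joins the text of all descendants.
text = link.get_text()

# .string is only set when the element has a single child; here <a> has two.
only_string = link.string
bold_string = link.b.string

print(href, missing, classes, text, only_string, bold_string)
```

In short, `.get()` and `get_text()` are the forgiving variants, while `[...]` and `.string` are stricter and useful when a missing attribute or unexpected nesting should be noticed.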
+ | |||
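When fetching real pages it is often useful to set a timeout and to fail early on HTTP errors before parsing. The sketch below shows one way to wrap that. The helper names `fetch_soup` and `page_title` and the default timeout are assumptions for illustration, not part of the original snippet.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10.0):
    """Download a page and parse it (hypothetical helper)."""
    # timeout prevents the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def page_title(soup):
    """Return the stripped <title> text, or None if the page has no title."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)
```

A call would look like `page_title(fetch_soup("https://example.org/"))`, where the URL is only a placeholder.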
+ | Documentation on BeautifulSoup: |