This shows you the differences between two versions of the page.
| docu:csheet:sysadm:script:python:html_scraping [2022/01/05 13:57] – admin | docu:csheet:sysadm:script:python:html_scraping [2022/01/16 02:13] (current) – clearer examples on attributes admin | ||
|---|---|---|---|
| Line 2: | Line 2: | ||
| \\ | \\ | ||
| - | Install the package | + | Install the packages |
| <code bash> | <code bash> | ||
| + | # Debian-based | ||
| apt install python3-bs4 | apt install python3-bs4 | ||
| + | apt install python3-requests | ||
| + | |||
| + | # using pip | ||
| + | pip install beautifulsoup4 | ||
| + | pip install requests | ||
| </ | </ | ||
| + | |||
| + | \\ | ||
| + | Sample code for parsing: | ||
| + | |||
| + | <code python> | ||
| + | import requests | ||
| + | from bs4 import BeautifulSoup | ||
| + | | ||
| + | # obtain html using requests | ||
| + | response = requests.get(' | ||
| + | html = BeautifulSoup(response.text, | ||
| + | |||
| + | # get page title (prints the whole title element; html.title.string gives only its text) | ||
| + | print(html.title) | ||
| + | |||
| + | # select using a CSS selector (returns a list of elements, empty if nothing matches) | ||
| + | elements = html.select('# | ||
| + | |||
| + | # work with the first match, if there is one | ||
| + | if len(elements) > 0: | ||
| + | # get " | ||
| + | print(elements[0].get(' | ||
| + | print(elements[0].get(' | ||
| + | |||
| + | # or use dictionary-style access (raises KeyError if the attribute is missing, while .get() returns None) | ||
| + | print(elements[0][' | ||
| + | print(elements[0][' | ||
| + | |||
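| + | # get the text of the element: get_text() joins all nested text, | ||
| + | # while .string is None when the element has more than one child | ||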
| + | print(elements[0].get_text()) | ||
| + | print(elements[0].string) | ||
| + | </ | ||
| + | |||
| + | Documentation on BeautifulSoup: | ||
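
The attribute and text accessors from the sample above can be tried without a network request. The following is a minimal sketch, assuming an inline HTML string in place of `response.text`; the markup, the `#content a.link` selector and all variable names are made up for illustration and do not come from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text from requests.get(...)
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="content">
      <a class="link" href="/first" title="First link">First</a>
      <a class="link" href="/second">Second <b>bold</b></a>
    </div>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python, so no extra package is needed
html = BeautifulSoup(SAMPLE_HTML, "html.parser")

# html.title is the <title> element; .string is the text inside it
page_title = html.title.string

# select() takes a CSS selector and returns a list (empty if nothing matches)
elements = html.select("#content a.link")
first, second = elements

# .get() returns None when the attribute is missing
first_href = first.get("href")
missing_title = second.get("title")

# dictionary-style access raises KeyError for a missing attribute
try:
    second["title"]
    raised = False
except KeyError:
    raised = True

# multi-valued attributes such as class come back as a list
first_classes = first["class"]

# get_text() joins all nested text; .string is None when the element
# has more than one child (here: a text node plus a <b> tag)
second_text = second.get_text()
second_string = second.string
first_string = first.string

print(page_title, first_href, missing_title, raised)
print(first_classes, repr(second_text), second_string, first_string)
```

Prefer `.get()` when an attribute may be absent, and `get_text()` when an element may contain nested tags.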