docu:csheet:sysadm:script:python:html_scraping

Differences

This shows you the differences between two versions of the page.

docu:csheet:sysadm:script:python:html_scraping [2022/01/05 13:57]
admin
docu:csheet:sysadm:script:python:html_scraping [2022/01/16 02:13] (current)
admin clearer examples on attributes
Line 2:
 \\
-Install the package required (Debian)
-Install <code>bs4</code> using pip, otherwise
+Install the packages required.
 <code bash>
 +# Debian-based
 apt install python3-bs4
 +apt install python3-requests
 +
 +# using pip
 +pip install bs4
 +pip install requests
 </code>
 +
 +\\
 +Sample code for parsing:
 +
 +<code python>
 +import requests
 +from bs4 import BeautifulSoup
 +
 +# obtain html using requests
 +response = requests.get('http://example.org')
 +html = BeautifulSoup(response.text, 'html.parser')
 +
 +# get page title (prints the whole <title> tag; use html.title.string for the text only)
 +print(html.title)
 +
 +# select using a CSS selector (returns a list of elements)
 +elements = html.select('#your-id .your-class a[href="value"]')
 +
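 +# examples of working with the results
 +if elements:
 +    # get "href" or "src"; .get() returns None if the attribute is missing
 +    print(elements[0].get('href'))
 +    print(elements[0].get('src'))
 +
 +    # or get using dictionary access; raises KeyError if the attribute is missing
 +    print(elements[0]['class'])
 +    print(elements[0]['style'])
 +
 +    # get text of the element
 +    print(elements[0].get_text())  # all nested text, concatenated
 +    print(elements[0].string)      # None if the element has more than one child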
 +</code>
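
The attribute and text accessors above behave differently when something is missing or nested. The following is a minimal offline sketch of those differences, not part of the original page: the inline HTML in `SAMPLE_HTML` and all variable names are made up for illustration, and no network request is made.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used instead of a live request
SAMPLE_HTML = """
<html><head><title>Demo</title></head>
<body>
  <div id="menu">
    <p class="item first"><a href="/home">Home</a></p>
    <p class="item"><a href="/docs">Read <b>the</b> docs</a></p>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup.title is the <title> element; .string is only its text
title_text = soup.title.string

# select() returns a list of matching elements in document order
links = soup.select("#menu .item a[href]")
first, second = links

# .get() returns None for a missing attribute
href = first.get("href")
missing = first.get("src")

# "class" is multi-valued, so it comes back as a list
classes = first.parent["class"]

# dictionary access raises KeyError for a missing attribute
try:
    first["style"]
    style_found = True
except KeyError:
    style_found = False

# get_text() joins all nested text; .string is None when the
# element has more than one child node
full_text = second.get_text()
only_string = second.string

print(title_text, href, missing, classes, style_found)
print(repr(full_text), only_string)
```

When an attribute may be absent, `.get()` is the safer choice; dictionary access is convenient when a missing attribute should be treated as an error.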
 +
 +Documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
docu/csheet/sysadm/script/python/html_scraping.1641391054.txt.gz · Last modified: 2022/01/05 13:57 by admin