Differences

This shows you the differences between two versions of the page.

--- docu:csheet:sysadm:script:python:html_scraping [2022/01/05 13:57]
admin
+++ docu:csheet:sysadm:script:python:html_scraping [2022/01/16 02:13] (current)
admin clearer examples on attributes
@@ Line 2: / Line 2: @@
 \\
-Install the package required (Debian). Install <code>bs4</code> using pip, otherwise
+Install the required packages.
 <code bash>
+# Debian-based
 apt install python3-bs4
+apt install python3-requests
+# using pip
+pip install beautifulsoup4
+pip install requests
 </code>
+\\
+Sample code for parsing:
+<code python>
+import requests
+from bs4 import BeautifulSoup
+
+# obtain html using requests
+response = requests.get('http://example.org')
+html = BeautifulSoup(response.text, 'html.parser')
+# get page title (prints the whole <title> tag; use html.title.string for the text only)
+print(html.title)
+# select using a CSS selector (returns a list of elements, possibly empty)
+elements = html.select('#your-id .your-class a[href="value"]')
+# examples of working with the matched elements
+if elements:
+    # get "href" or "src" (.get() returns None if the attribute is missing)
+    print(elements[0].get('href'))
+    print(elements[0].get('src'))
+    # or use dictionary-style access (raises KeyError if the attribute is missing;
+    # 'class' is returned as a list):
+    print(elements[0]['class'])
+    print(elements[0]['style'])
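+    # get all text inside the element, including nested tags
+    print(elements[0].get_text())
+    # .string is None when the element has more than one child
+    print(elements[0].string)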
+</code>
+Documentation on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
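
The attribute and text accessors in the sample behave differently when something is missing, which is easier to see offline than against a live page. The following is a minimal sketch, assuming `beautifulsoup4` is installed; the `SAMPLE_HTML` markup and the variable names are hypothetical and only reuse the selector from the sample.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page (not from the original).
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <div id="your-id">
    <p class="your-class note">
      <a href="value" style="color: red">First <b>link</b></a>
      <a href="other">Second link</a>
    </p>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# .title is the whole tag; .title.string is only its text
title_text = soup.title.string

# select() always returns a list, possibly empty
links = soup.select('#your-id .your-class a[href="value"]')
first = links[0]

# .get() returns None for a missing attribute; first['src'] would raise KeyError
href = first.get('href')
src = first.get('src')

# "class" is multi-valued, so it comes back as a list of strings
paragraph = soup.select_one('#your-id .your-class')
classes = paragraph['class']

# get_text() joins all nested text; .string is None when the tag has several children
text = first.get_text()
only_string = first.string

print(title_text, href, src, classes, text, only_string)
```

When an attribute may be absent, `.get()` is usually the safer choice. When nested tags are possible, `get_text()` is the more predictable way to read text than `.string`.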
