Rundown of Selenium and BeautifulSoup (BS4)
Hi everyone. For my first post here, I'm going to cover a topic I've seen people ask about in online Python communities such as r/learnpython: a quick rundown of how to use Selenium and BeautifulSoup to interact with websites and parse HTML. These concepts can be applied to anything from scraping the web to automating processes and building bots.
Honestly, this first post is all the info you need to begin effectively working with these modules.
Let's start with
SELENIUM
Selenium is a web-testing module that can be used to interact with web elements, with applications like the ones mentioned above.
To import selenium, I like to do the following:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, UnexpectedAlertPresentException, WebDriverException
from selenium.webdriver.chrome.options import Options
It doesn't add much load to import all of these, and you'll end up using most (probably all) if you are working with Selenium in any substantial way.
The last one is only necessary if you're using ChromeDriver, not Firefox. I would recommend ChromeDriver, as it seems a bit faster and cleaner to me.
Next, we need to initialize our WebDriver object:
opts = Options()
opts.add_argument('user-agent=your-user-agent')
driver = webdriver.Chrome('/path/to/chromedriver', chrome_options=opts)
driver.wait = WebDriverWait(driver, 15)
A few things here. One, where it says 'your-user-agent', you should put your actual user agent (shocker). This isn't strictly necessary, but sites often block or rate limit traffic whose user agent looks automated or is shared by every other unconfigured scraper, so setting your normal browser's user agent helps you blend in.
To get your user agent, google 'what is my user agent.'
If you're using ChromeDriver, you need to put the path to wherever the chromedriver executable lives on your machine.
Otherwise, for Firefox:
profile = webdriver.FirefoxProfile()
profile.set_preference('general.useragent.override', 'your-user-agent')
driver = webdriver.Firefox(profile)
driver.wait = WebDriverWait(driver, 15)
The driver.wait line attaches a WebDriverWait object to the driver; we'll use it for explicit waits. I'll get to waits in a second.
Now that we've got the driver initialized, let's interact with some web elements.
driver.get(url)
This opens url in our webdriver.
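For example, to load Reddit in the browser window Selenium is controlling:

driver.get('https://www.reddit.com')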
Interacting with websites via Selenium/bs4, like much of programming, consists largely of telling Python what things are and then what to do with them.
To use Selenium (and bs4) you must use the Web Inspector to analyze your webpage and find out how to identify web elements. I recommend using the Web Inspector in either Safari or Chrome, as these browsers offer the handy 'Copy XPath' functionality.
We can identify web elements in a number of ways. We can use HTML tag attributes such as name, id, class name, or tag name. We can use an XPath or a CSS selector. There are other options too, all listed in the Selenium docs.
Tag names look like this:
<a href='https://www.reddit.com' class='title may-blank outbound' title='efwefwerfwf'>...</a>
The 'a' is the tag name. The value after class is the class; all the other tag attributes (name, id, etc.) work the same way. They're shown in orange/yellow in the Safari Web Inspector.
CSS Selectors look like this:
a.title.may-blank.outbound
XPATHs look like this:
//*[@id="new_post_buttons"]/div[4]/div[2]/div/div[5]/div[1]/div/div[3]/div/div/button
You can get CSS selectors by mousing over elements in the Web Inspector. You can get an XPath by selecting an element (click the target icon in Safari or the arrow icon in Chrome, then click the desired element), right-clicking the corresponding HTML (it will be highlighted), and choosing 'Copy XPath.'
The syntax to find elements is as follows:
This returns the first matched web element (going down the source HTML):
link = driver.find_element_by_xpath('this-xpath')
And this returns a list of all matched elements:
links = driver.find_elements_by_tag_name('a')
Detailed syntax, showing the similar underscore methods for finding by each of the aforementioned parameters, can be found at the docs link from earlier. (Note that newer Selenium releases have since replaced these underscore methods with driver.find_element(By.XPATH, ...), but the idea is the same.)
This approach, using find_element(s)_by_xyz, relies on Selenium's implicit wait. When the driver is told to find an element, it must first wait for the element to load. The implicit wait tells the driver to keep polling for up to n seconds, which you set once with
driver.implicitly_wait(n)
before it throws a NoSuchElementException. (This is separate from the WebDriverWait we attached to the driver earlier; that one powers the explicit waits coming up below.)
This is NOT THE BEST PRACTICE. I really only use implicit waits when I need a list of all the matched elements:
time.sleep(5)  # needs an import time at the top
my_xyzs = driver.find_elements_by_xyz('my-xyz')  # 'xyz' stands in for xpath, tag_name, etc.
my_fav_xyz = my_xyzs[9]
Technically, you shouldn't need the time.sleep(5)... but implicit waits can be inconsistent, so I throw it in there to make sure the page has loaded by the time Selenium looks to construct the list of matching elements.
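If you'd rather avoid the hard-coded sleep, one alternative (a sketch, using the tag-name locator as an example) is to explicitly wait until at least one match is present before grabbing the full list:

driver.wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, 'a')))
links = driver.find_elements_by_tag_name('a')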
So, most of the time, you should use explicit waits. Instead of finding elements via the find_element(s) commands, use:
elem = driver.wait.until(EC.element_to_be_clickable((By.XPATH, 'my-xpath')))
Again, full syntax is available in the docs, detailing all the possible expected conditions (the EC). You can wait for the element in question to be clickable, visible, present, stale... you have a lot of options. Similarly, elements can be designated for waits By.XPATH, tag name, class name, CSS selector, and more.
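For instance (the locator values here are made up):

elem = driver.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'a.title.may-blank')))
elem = driver.wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'account')))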
But what if the element we need is only differentiated by an esoteric html tag attribute, I hear you lament.
Not to worry. We can use XPATH to designate a web element by ANY tag attribute.
elem = driver.wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@attr="value"]')))
The above code designates elem as the web element on that page whose tag attribute attr equals 'value'. You can replace the * between the // and [ with a tag name to specify further:
'//div[@attr="value"]'
This finds only div tags with attr='value'.
Once we have identified our web element by an HTML attribute, xpath, or css selector, and defined it in Python using selenium syntax, we can do many things to it:
elem.click()  # clicks elem
elem.send_keys('abc')  # types 'abc' into elem
elem.get_attribute('href')  # gets the 'href' attr of elem
elem.send_keys(Keys.COMMAND, 'v')  # pastes -- all keyboard shortcuts are similarly available
One caveat on shortcuts: ChromeDriver on OS X does not support most keyboard shortcuts. If you have to paste on OS X with ChromeDriver, the following will get the job done:
elem.send_keys(Keys.SHIFT, Keys.INSERT)
It doesn't matter if your Mac doesn't have an insert key -- Windows shortcuts seem to work in Selenium on Mac. I imagine other shortcuts can be used on ChromeDriver via this workaround.
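Putting a few of these together, a hypothetical search-box interaction might look like this (the XPath and search term are made up):

search = driver.wait.until(EC.element_to_be_clickable((By.XPATH, '//input[@name="q"]')))
search.send_keys('selenium tutorial')
search.send_keys(Keys.RETURN)  # hit Enter to submit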
To get the html of a page loaded in the driver:
driver.page_source
Other commands I use relatively often:
driver.back()  # goes back
driver.quit()  # quits
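One habit I'd suggest (my preference, not something Selenium requires): wrap your work in try/finally so the browser always closes, even if something raises along the way:

try:
    driver.get('https://www.reddit.com')
    # ... find elements, click things, scrape ...
finally:
    driver.quit()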
----THAT'S IT!
I mean, there's more to Selenium, but that's more than enough info for you to discover the rest on your own.
BeautifulSoup
A lot of the HTML stuff from up there translates to bs4 as well. bs4 is used to parse HTML. If you want to scrape info from a website, or whatever, bs4 is going to help you do it. The syntax is VERY straightforward -- gotta love Python.
Like any great chef (and the bs4 docs) will tell you, first we need to make the soup.
from bs4 import BeautifulSoup as bs4

driver.get('https://www.reddit.com')
soup = bs4(driver.page_source, 'html.parser')
So what's going on here? First, we import bs4. Then we use Selenium to open a URL. We then create our soup object. The first argument is driver.page_source, meaning we want to parse the source HTML of the current driver page. Then, 'html.parser' specifies which parser we want to use. You can use 'lxml' instead if you like (it's generally faster, but requires installing the lxml package). If one isn't working, try switching -- this has never been a problem for me.
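For example, the lxml version (assuming you've installed the lxml package):

soup = bs4(driver.page_source, 'lxml')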
Go ahead and
print(soup.prettify())
to see what's what here -- it'll be a bunch of nicely indented HTML. You can print(soup.get_text()) to get just the text.
Ok, so how do we actually parse the HTML? We use the find() and find_all() methods.
links = soup.find_all('a')
Both find and find_all accept a tag name as the first argument. The second argument is class_ (the underscore is there because class is a reserved keyword in Python).
account_links = soup.find_all('a', class_='account')
The difference between find() and find_all() is that find() returns the first match and find_all() returns a list of matches.
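In other words:

first_link = soup.find('a')      # a single Tag, or None if nothing matches
all_links = soup.find_all('a')   # a list of Tags (possibly empty)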
As before, we can find()/find_all() by ANY tag attr, this time by passing a dict:
names = soup.find_all('a', attrs={'id':'name'})
I find that SO nice. Hope you do too.
Now, these methods return tag(s). Meaning
soup.find('div')
will find the first 'div' tag in the HTML, and return everything between its opening <div> and its closing </div>.
I find that we rarely want the whole content of the tag haha. So, to grab just the TEXT in this tag, we can do:
soup.find('div').text
Or, to get the value of any tag attribute:
soup.find('a')['href'] #replace 'href' with whatever tag attr you want the value of
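One small gotcha: indexing like that raises a KeyError if the attribute is missing. If you're not sure it's there, the Tag .get() method returns None instead:

link = soup.find('a')
href = link.get('href')  # None if the tag has no href, rather than a KeyError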
FINALLY, a helpful tactic in web scraping: narrow your search!
If you want to go down LinkedIn Recruiter search results and grab everyone's name, first make a list of all the profile cards, and then look within each one for the name. That way you decrease the number of 'p' tags (or whatever) in your search area and make it easier to grab the right ones.
e.g.:
cards = soup.find_all('div', class_='profile_card')
for card in cards:
    name = card.find('p')
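To actually collect all the names into a list (the class and tag names here are illustrative -- check your page's real structure in the Web Inspector):

names = []
for card in soup.find_all('div', class_='profile_card'):
    p = card.find('p')
    if p:  # skip cards without a 'p' tag
        names.append(p.text)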
And, actually, a helpful tactic in building bots/automating processes: you can use bs4 to scrape a website and make the bot's job easier. If it's not immediately clear how to interact with the web elements to get your desired outcome, pull up the Web Inspector and see if the link (or whatever element you need) is stored in the HTML somewhere! Then you can just pull
driver.page_source
with bs4 and parse out what you need. Often, link 'suffixes' such as '/post/comments/12314141/this-is-a-great-post-man' will be stored in the 'href' attrs of HTML tags. You can parse that out and store it in link, and then do
driver.get(url+link)
to save you some hassle. Just a thought.
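A quick sketch of that pattern (the class name and URL structure are hypothetical):

url = 'https://www.reddit.com'
soup = bs4(driver.page_source, 'html.parser')
link = soup.find('a', class_='comments')['href']  # e.g. '/post/comments/...'
driver.get(url + link)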
Anyway, I hope you got some value from this. If so, LMK! I might make videos doing some examples or respond to specific questions or just otherwise maintain some sort of presence in this line of content.