
Scraper App

In this tutorial, we demonstrate the use of regular expressions and Beautiful Soup to scrape contact information from websites. Rather than explain every detail of the necessary code, we present the fundamental and novel pieces that the reader may find interesting. Readers interested in reproducing the app are encouraged to visit the full GitHub code for more information.

Dependencies

We've executed the app with Python 3.5 and the modules re, urllib3, and bs4. While any version of Python 3 already comes equipped with re, the latter two modules can be installed with pip.

$pip install urllib3
$pip install beautifulsoup4

Re

We've used regular expressions to pull key pieces of contact information from the text of the website. Many of the regex patterns we've used in our algorithm are reviewed below, but we encourage interested parties to visit the Python documentation for more information.

The re.search function gives us a match object whose information can be accessed using match.group.

$match = re.search('instagram', 'https://www.instagram.com/')
$match.group(0)
:'instagram'

When applying re.search to the links on the given website, we search for the first link that includes the word facebook.

$match = re.search('facebook', link)
$if match:
  facebookLink = link

Different regex patterns were written based on the country of origin of any phone number. Individual characters placed inside square brackets, [], match any one of those characters. When considering phone numbers, we would like to match both
867-5309
867 5309
Hence, we will utilize [ -] in our algorithm.
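
For instance, a pattern like the one below matches both forms (a minimal sketch, not the full pattern used in the app).

$re.search('\d{3}[ -]\d{4}', '867-5309').group(0)
:'867-5309'
$re.search('\d{3}[ -]\d{4}', '867 5309').group(0)
:'867 5309'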

We would also like to match numbers dialed long distance or locally.
1 (415) 523-0057
(415) 523-0057
The question mark, ?, to the right of some character(s) matches exactly 0 or 1 repetitions of those character(s). Hence, we will utilize 1? in our algorithm.
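
For example, making the leading 1 optional lets a single simplified pattern match both forms above (again a sketch, not the exact pattern from the app).

$pattern = '(1 )?\(\d{3}\) \d{3}-\d{4}'
$re.search(pattern, '1 (415) 523-0057').group(0)
:'1 (415) 523-0057'
$re.search(pattern, '(415) 523-0057').group(0)
:'(415) 523-0057'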

While \w matches any alphanumeric character (or underscore), \d matches any digit 0-9. Because we were looking for blocks of 3 or 4 digits, we utilized \d{3} or \d{4}.
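
For instance, \d{4} pulls out the first block of four digits.

$re.search('\d{4}', '867-5309').group(0)
:'5309'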

We chose to create separate regexes based on the country; our algorithm detects phone numbers only from those countries.

$usa = '(1|\+1)?([ -])?(\d{3}|\(\d{3}\))?([ -])?\d{3}([ -])?\d{4}'
$swiss = '(41|\+41)?(0|\(0\))?\d{2}([ -])?\d{3}([ -])?\d{2}([ -])?\d{2}'
$china = '(86|\+86)?([ -])?1\d{2}([ -])?\d{4}([ -])?\d{4}'
$germany = '(49|\+49)?([ -])?(\d{3}|\(\d{3}\))?([ -])?\d{3}([ -])?\d{4}'
$uk = '\(?0\d{2,3}\)?([ -])?\d{3,4}([ -])?\d{4}'
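
As a minimal sketch of how these patterns might be applied, suppose text is a hypothetical string pulled from the website; the full logic lives on GitHub.

# try each country's pattern in turn and keep the first match
$for pattern in [usa, swiss, china, germany, uk]:
  match = re.search(pattern, text)
  if match:
    phoneNumber = match.group(0)
    break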

Beautiful Soup

The Beautiful Soup module was used to parse the given website's HTML. We use urllib3 and a GET request to recover the HTML, and Beautiful Soup organizes everything to make it easy to search by HTML tags.
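
A minimal sketch of that setup, assuming url holds the address typed by the user:

$import urllib3
$from bs4 import BeautifulSoup
# fetch the raw html with a GET request and hand it to beautiful soup
$http = urllib3.PoolManager()
$page = http.request('GET', url)
$soup = BeautifulSoup(page.data, 'lxml')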

Recovering and Searching Links from the Website

We begin by pulling out all of the <a> tags.

$links = soup.find_all('a')

For each link, we can search for pertinent substrings using the re module as described previously.

$match = re.search('mailto:', link.get('href'))
$if match:
  output = link.get('href')
  return output.replace("mailto:", "")

Note that the try and except flow controls are necessary in the actual code, because re.search may not always run successfully (for example, when a link has no href attribute). Because only some websites link the email address with the 'mailto:' substring, we must also apply a regex to the text from the website.

$emailPattern = '(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)'
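
A minimal sketch of applying it, assuming text holds a block of visible text from the page; the address itself sits in the second group.

$match = re.search(emailPattern, text)
$if match:
  email = match.group(2)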

Visible Text from the Website

The Beautiful Soup get_text function does not respect <br> tags within the HTML. We replace such tags with newlines so that contact information on separate lines does not blend together.

$output = str(page.data).replace('<br/>', '\n')

We proceed by removing many of the hidden tags from the soup of HTML.

$soup = BeautifulSoup(output, "lxml")

$[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]

Applying the get_text function to the entire Beautiful Soup variable results in phrases and text running together without proper spacing. To combat this, we consider only the <p> tags and apply get_text to each of the contents.

$contents = soup.find_all("p")
$outputText = [content.get_text() for content in contents]
# split up the list further based on new lines or tabs
# having various lines will make it easier to distinguish
# between important information and various words
$lister = []
$for phrase in outputText:
  lister.extend(re.split('[\n\t]+', phrase))

For example, we don't want foul cases like the one below.
1600 Pennsylvania Avenue NW
Washington, DC 20500
202-456-1111

$htmlstring = '<html><head></head><body><a>1600 Pennsylvania Avenue NW<br/>Washington, DC 20500<br/>202-456-1111</a></body></html>'
$soup = BeautifulSoup(htmlstring, 'lxml')
$soup.get_text()
:'1600 Pennsylvania Avenue NWWashington, DC 20500202-456-1111'

Contact Page

Many webpages include a dedicated contact information page. Our code performs the same search algorithm on the contact page, but only if a contact page exists and not all contact information was found on the original webpage.

Our regular expression matches contact pages in English, French, German, and Italian.

$match = re.search('(contact)|(kontakt)|(contatti)', link.get('href'))
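
A minimal sketch of that scan over the page's links follows; stringerContact mirrors the name used in the contact function below.

$for link in links:
  href = link.get('href')
  if href and re.search('(contact)|(kontakt)|(contatti)', href):
    stringerContact = href
    break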

We run into two major error cases when considering contact page URLs.
Case 0: The protocol of the URL is not given.
Case 1: Only the path of the contact page URL is given.

$def contact(stringer, soup):
# stringer is the original website's url
# soup is the beautiful soup variable resulting from the original website
...
  if stringerContact[:2]=='//': # case 0: the protocol is missing
    return 'http:'+stringerContact
  elif stringerContact[0]=='/': # case 1: only the path is given
    return stringer+stringerContact

We will not include any further information about the Flask or Passenger setup used to display this tutorial. Please visit the GitHub repository for the complete code and more information. Email any further questions to jverrette@gmail.com, and thank you for reading!