Evaluating Web Scraping Yield in a Job Search

I’m currently looking for a role in sales operations, preferably at a fast-growing, technology-focused company. Since I’m so focused on business operations, I figured I might as well use my sales-ops skills to drive my search. I documented the process in this article.

OK. Now that we have a bunch of results pulled from the web, let’s try to make sense of what we’re looking at. This will be a speed-round of data analysis.

# Number packages
import numpy as np
import scipy
import pandas as pd

# Graphics packages
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Tell graphs to look nice
sns.set()  # seaborn's default styling (assumed; the body of this cell was not shown)

Load up the data set.

df = pd.read_csv('./data/prospect_proforma_phase_3.csv')
df.head()

   Job                                                Interest?  Company                        Dupe  Employees  Source         Link
0  Boston Job Fair - February 5 - LIVE HIRING EVE...         0  Coast-to-Coast Career Fairs     NaN        NaN  CareerBuilder  http://www.careerbuilder.com/job/JJJ66R684VZMQ...
1  Salesforce Developer                                       0  CyberCoders                     NaN        NaN  CareerBuilder  http://www.careerbuilder.com/job/J3S2D771LCK7P...
2  Director of Operations                                     1  GPAC                            0.0     51-200  CareerBuilder  http://www.careerbuilder.com/job/J3R2ZX72NS024...
3  Controller                                                 0  Kforce Finance and Accounting   NaN        NaN  CareerBuilder  http://www.careerbuilder.com/job/J3V5S05W2QB68...
4  Revenue Finance Manager (Software Co)                      0  Kforce Finance and Accounting   NaN        NaN  CareerBuilder  http://www.careerbuilder.com/job/J3N0YY6LPBC5G...
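Before diving into the questions, a quick sanity check never hurts. The sketch below uses only the columns visible in the preview above; the exact counts it prints depend on the full data set:

# Quick sanity check on the scraped data (illustrative)
print('Total postings scraped:', len(df))
print('Flagged as interesting:', int(df['Interest?'].sum()))
print('Distinct sources:', df['Source'].nunique())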

Q1. Where did most of the job postings come from?

source_counts = df['Source'].value_counts()

plt.figure(figsize=(7, 7))
plt.pie(source_counts, labels=source_counts.index)
plt.title('Job Posting Distribution by Source')


Comments: I was surprised by how many jobs came from Dice. We will need to see what makes it through the next few filters.

Q2. What were the most productive job sources?

source_counts_kept = df[(df['Interest?'] == 1) & (df['Dupe'] == 0)]['Source'].value_counts()
source_yield = (source_counts_kept / source_counts * 100.).sort_values(ascending=False)
source_yield

CareerBuilder    30.000000
Monster          14.241486
GlassDoor        11.486486
Google Jobs       9.090909
LinkUp            8.866995
Indeed            4.687500
Dice.com          0.836820
Name: Source, dtype: float64

Comments: LinkUp was actually more productive than originally expected. GlassDoor was WAY more productive than expected. Hat tip to those guys and gals for making some great search software. Dice not only returned bad results, but had plenty of duplicate postings too. Monster did a good job of both producing good results and not presenting duplicates.
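To put a number on that duplicate problem, here is a quick sketch of the duplicate rate per source. It reuses source_counts from Q1 and assumes Dupe == 1 marks a duplicate posting (the preview above only shows 0.0 and NaN, so treat this as illustrative):

# Rough duplicate rate per source, as a percentage of all postings
# scraped from that source (assumes Dupe == 1 flags a duplicate)
dupe_counts = df[df['Dupe'] == 1]['Source'].value_counts()
dupe_rate = (dupe_counts / source_counts * 100.).sort_values(ascending=False)
dupe_rate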

Q3. What is the company size distribution for SAL (sales accepted lead) job postings?

employee_mapping = {'1-10': 1, '11-50': 2, '51-200': 3,
                    '201-500': 4, '501-1000': 5, '1001-5000': 6,
                    '5001-10,000': 7, '10,000+': 8}

# Bucket employee ranges into ordinals, but only for accepted postings
# (interesting and not a duplicate); everything else stays NaN
df['FUNNEL_EMPLOYEES_INT'] = df[(df['Interest?'] == 1) & (df['Dupe'] == 0)]['Employees'].map(employee_mapping)

def get_count_by_company_size(company_size_int):
    # Count accepted postings that fall into the given employee-size bucket
    return df[df['FUNNEL_EMPLOYEES_INT'] == company_size_int]['FUNNEL_EMPLOYEES_INT'].count()

employee_sizes = []
for i in range(1, 9):
    employee_sizes.append((i, get_count_by_company_size(i)))
employee_sizes

[(1, 4), (2, 4), (3, 9), (4, 16), (5, 14), (6, 18), (7, 13), (8, 5)]

The data looks good, so let’s plot it.

plt.figure(figsize=(10, 6))
# The plotting call itself was missing; a bar chart of (bucket, count) pairs is assumed
plt.bar([size for size, count in employee_sizes], [count for size, count in employee_sizes])
plt.xlim(0.5, 9.0)
plt.xticks(range(1, 9), sorted(employee_mapping, key=employee_mapping.get), rotation='vertical')
plt.title('Accepted Job Posting Count by Company Employee Size')



Comments: This is absolutely AWESOME. This is EXACTLY what I hoped to get from my high-velocity search process. The employee size is spot on, the distribution skews toward larger companies (per part 3 of this series), and the quantity of companies in the optimal range is higher than expected. This made my day.
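As a quick check on that “optimal range” claim, here is a sketch that computes the share of accepted postings at mid-size companies. The range itself (201–5000 employees, ordinal buckets 4 through 6) is my assumption here; part 3 of this series defines the actual target:

# Share of accepted postings in the assumed optimal range of
# 201-5000 employees (buckets 4-6 in employee_mapping)
in_range = df['FUNNEL_EMPLOYEES_INT'].between(4, 6)
accepted = df['FUNNEL_EMPLOYEES_INT'].notna()
print('%d of %d accepted postings (%.0f%%) are in the assumed optimal range'
      % (in_range.sum(), accepted.sum(), 100. * in_range.sum() / accepted.sum()))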

What’s next?

Now, I need to submit my resume to over 100 different job opportunities while at the same time managing my existing funnel of interviews. In all honesty, that’s the easy part!

I will say: looking for a perfect job is hard. Hopefully this series of posts was helpful in outlining my thought process, and maybe you were able to glean a few lessons for your own search.

My deepest thanks go to the following people for helping with this job search and data science series:

  • Ryan Plunkett for competitively introducing me to Seaborn and Jupyter.
  • Mike Redbord for encouraging me to write all this down.
  • Dezrah Blinn for his infectious positivity in the job search, even as he changes careers himself.