Operations Research and US Elections

The subtitle of this blog post should be something like

How to win US Presidential elections with less than 25% of the total eligible popular vote, using Operations Research with python PuLP library.

This is quite verbose, but it summarizes the idea of this blog post.
A couple of weeks ago, while looking for a linear programming (LP) library, I stumbled upon PuLP library, which is a free open source library written in Python with the goal of representing optimization problems as mathematical models. PuLP will act as the interface to several external LP solvers (CBC, GLPK, and a list with some more free/non-free solvers here)

Inspired by this video, and willing to learn more about how PuLP works, I decided to give it a shot at modeling the US elections in terms of Operations Research (OR). Admittedly, the video is a bit old, and the elections are long gone, but for learning the inners of PuLP is good enough.

The data that I've used is the one posted on Kaggle at this link. For metadata details check that link (I used the sqlite data format).
However, I only needed the following columns:

Column Description Type
area_name The name of the state in full format String
AGE295214 Persons under 18 years, percent, 2014 Numeric
PST045214 Population, 2014 estimate Numeric

Using the following SQL query, I transformed the data into a new table as shown below:

SELECT
  area_name,
  PST045214 AS total_population,
  round((100 - county_facts.AGE295214)*county_facts.PST045214*0.01 + 0.5, 0) AS voting_rights,
  round((100 - county_facts.AGE295214)*county_facts.PST045214*0.01*0.5 + 0.5, 0) AS half_voting_rights
FROM
  county_facts
WHERE
  state_abbreviation = '';
Column Description Type
area_name The name of the state in full format String
total_population Population, 2014 estimate Numeric
voting_rights Total population with voting rights
(as estimate from total_population column)
Numeric
half_voting Half of the total population with voting rights Numeric

As you can see, I have already rounded up the half_voting column to make sure that there will be a majority of electorate represented.

Some more data pre-processing. Just for ease of querying, I've created the following schema (the dataset in the format below can be downloaded from the gitHub repo link at the bottom of this page):

-- this table will contain information extracted from Kaggle site

CREATE TABLE "electorate_stats" (
  area_name TEXT PRIMARY KEY,
  total_population INTEGER,
  voting_rights INTEGER,
  half_voting_rights INTEGER
)

-- this table will contain the full name of each state
-- and their respective number of electoral votes
CREATE TABLE "electorate_votes" (
  area_name TEXT PRIMARY KEY ,
  votes INTEGER
)

-- this table will contain the full name of each state
-- their respective number of electoral votes
-- and the votes to population ratio
CREATE TABLE "voting" (
  area_name TEXT PRIMARY KEY ,
  votes INTEGER,
  proportion FLOAT
)

Let's take a real example of how values of voting.proportion column were calculated. This column represents the ratio of electoral votes to population, with 538 being the total electoral votes.

E.g. let's take the following states as examples:

State Total population Voting rights Half voting rights Electorate votes Proportions
Vermont 626562 5005009 252505 3 1408.020446096654
Texas 26956958 19840322 9920161 38 700680.5167286246

To calculate the votes to population ratio we do something like this:
half_voting_rights * electorate_votes / 538 and here is the SQL query that did that:

select
  electorate_votes.area_name,
  electorate_votes.votes,
  electorate_stats.half_voting_rights*electorate_votes.votes/538.0 as proportions
FROM electorate_votes
  JOIN electorate_stats ON electorate_votes.area_name = electorate_stats.area_name
GROUP BY electorate_votes.area_name

As one can see, in order to gain one electoral vote in Vermont, we need roughly 1,400 people, while in Texas we need a bit over 700,000. So if we order this ascending we see that smaller states in terms of population, somehow have a higher impact on the final outcome.

Now that we have the data let's establish a common ground:

  1. All population eligible for vote will vote (this is not realistic, but it simplifies our model)
  2. To win a state we need half + 1 votes of all the eligible population in that state (this doesn't hold true for Maine and Nebraska, but we'll try to deal with it)
  3. To win the presidency we need a minimum of 270 electoral votes out of 538.
  4. The data is an estimate of the population for the year 2014 (so, the result may be a bit different for other year's census)
  5. We have only two candidates represented by blue and red respectively.(most of the time this is true, however when there is a 3rd or 4th candidate, they rarely got any electoral votes)
  6. The column half_voting already represents the majority number of votes.
  7. We assume that there are no faithless electors. (although this is not realistic either, since from time to time there are such cases, however historically, they never influenced the final result)

Now let's think on how to approach this. I'll present below three methods (actually two) using bin packing and knapsack approaches.

1. Bin packing

We have a minimization problem whereby we need to fit uni-dimensional objects into as few bins possible given the fact that the sizes of the bins is known. In our case, we already know that we will be using two bins with sizes of 268 and 270. So we have to see which states we fit in each bin. Word of advice, depending on the solver, you may get different allocations. Nevertheless, all solvers will try to satisfy all constraints (if there is a feasible solution).

Assuming we already have the data in a variable called data as dataframe with two columns cols=['state', 'votes']

1.1. Defining the problem

bins = [268, 270]
binsCount = len(bins)

problem = LpProblem("US Elections 2020", sense=LpMinimize)

1.2. All possible assignment - these are variables which represent a state that can go in one of the two bins (bin 0 or bin 1).

assignments = {
	(state["state"], binNum): pulp.LpVariable(f'({state}, {binNum})', cat=pulp.LpBinary)
	for state in data
	for binNum in range(binsCount)
}

The format is a dictionary that looks like this:

:
('California', 0): ({'state':_'California',_'votes':_55},_0),
('California', 1): ({'state':_'California',_'votes':_55},_1)
:
('Tennessee', 0): ({'state':_'Tennessee',_'votes':_11},_0),
('Tennessee', 1): ({'state':_'Tennessee',_'votes':_11},_1)
:

1.3. Defining the bins as variables

the_bins = pulp.LpVariable.dicts("Bin used", range(binsCount), cat=pulp.LpBinary)

The format of the bins is a dictionary:

{0: Bin_used_0, 1: Bin_used_1}

1.4. Defining the objective

problem += lpSum(the_bins[i] for i in range(binsCount)), "Objective: allocate state votes in bins"

1.5. Defining the constraints - we have two constraints

"""
The electoral votes of a state can only go into one of the two bins
"""
for state in data:
	problem += pulp.lpSum([assignments[(state['state'], i)] for i in range(binsCount)]) == 1, f"{state} in one bin"

"""
The sum of all electoral votes in one bin cannot be bigger than the size of the bin
"""
for i in range(binsCount):
	problem += lpSum(
		[state['votes'] * assignments[(state['state'], i)] for state in data]
	) <= bins[i] * the_bins[i], f'take all {i}'

1.6. Solving - you may get different allocations when using different solvers.

solver = pulp.COIN_CMD()
problem.solve(solver)

This is the last step and the result looks something like this. A quick calculation will show that the red team won with 23.24% of the total population eligible to vote. But can we do better? (I mean can we have a lower % and still win)

Keep in mind that with this approach we have not used at all the vote to population ratios that we calculated. Let's see next if we change the problem into a maximization problem for the blue team using knapsack.

2. Knapsack 1

All states except Maine and Nebraska have a winner takes all approach. However, to simplify the problem , let's give these two states to the blue team. Which means that the team blue will get automatically 9 votes from these two states and we need to gather the rest of the 259 votes from the other states in a total of 268 votes.

Now we are using knapsack approach and we will want to maximize the number of weighted popular votes we put into our bin which has a maximum number of electoral votes of 259 (we automatically get 9 from Maine + Nebraska, which in total will result in 268 electoral votes). So think of the people to votes ratio as weights and the electoral votes as the objects that need to go into the bin.

Our data is in three separate lists representing the state names (state), electoral votes values (votes) and their proportion respectively (proportions).

statevotesproportions
Alabama931313.6431226765
Arizona1152231.57806691449
California551515317.06319702

state = [Alabama, Alaska, Arizona, ... Wyoming]
votes = [9, 3, 11, ... 3]
proportions = [31313.6431226765, 1534.405204460966, ... 1242.68587360594]

2.1. Defining the problem

problem = LpProblem("US Elections 2020", LpMaximize)

2.2. Defining the states as variables

states = pulp.LpVariable.dicts("State", state, cat=pulp.LpBinary)

2.3. Defining the objective

problem += lpSum(s * p for s, p in zip(states.values(), proportions)), 'Objective'

Summing up the weighted electoral votes.

2.4. Defining the constraints

problem += lpSum(s * v for s, v in zip(states.values(), votes)) <= constraint, 'Sum of electoral votes'

The sum of weighted electoral votes should be less than the constraint which is this case is 259.

And just as the previous approach, next step is to solve the problem. Remember that this problem will return the result (a list of states) for the blue team. In order to find the result for the red team, we need to extract the blue team list from the total list of states.

And the result will look something as below. Calculating the % of the popular vote for the red team, will give a 22.28% out of the total population with voting rights. This is a small improvement from the previous approach.

Is it possible to do better?

3. Knapsack 2

The final model is the same as the above, with the only difference that we will assume that all states have a winner take all policy. Maine and Nebraska will not be allocated to the blue team anymore. Instead the solver will place them where it finds fit.

The steps are the same as above, with the only difference that the variable constraint is 268 instead of 259.

And the result is as below, resulting in a win for the red team with 270 votes at only 21.78% of the total population with the right to vote.

I wonder if there is a better solution and/or a better approach.

NOTE: you need to have installed an external solver in order to be able to run this application (CBC, GLPK, and a list with some more free/non-free solvers here). Installing any of those solvers is out of the scope for this blog post.

The application above is based on Flask. And the results will be shown as a list of states that fit the optimization problem.

Here is the source code of the entire analysis. ENJOY!!!