Guided Project: Winning Jeopardy

Posted on Wed 08 July 2015 in Projects

In [1]:
import pandas

# Load the dataset of Jeopardy questions
jeopardy = pandas.read_csv("jeopardy.csv")

jeopardy
Out[1]:
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams
... ... ... ... ... ... ... ...
19994 3582 2000-03-14 Jeopardy! U.S. GEOGRAPHY $200 Of 8, 12 or 18, the number of U.S. states that... 18
19995 3582 2000-03-14 Jeopardy! POP MUSIC PAIRINGS $200 ...& the New Power Generation Prince
19996 3582 2000-03-14 Jeopardy! HISTORIC PEOPLE $200 In 1589 he was appointed professor of mathemat... Galileo
19997 3582 2000-03-14 Jeopardy! 1998 QUOTATIONS $200 Before the grand jury she said, "I'm really so... Monica Lewinsky
19998 3582 2000-03-14 Jeopardy! LLAMA-RAMA $200 Llamas are the heftiest South American members... Camels

19999 rows × 7 columns

In [2]:
jeopardy.columns
Out[2]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
In [3]:
# Strip the stray leading spaces from the column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
In [4]:
import re

def normalize_text(text):
    # Lowercase and strip punctuation so questions and answers compare cleanly
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    # Strip the dollar sign and commas from values like "$1,000"
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        # Non-numeric values such as "None" become 0
        text = 0
    return text
In [5]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
In [6]:
jeopardy
Out[6]:
Show Number Air Date Round Category Value Question Answer clean_question clean_answer clean_value
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds 200
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams 200
... ... ... ... ... ... ... ... ... ... ...
19994 3582 2000-03-14 Jeopardy! U.S. GEOGRAPHY $200 Of 8, 12 or 18, the number of U.S. states that... 18 of 8 12 or 18 the number of us states that tou... 18 200
19995 3582 2000-03-14 Jeopardy! POP MUSIC PAIRINGS $200 ...& the New Power Generation Prince the new power generation prince 200
19996 3582 2000-03-14 Jeopardy! HISTORIC PEOPLE $200 In 1589 he was appointed professor of mathemat... Galileo in 1589 he was appointed professor of mathemat... galileo 200
19997 3582 2000-03-14 Jeopardy! 1998 QUOTATIONS $200 Before the grand jury she said, "I'm really so... Monica Lewinsky before the grand jury she said im really sorry... monica lewinsky 200
19998 3582 2000-03-14 Jeopardy! LLAMA-RAMA $200 Llamas are the heftiest South American members... Camels llamas are the heftiest south american members... camels 200

19999 rows × 10 columns

In [7]:
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])
In [8]:
jeopardy.dtypes
Out[8]:
Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
In [9]:
def count_matches(row):
    # Fraction of the answer's words that also appear in the question
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # "the" is too common to signal a real match
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
In [10]:
jeopardy["answer_in_question"].mean()
Out[10]:
0.059001965249777744

Recycled questions

The answer only appears in the question about 6% of the time. That isn't a huge number, and it means we probably can't just hope that hearing a question will let us figure out the answer. We'll probably have to study.

In [11]:
question_overlap = []
terms_used = set()

# Order questions chronologically so overlap means reuse of earlier terms
jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Ignore short words like "the", which carry little meaning
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    terms_used.update(split_question)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()
Out[11]:
0.6876260592169776

Low value vs high value questions

There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it compares single terms rather than phrases, so it's far from conclusive on its own, but it does suggest that the recycling of questions is worth looking into further. A rough phrase-level check is sketched below.
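
As one way to move beyond single terms, here is a sketch (hypothetical, not run as part of this notebook) that repeats the overlap loop over two-word phrases. It assumes the jeopardy DataFrame built above, already sorted by air date:

def bigrams(text):
    # Pair each word with its successor, e.g. "new power generation"
    # -> {("new", "power"), ("power", "generation")}
    words = text.split(" ")
    return set(zip(words, words[1:]))

phrases_used = set()
phrase_overlap = []
for i, row in jeopardy.iterrows():
    phrases = bigrams(row["clean_question"])
    match_count = sum(1 for phrase in phrases if phrase in phrases_used)
    phrases_used.update(phrases)
    phrase_overlap.append(match_count / len(phrases) if phrases else 0)

jeopardy["phrase_overlap"] = phrase_overlap

A much lower mean for phrase_overlap than for question_overlap would indicate that single-term overlap overstates how often whole questions are recycled.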

In [12]:
def determine_value(row):
    # Treat questions worth more than $800 as high value
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
In [13]:
def count_usage(term):
    # Count how many high-value and low-value questions contain the term
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
In [14]:
from random import choice

# Sample 10 random terms to compare high-value vs low-value usage
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected
Out[14]:
[(0, 1),
 (1, 0),
 (0, 1),
 (1, 0),
 (1, 0),
 (0, 2),
 (3, 8),
 (1, 0),
 (0, 1),
 (0, 1)]
In [15]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    # If a term were unrelated to value, its occurrences would split
    # between high and low value in proportion to the overall row counts
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared
Out[15]:
[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.01052283698924083, pvalue=0.9182956181393399),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the observed frequencies were all lower than 5, so the chi-squared test isn't really valid here; its approximation only holds up when expected counts are reasonably large. It would be better to run this test with only terms that have higher frequencies, as sketched below.
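
As a sketch of that follow-up (assuming the jeopardy DataFrame and the count_usage function defined above), we could tally how many questions each term appears in and only sample terms that occur at least 5 times:

from collections import Counter
from random import sample

# Count how many questions each term appears in, using the same
# length filter (more than 5 characters) as question_overlap above
term_counts = Counter()
for question in jeopardy["clean_question"]:
    for word in set(question.split(" ")):
        if len(word) > 5:
            term_counts[word] += 1

# Keep terms that occur in at least 5 questions, so the expected
# frequencies in the chi-squared test are large enough to trust
frequent_terms = [term for term, count in term_counts.items() if count >= 5]
comparison_terms = sample(frequent_terms, 10)

observed_expected = [count_usage(term) for term in comparison_terms]

The chi-squared loop above could then be rerun on this new observed_expected list.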