by Samip Karki and Alexander Matoushek
Our project is inspired by the paper "The Sticker Collector's Problem" by Diniz et al. (https://www.tandfonline.com/doi/abs/10.4169/college.math.j.47.4.255)
In this paper, the authors investigate the following problem:
If there are 649 distinct stickers of players in the Fifa World Cup and a single package has 5 distinct stickers, what is the average number of packages you would need to buy to get the full collection of 649 distinct stickers?
Let M be the number of distinct stickers in the whole collection.
Let n be the number of stickers in one package.
The first thing we needed to do was import all of the necessary packages for our functions.
import random
import numpy as np
import matplotlib.pyplot as plt
import math
import time
from scipy import stats
Before creating a function that returns the desired number of packs needed to complete the set, we first wanted to try our method on a much smaller example than M = 649 and n = 5. We chose M = 5 and n = 2, and printed each step of the stickers being added to our list to make sure it was working properly. We labeled the stickers 0 through M-1 rather than 1 through M, because this was easier with Python's zero-based indexing.
distinctStickers = 5
packSize = 2
myStickers = [0] * distinctStickers
k = 1
# slot 0 already holds its own sticker value (0), so the set is treated as
# complete once only that slot still reads 0
while myStickers.count(0) != 1:
    package = random.sample(range(0, distinctStickers), packSize)
    print(myStickers)
    for i in package:
        if i not in myStickers:
            myStickers[i] = i
    k += 1
print(myStickers)
print(k)
[0, 0, 0, 0, 0]
[0, 0, 0, 3, 4]
[0, 0, 0, 3, 4]
[0, 0, 0, 3, 4]
[0, 1, 0, 3, 4]
[0, 1, 2, 3, 4]
6
The method tried above behaves as expected on our given random sample of sticker packages. (Note that k starts at 1 here, so the printed 6 overcounts the 5 packs actually opened; the function below initializes k = 0 so that it returns exactly the number of packs.) We should be able to easily implement this method as a function.
#input M and n, then returns the number of packages that completed the set
def stickers(distinctStickers, packSize):
    myStickers = [0] * distinctStickers
    k = 0
    while myStickers.count(0) != 1:
        package = random.sample(range(0, distinctStickers), packSize)
        for i in package:
            if i not in myStickers:
                myStickers[i] = i
        k += 1
    return k
Let's test the function above to make sure its estimate of how many packs the sticker collection requires is roughly in the right ballpark. The analytical solution to this problem given in the paper is about 913 packages. The number from our function won't be exactly 913 every time, but if we run this simulation many times and average the results, we should get a number very close to 913.
stickers(649, 5)
935
Our function told us that in this particular random simulation you would need 935 packs to get the complete set, which is close to the analytical mean. This suggests our function is giving an appropriate approximation. It will let us build large lists of simulations for various pack sizes, take the means of those lists to estimate the packs required, and analyze some other features of the lists.
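For reference, the analytical expectation can also be computed directly by inclusion-exclusion over the sets of missing stickers. The sketch below is our own derivation, not necessarily the exact formula used in the paper, and `expected_packs` is a name we chose. The key fact is that the probability a fixed set of j stickers is entirely absent from one pack of n distinct stickers is C(M-j, n)/C(M, n), and summing the tail probabilities of the completion time gives the expectation.

```python
from fractions import Fraction
from math import comb

def expected_packs(M, n):
    # E[T] = sum_{j=1}^{M} (-1)^(j+1) * C(M, j) / (1 - p_j),
    # where p_j = C(M - j, n) / C(M, n) is the probability that a fixed
    # set of j stickers is entirely absent from a single pack
    total = Fraction(0)
    for j in range(1, M + 1):
        p_j = Fraction(comb(M - j, n), comb(M, n))
        total += (-1) ** (j + 1) * comb(M, j) / (1 - p_j)
    return total
```

For M = 5 and n = 1 this reproduces the classic coupon-collector value 5·(1 + 1/2 + 1/3 + 1/4 + 1/5) = 137/12 ≈ 11.42, and for n = M it returns exactly 1. Evaluating it exactly at M = 649 is slow because the binomial coefficients involved are enormous, which is part of what makes a Monte Carlo approach attractive.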
A small additional thing we wanted to look into was how the runtime of our simulations changes with the number of stickers per pack. To do this, we recorded the time immediately before and immediately after creating each list of simulations, and saved the difference as the runtime. Building each list inside its own timing code keeps the measurement accurate, which is also why we did not wrap this step in a function.
Now, we're ready to start analyzing our various lists of packs for different pack sizes, and start looking for trends!
We started by creating our first list, where pack size is 5.
startTime = time.time()
stickerList = [stickers(649,5) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime = endTime - startTime
print(runTime)
423.47702288627625
Note that the runtime for this list was about 423 seconds. We can use the built-in sum function to find the mean of the list, which is our sample mean number of packs required to complete the set at this pack size. This is what should give us an accurate approximation of the true mean.
listMean = sum(stickerList)/10000
print(listMean)
916.1791
We obtained a sample mean of 916.18, which is very close to the true mean of 913.1. If we ran even more than 10,000 simulations, we would get an even more accurate approximation, but this would take too much time given our limitations. Let's make a histogram of our 10,000 simulations to see what sort of distribution it follows.
plt.hist(stickerList, bins = 50)
plt.xlim(0,2000)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
The resulting histogram is fairly normal, with a slight right skew. It is reassuring that the data is centered where it should be, in the 800-900 range.
Making a list of proportions
The next thing we wanted to implement was a list of proportions, specifically the proportion of the simulations that completed the set within each given number of packs. This list lets us find the probability of completing the set with any given number of packs bought, anywhere from the smallest number of packs any trial took to the largest. To figure out this function, we again started with a very small example list.
test = np.array([1,1,2,3,4])
props = np.zeros((5,2))
start = 0
stop = 5
count = 0  # running total (avoids shadowing the built-in sum used elsewhere)
for i in range(start,stop):
    props[i,0] = i
    count = count + np.count_nonzero(test == i)
    props[i,1] = count/(stop-start)
props
array([[0. , 0. ],
       [1. , 0.4],
       [2. , 0.6],
       [3. , 0.8],
       [4. , 1. ]])
The array in the output above shows that the probability of completing the set in 1 pack in our example is 0.4, which is correct because 2/5 numbers in our list are 1s. The probability is also 1 after 4 packs, because if you buy 4 packs, you will have completed the set in all 5 trials in our example. This method is working, so it is now time to implement this as a function.
#input a list of many simulations, output is coordinates for a probability graph
def get_proportions(stickerList):
    stickerList = np.sort(stickerList)
    start = stickerList.min()
    stop = stickerList.max()
    # note: range(start, stop) stops one short of the maximum value,
    # so the row for the largest observed pack count is never filled in
    props = np.zeros((stop, 2))
    count = 0
    for i in range(start, stop):
        props[i,0] = i
        count = count + np.count_nonzero(stickerList == i)
        props[i,1] = count/len(stickerList)
    return props
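As an aside, the same cumulative count can be computed without an explicit loop using NumPy. This is a sketch of our own (the name `get_proportions_np` is ours); it returns one row per pack count from the minimum to the maximum observed, including the maximum:

```python
import numpy as np

def get_proportions_np(sticker_list):
    arr = np.sort(np.asarray(sticker_list))
    xs = np.arange(arr.min(), arr.max() + 1)  # include the maximum value
    # empirical CDF: fraction of simulations finished within x packs
    probs = np.searchsorted(arr, xs, side="right") / arr.size
    return np.column_stack((xs, probs))
```

On the small example list [1, 1, 2, 3, 4] this reproduces the probabilities 0.4, 0.6, 0.8, 1.0 for 1 through 4 packs.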
Let's make a slightly larger test list using our sticker function and plot the proportion list to see if it is working properly. Based on the findings in the article, this plot should show an asymptotic trend where the probability is 0 for a while and then, somewhere near the mean, shoots up rapidly toward 1.
test_list = [stickers(100,5) for i in range(10000)]
test_plot=get_proportions(test_list)
plt.plot(test_plot[:,0],test_plot[:,1]);
Our plot for a sticker amount of 100 and a pack size of 5 does indeed show the relationship we expect: the probability of a full set is 0 until about 60 packs bought, and by about 160 packs it is essentially 1. Note that in these probability plots, the point where the probability is 0.5 (roughly 100 here) is the sample median, which for a distribution this close to symmetric lies near the sample mean: it is the number of packs at which you complete the set about half the time. Let's test our proportion function with our newly formed stickerList for M = 649 and n = 5.
Testing that get_proportions works correctly
test_plot = get_proportions(stickerList)
plt.plot(test_plot[:,0],test_plot[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
We can again see the asymptotic relationship here, and we can also see that the point where probability = 0.5 is around 900 like it should be. We can have full confidence that our probability function is working properly.
We also wanted to find the standard error of each of our lists, to gauge how accurately they predict the true population mean. Standard error differs from standard deviation in that rather than describing the spread of individual values, it measures how far a sample mean is likely to be from the population mean. To compute it, we can use the stats module from SciPy.
error1 = stats.sem(stickerList)
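The standard error lets us put a rough 95% confidence interval around the sample mean. Using the normal approximation (mean ± 1.96·SE) and the values recorded above for M = 649, n = 5:

```python
mean = 916.1791   # sample mean from our 10,000 simulations
sem = 1.66933     # standard error from stats.sem(stickerList)
ci = (mean - 1.96 * sem, mean + 1.96 * sem)
print(ci)  # roughly (912.9, 919.5)
```

The paper's analytical value of about 913.1 falls inside this interval, which is reassuring.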
Knowing that our proportion function can tell us the probability of completing the set for any given number of packs bought, we wanted a way to find the number of packs that gives a probability of at least 0.99. The 916 we found earlier was the mean, but buying that many packs only gives you about a 50% chance of completing the set, which isn't very high. To be nearly certain of completing the set you'd need to buy more, but how many more? To answer this, we implemented a function that finds the first pack count in the proportion list whose probability is at least 0.99.
def find99(props):
    for i in range(len(props)):
        if props[i,1] >= 0.99:
            return i
find99(test_plot)
1448
You need 1,448 packs to be 99% sure of completing the set with 5 stickers per pack, which makes sense given where the plot we made appears to be approaching 1.
Now that we have confirmed the findings of Diniz, Lopes, Polpo, and Salasar, we wanted to explore some other numbers of stickers per pack, and analyze the trends as we change n. We chose several of the pack sizes used in the article, plus 50, which we checked instead of the article's 61 and 62 because we preferred a rounder number.
startTime = time.time()
stickerList2 = [stickers(649,1) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime2 = endTime - startTime
print(runTime2)
841.7757334709167
The runtime for packs of size 1 was far higher than for pack size 5, taking over 14 minutes to complete. This makes sense: a pack size of 1 gives very high variability, and it can take thousands of packs to complete the set even when only a few stickers are missing.
listMean2 = sum(stickerList2)/10000
print(listMean2)
4585.2128
As mentioned above, the average number of packs needed for pack size 1 is over 4,500, which is about 5 times larger than the number needed for packs of size 5.
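For pack size 1 there is a simple closed form to check against: the classic coupon-collector expectation M·(1 + 1/2 + ... + 1/M). This is our own sanity check, not a figure from the paper:

```python
M = 649
# classic coupon collector: expected single-sticker draws to see all M stickers
expected = M * sum(1 / i for i in range(1, M + 1))
print(expected)  # about 4577.7
```

This lands within one standard error of our simulated 4585.2, so the n = 1 simulation behaves as theory predicts.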
plt.hist(stickerList2, bins = 50)
plt.xlim(0,12000)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
The associated histogram shows the same roughly normal shape, again with a slight right skew. Interestingly, this histogram is much wider: most of the data falls within a range of about 3,000-7,000, while for pack size 5 it fell within roughly 500-1,500.
error2 = stats.sem(stickerList2)
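As a side check of our own (not from the paper), the classic n = 1 collector has asymptotic variance (π²/6)·M², so its standard deviation should be about π·M/√6. Recovering our sample standard deviation from the standard error as SE·√10000:

```python
import math

M = 649
theory_std = math.pi * M / math.sqrt(6)   # asymptotic std of the n = 1 collector
sample_std = 8.35813 * math.sqrt(10000)   # std = SE * sqrt(N), using the SE above
print(theory_std, sample_std)  # about 832.4 vs 835.8
```

The two agree to within about half a percent, another sign the simulation is behaving as theory predicts.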
test_plot2 = get_proportions(stickerList2)
plt.plot(test_plot2[:,0],test_plot2[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
All of the probability plots will show the same trend, but we will eventually overlay all of them to compare. As mentioned above, this plot definitely appears stretched out more, and the transition region is wider than for pack size 5.
find99(test_plot2)
7219
You need 7,219 packs to be 99% sure of completing the set with 1 sticker per pack.
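This number also agrees with a standard asymptotic estimate for the n = 1 collector: the probability of not being done after t packs is at most M·e^(−t/M), and setting that bound to 0.01 gives t ≈ M·ln(100·M). This is our own back-of-the-envelope check, not a value from the paper:

```python
import math

M = 649
t99 = M * math.log(M / 0.01)  # = M * (ln M + ln 100)
print(t99)  # about 7191
```

That is within about 0.4% of the 7,219 we got from the simulation.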
startTime = time.time()
stickerList3 = [stickers(649,10) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime3 = endTime - startTime
print(runTime3)
399.7271192073822
The runtime here is slightly less than for pack size 5, which is what we would expect.
listMean3 = sum(stickerList3)/10000
print(listMean3)
455.4349
The average number of packs needed that we got was 455.4, which is very close to the exact answer of 455.0 the authors of the article got.
plt.hist(stickerList3, bins = 50)
plt.xlim(0,1500)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
We see another similar histogram, where the data is roughly normal with a slight right skew. This also appears to confirm what we theorized earlier: the distribution narrows as the pack size increases. With more stickers per pack, each pack fills a larger, guaranteed-distinct chunk of the collection, so the number of packs needed varies less from run to run.
error3 = stats.sem(stickerList3)
test_plot3 = get_proportions(stickerList3)
plt.plot(test_plot3[:,0],test_plot3[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
Our probability plot again looks similar to what we've already seen.
find99(test_plot3)
709
You need 709 packs to be 99% sure of completing the set with 10 stickers per pack.
startTime = time.time()
stickerList4 = [stickers(649,50) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime4 = endTime - startTime
print(runTime4)
308.27709674835205
The runtime continues to decrease as we increase the number of stickers per pack, likely because each simulation needs fewer and fewer packs to complete the set.
listMean4 = sum(stickerList4)/10000
print(listMean4)
88.3635
For packs of size 50, on average you need 88.4 packs to complete the set. The article did not check pack size 50 so we do not have a reference for this sample mean.
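Since the article has no entry for n = 50, a rough cross-check of our own (reasonable while n is much smaller than M) is to treat each pack as n independent single-sticker draws and divide the classic coupon-collector expectation by n:

```python
M, n = 649, 50
# crude approximation: (M * H_M) / n, ignoring that stickers within a pack
# are guaranteed distinct
approx = M * sum(1 / i for i in range(1, M + 1)) / n
print(approx)  # about 91.6
```

This is within roughly 4% of our simulated 88.4; it slightly overestimates because distinct stickers within a pack make each pack a bit better than n independent draws.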
plt.hist(stickerList4, bins = 30)
plt.xlim(0,300)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
The histogram looks the same yet again, with a narrow distribution around the mean and a slight right skew.
error4 = stats.sem(stickerList4)
test_plot4 = get_proportions(stickerList4)
plt.plot(test_plot4[:,0],test_plot4[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
All of the probability plots look essentially the same by themselves, with the exception of the scale. It will be interesting to compare them all to each other at the end.
find99(test_plot4)
138
You need 138 packs to be 99% sure of completing the set with 50 stickers per pack.
startTime = time.time()
stickerList5 = [stickers(649,100) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime5 = endTime - startTime
print(runTime5)
349.13524198532104
It took 349.135 seconds to do 10,000 simulations of M = 649, n = 100.
listMean5 = sum(stickerList5)/10000
print(listMean5)
42.5329
The sample mean for 100 stickers per pack is 42.53.
plt.hist(stickerList5, bins = 25)
plt.xlim(0,100)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
We can see that the spread of the histograms gets smaller as we increase n.
error5 = stats.sem(stickerList5)
test_plot5 = get_proportions(stickerList5)
plt.plot(test_plot5[:,0],test_plot5[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
Probability of completing the collection with x number of packages bought, with M = 649, n = 100.
find99(test_plot5)
66
You need 66 packs to be 99% sure of completing the set with 100 stickers per pack.
startTime = time.time()
stickerList6 = [stickers(649,500) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime6 = endTime - startTime
print(runTime6)
203.04116225242615
It took 203.041 seconds to run 10,000 simulations of M = 649, n = 500.
listMean6 = sum(stickerList6)/10000
print(listMean6)
5.3163
The sample mean of the 10,000 simulations is 5.316.
plt.hist(stickerList6, bins = 10)
plt.xlim(0,20)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
We see the same thing mentioned before: the spread of the observations gets smaller as n increases.
error6 = stats.sem(stickerList6)
test_plot6 = get_proportions(stickerList6)
plt.plot(test_plot6[:,0],test_plot6[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
Probability of completing the set for a given number of packs bought, with packs of size 500.
find99(test_plot6)
8
You need 8 packs to be 99% sure of completing the set with 500 stickers per pack.
n = 649 is a trivial case. Obviously, if a package contains all 649 distinct stickers, then a single package gives you the whole collection. The paper also looked at this case, so we will too, for completeness' sake.
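Because random.sample draws without replacement, a pack of size M is guaranteed to be a permutation of the entire collection, which we can verify directly:

```python
import random

pack = random.sample(range(649), 649)  # one pack containing M distinct stickers
assert sorted(pack) == list(range(649))  # every such pack is the complete set
print(len(set(pack)))  # 649
```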
startTime = time.time()
stickerList7 = [stickers(649,649) for i in range(10000)]
endTime = time.time()
#print("runtime:", endTime - startTime)
runTime7 = endTime - startTime
print(runTime7)
75.48822951316833
It took 75.488 seconds to do 10,000 simulations of n = 649.
listMean7 = sum(stickerList7)/10000
print(listMean7)
1.0
The sample mean is 1.0, as expected.
plt.hist(stickerList7, bins = 5)
plt.xlim(0,5)
plt.xlabel("number of packs bought to complete the set")
plt.ylabel("frequency");
All of the simulations completed the collection with a single pack, as expected.
error7 = stats.sem(stickerList7)
test_plot7 = get_proportions(stickerList7)
plt.plot(test_plot7[:,0],test_plot7[:,1])
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set");
print(find99(test_plot7))
None
You need 1 pack to be 99% (indeed 100%) sure of completing the set with 649 stickers per pack. find99 returned None here because get_proportions loops over range(start, stop), which excludes the maximum value of the list; since every simulation finished in exactly 1 pack, the row for 1 pack is never filled in and no recorded probability ever reaches 0.99.
plt.plot(test_plot2[:,0],test_plot2[:,1], label = "1")
plt.plot(test_plot[:,0],test_plot[:,1], label = "5")
plt.plot(test_plot3[:,0],test_plot3[:,1], label = "10")
plt.plot(test_plot4[:,0],test_plot4[:,1], label = "50")
plt.plot(test_plot5[:,0],test_plot5[:,1], label = "100")
plt.plot(test_plot6[:,0],test_plot6[:,1], label = "500")
plt.plot(test_plot7[:,0],test_plot7[:,1], label = "649")
plt.xlabel("packs bought")
plt.ylabel("probability of completing the full set")
plt.legend(loc="lower right", title = "pack size");
The plot above shows the probability of completing the collection for x number of packs bought, for various pack sizes.
As we have mentioned before, as n increases it becomes easier to complete the whole collection within a given number of packs bought. We can also see that as n increases, the curve makes a sharper transition between probability 0 and probability 1. This makes sense, because we saw from the histograms that the spread of the samples got smaller as n increased; a small spread in the histogram corresponds to a sharper transition between probability 0 and probability 1.
#list to serve as the x values for all the points plotted in the plots below
packSizeList = [1,5,10,50,100,500,649]
packsBoughtList = [listMean2, listMean, listMean3, listMean4, listMean5, listMean6, listMean7]
print(packsBoughtList)
[4585.2128, 916.1791, 455.4349, 88.3635, 42.5329, 5.3163, 1.0]
plt.plot(packSizeList, packsBoughtList)
plt.xlabel("stickers per pack")
plt.ylabel("average packs needed to complete the set");
Average packs needed to complete the set for a given number of stickers per pack.
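One way to quantify the trend in this plot (our own observation): if each pack behaved like n independent single-sticker draws, the product n × (mean packs needed) would be constant. Using the sample means recorded above:

```python
pack_sizes = [1, 5, 10, 50, 100, 500, 649]
means = [4585.2128, 916.1791, 455.4349, 88.3635, 42.5329, 5.3163, 1.0]
# if packs were n independent draws, these products would all be equal
products = [n * m for n, m in zip(pack_sizes, means)]
print(products)
```

The product stays within about 1% of 4,585 for n up to 10, then falls off, showing that the guarantee of distinct stickers within a pack becomes a real advantage as n approaches M.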
#list without 1 so trend is clearer (pack size of 1 so much higher than rest of data)
packSizeList2 = [5,10,50,100,500,649]
packsBoughtList2 = [listMean, listMean3, listMean4, listMean5, listMean6, listMean7]
print(packsBoughtList2)
[916.1791, 455.4349, 88.3635, 42.5329, 5.3163, 1.0]
#zoom in on section of plot where most of the change is happening
plt.plot(packSizeList2, packsBoughtList2)
plt.xlim(0,150)
plt.xlabel("stickers per pack")
plt.ylabel("average packs needed to complete the set");
Same plot as before, excluding the first point (pack size 1) to see the trend better.
runTimeList = [round(t, 5) for t in [runTime2, runTime, runTime3, runTime4, runTime5, runTime6, runTime7]]
print(runTimeList)
[841.77573, 423.47702, 399.72712, 308.2771, 349.13524, 203.04116, 75.48823]
The above list shows the associated runtimes for each successive n value we tested.
plt.plot(packSizeList, runTimeList)
plt.xlabel("stickers per pack")
plt.ylabel("runtime (seconds)");
Relationship of runtime and stickers per pack. Note there appears to be a slight outlier at the n = 50 point, where the runtime is 308.28 seconds when the overall trend suggests something around 370-380 seconds. This shows up visually as the dip in the plot at x = 50.
errorList = [round(e, 5) for e in [error2, error1, error3, error4, error5, error6, error7]]
print(errorList)
[8.35813, 1.66933, 0.81337, 0.1585, 0.07505, 0.00921, 0.0]
The above list shows the associated standard errors for each successive n value we tested.
plt.plot(packSizeList, errorList)
plt.xlabel("stickers per pack")
plt.ylabel("standard error");
#list without 1 so trend is clearer (pack size of 1 so much higher than rest of data)
errorList2 = [round(e, 5) for e in [error1, error3, error4, error5, error6, error7]]
print(errorList2)
[1.66933, 0.81337, 0.1585, 0.07505, 0.00921, 0.0]
#zoom in on section of plot where most of the change is happening
plt.plot(packSizeList2, errorList2)
plt.xlim(0,150)
plt.xlabel("stickers per pack")
plt.ylabel("standard error");
The larger the pack size, the more accurate our approximation of the true mean. This culminates with pack size 649, which, since packs cannot contain duplicate stickers, completes the set in exactly 1 pack every single time. Our sample means for the smaller pack sizes are a bit further from the true expected number of packs.
Our findings can be summarized with a single table (values collected from the runs above):

n      sample mean    std. error    packs for 99%    runtime (s)
1      4585.21        8.358         7219             841.78
5      916.18         1.669         1448             423.48
10     455.43         0.813         709              399.73
50     88.36          0.159         138              308.28
100    42.53          0.075         66               349.14
500    5.32           0.009         8                203.04
649    1.00           0.000         1                75.49
This can be compared to the table in the paper by Diniz et al
The table from the paper compared the average number of packs needed to complete the collection for various pack sizes. They included both an exact analytical answer and an answer from their Monte Carlo Simulation.
Comparing some of our values to theirs, we can confidently say we recreated their data quite well. For n = 1, 5, 10, 100, and 500, the value we obtained from our Monte Carlo simulation was within one standard error of the exact value.
If we had more time with this project, here are some other things we could investigate: