Table of Contents
- 0.1. problem formalization
- 0.2. deepfunding scoring
- 0.3. scoring for level 2
- 0.4. scoring for level 1 and level 3
- 0.5. most relevant resources
- 0.6. appendix: question generation script
- 0.7. appendix: sample questions
- 0.8. appendix: helper csv
0.1. problem formalization
(notation: to match the graph on the deepfunding website, all arrows are in the direction of dependency, i.e. \(P\to Q\) means \(P\) depends on \(Q\).)
We have a tree (well, a DAG) of depth exactly 2.
- Depth 0 is a single node, ethereum — call this \(O\).
- Depth 1 is 34 nodes, the "seed repositories" — \(A_1,\dots,A_{34}\).
- Depth 2 is the ~5000 code dependencies of the seed repositories, \(B_1,\dots,B_{4381}\), with a total of ~15000 edges of the form \((A_i,B_j)\).[1]
The task of a contestant is to provide:
- Level 1: weights \(w_{O\to A}\), denoted like `https://github.com/a,ethereum,0.2`, such that \(\sum_A w_{O\to A}=1\) [34 outputs]
- Level 2: self-weights \(w_{A}\), denoted like `https://github.com/a,originality,0.6` [34 outputs]
- Level 3: weights \(w_{A\to B}\), denoted like `https://github.com/b,https://github.com/a,0.6`, such that the sum over \(A\)'s dependencies is \(\sum_B w_{A\to B}=1\) (note that they add up to 1, not to \(1-w_{A}\)[2]) [~15000 outputs]
A juror is given random samples of:
- pairs of edges \(((O\to A_1),(O\to A_2))\), for which they give the "relative advantage of \(A_2\) over \(A_1\) to \(O\)", \(j_{(O\to A_1),(O\to A_2)}\) (taken[3] as measured in logits)
- depth-1 nodes \(A\), for which they directly estimate the originality score \(j_{A}\)
- pairs of edges \(((A\to B_1),(A\to B_2))\), for which they give the "relative advantage of \(B_2\) over \(B_1\) to \(A\)", \(j_{(A\to B_1),(A\to B_2)}\)[4]
To give values for all of these would take, where \(n_A\) is the number of dependencies of \(A\):
\[\frac{34\cdot(34-1)}{2}+34+\sum_{A}\frac{n_A(n_A-1)}{2}\] I ran a quick script to calculate this from the sample submissions (see “helper csv” below), and it comes out to 8,353,774. In particular this means we would need 8,353,774 events if we directly implemented a distillation market. So we need something smarter — ideally something that doesn’t require more than 15000 questions for the 15000 weights.
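Here is that count reproduced as a quick sketch (assuming the helper csv from the appendix below is saved as `helper.csv`, a hypothetical filename):

#+begin_src python
import pandas as pd

# Count every question a full elicitation would need, using the
# num_deps column (n_A) from the helper csv in the appendix below.
helper = pd.read_csv("helper.csv")
n_A = helper["num_deps"]
level3_pairs = int((n_A * (n_A - 1) // 2).sum())  # all Level-3 edge pairs
level1_pairs = 34 * (34 - 1) // 2                 # all Level-1 seed-repo pairs
level2_questions = 34                             # one originality question per seed repo
print(level1_pairs + level2_questions + level3_pairs)
#+end_src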
0.2. deepfunding scoring
First let me quickly go over how Deepfunding scores the contestants.
The cost for Level-1 answers is \(\left|\log(w_{O\to A_2}/w_{O\to A_1}) - j_{(O\to A_1),(O\to A_2)}\right|^2\) , summed over all pairs \(((O\to A_1),(O\to A_2))\) for which the juror has provided an estimate.
The cost for Level-2 answers is simply \(\left|w_A-j_A\right|^2\) summed over all \(A\) for which the juror has provided an estimate.
The cost for Level-3 answers is again \(\left|\log(w_{A\to B_2}/w_{A\to B_1}) - j_{(A\to B_1),(A\to B_2)}\right|^2\) , summed over all pairs \(((A\to B_1),(A\to B_2))\) for which the juror has provided an estimate.
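In code, the three per-sample costs look like this (a direct transcription of the above; the function names are mine):

#+begin_src python
import numpy as np

def level1_or_level3_cost(w1, w2, juror_logit):
    """Cost of one juror pair-sample at Level 1 or Level 3: squared error
    between the log of the submitted weight ratio and the juror's logit."""
    return (np.log(w2 / w1) - juror_logit) ** 2

def level2_cost(w_A, j_A):
    """Cost of one juror originality sample at Level 2."""
    return (w_A - j_A) ** 2
#+end_src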
BLEG: I’m not sure how these are to be weighted. The concrete instructions just say they are summed over all juror samples, but this depends on how many juror samples are taken of each category — this is probably important if we want to weight questions properly. We should ask them, or maybe we can just create another event to ask the market what it thinks they will do :)
This covers how Deepfunding will score our final model submission. As we will see, this does not necessarily straightforwardly translate to how we score miners in preparing our model.
0.3. scoring for level 2
Level-2 is straightforward — we simply create a question for each \(A\) (depth-1 node) asking “how original is \(A\)?” and score as:
\[s(w_A)=\left|w_A-j_A\right|^2\]
if \(j_A\) is elicited, and 0 otherwise. Then perform the peer score adjustment (otherwise the miners are incentivized to just not bet).
0.4. scoring for level 1 and level 3

The basic problem for scoring Level-1 and Level-3 questions is:
- we can only create events for each edge, not for each of the 8 million pairs of edges
- the scoring needs to be “modular” i.e. the total score needs to be reducible to a sum of scoring functions that each depend on only one question. \(\sum\left|\log(w_{A\to B_2}/w_{A\to B_1}) - j_{(A\to B_1),(A\to B_2)}\right|^2\) does not satisfy this property.
Three possible solutions:
0.4.1. straightforward approach
Here's one idea: for Level-3 we create one event per \(A\), i.e. per depth-1 node (and for Level-1 we analogously create just one event for \(O\)). To forecast this event the miner reports a vector \(w_{A\to}\in[0,1]^{n_A}\) such that \(\sum w_{A\to}=1\), i.e. weights for all of \(A\)'s dependencies, and this answer is scored in a special way (rather than just as a standard continuous random variable):
\[ s(w_{A\to})=\sum\left|\log(w_{A\to B_2}/w_{A\to B_1}) - j_{(A\to B_1),(A\to B_2)}\right|^2 \] where the sum is over pairs \((A\to B_1,A\to B_2)\) for which the jury ultimately gives an answer.
(and again just make the peer score adjustment as per normal)
This way we have just 34 events for Level-3 and 1 event for Level-1 (in addition to the 34 events for Level-2).
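A minimal sketch of how the validator would score such a vector answer (names are mine; the event plumbing is elided):

#+begin_src python
import numpy as np

def score_weight_vector(w, jury_pairs):
    """Cost of one miner's full weight vector for a single depth-1 node A.

    w: dict mapping each dependency B to the miner's weight w_{A->B}
       (the weights should sum to 1).
    jury_pairs: dict mapping (B1, B2) to the juror logit j_{(A->B1),(A->B2)},
        for the edge pairs the jury ultimately answered.
    Lower is better; the peer score adjustment happens afterwards.
    """
    return sum(
        (np.log(w[b2] / w[b1]) - j) ** 2 for (b1, b2), j in jury_pairs.items()
    )
#+end_src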
Key questions, aka potential problems with tiny numbers:
- The answers to these questions might be very high-dimensional vectors — the lowest dimension is 6, the highest is 2277 (full numbers in the helper csv section). Could this be an issue? Can LLM-based miners even meaningfully fit so much in their context window (I mean they can, but like, usefully)?
- How "advanced" are the miners on the network right now? If they're all just simple LLM callers (without any domain-specific engineering) they would probably have a problem with this task, e.g. even just producing such tiny numbers. This would also be a problem for the Shapley values solution, actually.
I mean, I can imagine making a decent miner by e.g. just asking an LLM to make relative comparisons and fitting some model to it, but if the miners on the network do not do such things, it would be no use — wait, actually that gives me an idea, see the section "shapley values from relative comparisons".

0.4.2. shapley values
Here’s the idea: we create events for each edge, and score miners based on how useful their weight estimate was to the final cost function. Fortunately calculating Shapley values here is easy, because the cost is independent of any permutations.
For each edge weight \(w_{A\to B}\) (and likewise for the 34 edges \(w_{O\to A}\)) we create an event:
Estimate the relative importance of dependency B to project A as a number between 0 and 1. {{ description of how miners will be scored, i.e. a more practical summary of this section }} {{ training set data }}
From all the miner estimates for \(w_{A\to B}\) we get a consensus estimate \(\hat{w}_{A\to B}\) in the usual way.[5]
[IGNORE THIS. go with either the straightforward approach or the below one]
0.4.3. shapley values from relative comparisons
Ok, here’s perhaps the most promising approach: we do give miners pairs of edges. But we don’t need to give them all pairs of edges.
For node \(A\) with dependencies \(B_1,\dots,B_n\), we can just write questions for the \(n-1\) adjacent pairs of edges:
- \(c_{(A\to B_1),(A\to B_2)}\)
- \(c_{(A\to B_2),(A\to B_3)}\)
- …
- \(c_{(A\to B_{n-1}),(A\to B_n)}\)
How much more important is dependency B_{i+1} than B_i to A? i.e. estimate log(w_{A→B_{i+1}} / w_{A→B_i}) ...
… and calculate the implied pairwise comparison for any \(c_{(A\to B_i),(A\to B_j)}\) (specifically for those edge pairs the jury scores) by simply summing:
\[
c_{(A\to B_i),(A\to B_j)} = \sum_{k=i}^{j-1}c_{(A\to B_k),(A\to B_{k+1})}
\]
(for \(i<j\)).
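In code, the reconstruction is just a telescoping sum (a 0-indexed sketch, with `adjacent[k]` holding the forecast \(c_{(A\to B_k),(A\to B_{k+1})}\)):

#+begin_src python
def implied_comparison(adjacent, i, j):
    """Implied comparison c_{(A->B_i),(A->B_j)} for i < j, reconstructed
    from the n-1 adjacent-pair forecasts by telescoping."""
    assert 0 <= i < j <= len(adjacent)
    return sum(adjacent[i:j])
#+end_src

We can't directly pay miners based on these implied comparisons without breaking modularity, though: each one depends on several questions at once.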
Instead, for each \(j_{(A\to B_i),(A\to B_j)}\) that we receive, we can measure the relative contributions of each \(c_{(A\to B_k),(A\to B_{k+1})}\) to the cost function. We imagine that before the miners’ forecasts, all \(c_{(A\to B_k),(A\to B_{k+1})}\) were initialized to zero (i.e. a uniform prior). Then these forecasts define a coalitional game, as follows.
Definition: a single miner's forecasts as a coalitional game. The miner's forecast on each \(c_{(A\to B_k),(A\to B_{k+1})}\) defines a player \(k\in\{1,\dots,n-1\}\) in an \((n-1)\)-player coalitional game, with a value function on subsets \(S\subseteq\{1,\dots,n-1\}\) as follows:
\[ v(S)= -\sum_{(i,j)}\left(\left|j_{(A\to B_i),(A\to B_j)}-\sum_{k\in S\cap \{i,\dots j-1\}} c_{(A\to B_k),(A\to B_{k+1})}\right|^2-\left|j_{(A\to B_i),(A\to B_j)}\right|^2\right) \] (where the outer summation is taken over all \((i,j)\) pairs such that \(j_{(A\to B_i),(A\to B_j)}\) is in the jury sample)
(crucially, we can just take the remaining forecasts in the expression as “external facts of the world”, i.e. information known to the validator — so the scoring rule itself is modular.)
This lets us take the Shapley value in the definitional way.
\[
\begin{align*}
s(c_{(A\to B_k),(A\to B_{k+1})}) &= \varphi_k(v)\\
&= \frac{1}{(n-1)!}\sum_R\left[v(\{j\mid j<_R k\}\cup\{k\}) - v(\{j\mid j<_R k\})\right]
\end{align*}
\]
where the sum is over all \((n-1)!\) orderings \(R\) of the players (and then make the peer score adjustment against all other miners).
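Here is a brute-force sketch of this computation for one miner on one node \(A\) (names are mine, indices 0-based; exact enumeration is only feasible for small \(n\), so in practice one would subsample orderings):

#+begin_src python
import math
from itertools import permutations

def make_value_function(jury, forecasts):
    """Value function v(S) of the coalitional game defined above.

    jury: dict mapping 0-indexed pairs (i, j), i < j, to the juror
        logit j_{(A->B_i),(A->B_j)}.
    forecasts: forecasts[k] is the miner's adjacent-pair forecast
        c_{(A->B_k),(A->B_{k+1})}.
    """
    def v(S):
        total = 0.0
        for (i, j), target in jury.items():
            # Only the adjacent-pair forecasts in the coalition S contribute.
            partial = sum(forecasts[k] for k in range(i, j) if k in S)
            total -= (target - partial) ** 2 - target ** 2
        return total
    return v

def shapley_values(n_players, v):
    """Exact Shapley values by enumerating all n_players! orderings."""
    phi = [0.0] * n_players
    for order in permutations(range(n_players)):
        coalition = set()
        for player in order:
            before = v(frozenset(coalition))
            coalition.add(player)
            phi[player] += v(frozenset(coalition)) - before
    return [p / math.factorial(n_players) for p in phi]

# Toy example: n = 4 dependencies, so 3 adjacent-pair players.
jury = {(0, 2): 1.2, (1, 3): -0.5}  # sampled juror logits (made up)
forecasts = [0.8, 0.3, -0.6]        # one miner's adjacent-pair forecasts
v = make_value_function(jury, forecasts)
print(shapley_values(3, v))
#+end_src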
0.5. most relevant resources

- Concrete instructions, including scoring rule, sample submission file, etc.: https://cryptopond.xyz/modelfactory/detail/2564617?tab=0
- Deepfunding scoring mechanism: https://github.com/deepfunding/scoring
- Eval Primer #2: Build your own model. https://www.youtube.com/watch?v=JUiwrcMASXY — primer for the previous Deepfunding Mini competition
- “d/acc: one year later” from Vitalik’s blog: https://vitalik.eth.limo/general/2025/01/05/dacc2.html#5
0.6. appendix: question generation script
#+begin_src python
import csv
import json
import os
import time
from datetime import datetime
from functools import lru_cache
from urllib.parse import urlparse

import numpy as np
import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()

INCLUDE_REPO_HEURISTICS = True

if "GITHUB_TOKEN" not in os.environ and INCLUDE_REPO_HEURISTICS:
    print(
        "Warning: GitHub API token not found. You will get rate-limited. "
        "You should set INCLUDE_REPO_HEURISTICS to False."
    )

L1_TRAIN_SET_EXAMPLES = ""
L2_TRAIN_SET_EXAMPLES = ""
L3_TRAIN_SET_EXAMPLES = ""

PROMPT_INTRO = "This question is for the `Deepfunding competition`, a distillation prediction market for determining the relative importances of different dependencies in the Ethereum ecosystem."

USE_LONG_JUROR_LIST = False
SHORT_JUROR_LIST = "\nThe jurors are expected to be experts in the Ethereum ecosystem. Some known names include: Jason, Toni Wahrstatter, Ladislaus, Vitalik Buterin, DC Builder, Vectorized and Marius Van Der."
LONG_JUROR_LIST = """
This is the list of publicly known jurors, if it helps you. The jurors are expected to be experts in the Ethereum ecosystem.
Juror | Nominator | Votes | Affiliation | Github | ENS/Wallet |
---|---|---|---|---|---|
Vitalik Buterin | Invitation | 10 | EF | vbuterin | vitalik.eth |
Changwu | Vitalik Buterin | 0 | imwallet | ||
Justin Drake | Vitalik Buterin | 0 | EF | ||
Anton Cheng | Changwu | 5 | |||
Nicholas Lin | Changwu | imwallet | |||
Toni Wahrstatter | Justin Drake | 17 | EF | ||
Ladislaus | Justin Drake | 11 | EF | ||
DC Builder | Anton Cheng | 10 | worldcoin | ||
Vectorized | Anton Cheng | 10 | |||
Jason | Nicholas Lin | 32 | |||
Oskar | Nicholas Lin | 2 | |||
Alex Stokes | Toni Wahrstatter | 4 | EF | ||
Parithosh Jayanti | Toni Wahrstatter | Nethermind | |||
Auston Sterling | Ladislaus | 3 | |||
Marius Van Der | Ladislaus | 10 | |||
Mark Tyneway | Vectorized | Optimism | |||
Georgios | Vectorized | Reth | |||
TCZPL | Jason | ||||
Ambition Chen | Jason | 35 | |||
Adrian | Oskar | ||||
Chih Cheng Liang | Oskar | ||||
Matt (lightclient) | Alex Stokes | ||||
Josh Rudolf | Alex Stokes | EF | |||
Mikhail Kalinin | Paritosh Jayanti | Nethermind | |||
Marek Morakzynski | Paritosh Jayanti | Nethermind | |||
Nixo | Auston Sterling | 7 | EF | ||
Logris | Auston Sterling | ||||
Hudson Jameson | Marius Van Der | ||||
Terence Tsao | Marius Van Der | 4 | Prysm | ||
Jacek | Terence Tsao | nimbus | |||
Adrian | Terence Tsao | lighthouse | |||
Haurog | Logris | ||||
Pooja Ranjan | Nixo | ethereum cat herders | |||
Butta | Nixo | ||||
Tim Beiko | Pooja Ranjan | ||||
Sassal0x | Pooja Ranjan | ||||
G | Marek Morakzynski | ||||
Ahmed | Marek Morakzynski | ||||
Ansgar | Ahmed | ||||
Potuz | Ahmed | ||||
Preston | Potuz | ||||
Nishant | Potuz | ||||
Felix Lange | Mikhail Kalinin | go ethereum | |||
Piper Merriam | Mikhail Kalinin | 6 | |||
Janmajaya | Chih Cheng Liang | ||||
Graham | Ambition Chen | ||||
Banri | Ambition Chen | ||||
adjust | Banri | ||||
yanyanho dapplearning | TCZPL | ||||
boge james (weimumu) | TCZPL | ||||
Kelvin | Mark Tyneway | ||||
mlsudo | Mark Tyneway | ||||
Jason Carver | Piper Merriam | ||||
Redwan Meslem | Invitation | web3.js | |||
Richard Moore | Invitation | ethers.js | |||
tom | Invitation | Viem | |||
Patricio | Invitation | Hardhat | |||
Andrew | Invitation | Remix | |||
Bryant Eisenbach | Invitation | Ape | |||
benny | Invitation | Boa | |||
Ligi | Invitation | Chainlist | |||
benny | Invitation | Vyper | |||
Kaan | Invitation | Sourcify | |||
Austin Griffith | Invitation | Scaffold-eth (v1 + v2) | |||
Jaydon | zengjiajun.eth | elytro | |||
Joi | zengjiajun.eth | elytro | |||
Marc | Invitation | web3.py | |||
Felipe | Invitation | web3.py | |||
Wesley | Sky | EF |
"""

TRAIN_SET_EXAMPLES_PROMPT = (
    "Here are some existing juror answers from the public 'training set'."
)

L1_PROMPT_BASE = """
For this question, you will need to estimate the relative importances of two direct dependencies of Ethereum:

<QUESTION>
{repo1} and {repo2} are dependencies of Ethereum. Estimate the ratio of importance of {repo2} to {repo1}. E.g. if {repo2} is 10 times more important than {repo1} then answer "10"; if {repo1} is 10 times more important than {repo2} then answer "0.1".
</QUESTION>

This exact question will be asked to the Deepfunding jury. **Your job is to predict how the Deepfunding jury will answer this question, and answer as close as possible to what the jury will.** To be exact: we will score you based on the mean-squared-error between the log of your answer and the log of the jury's answer.

Your answer must be a positive float.
"""

L2_PROMPT_BASE = """
For this question, you will be given a repository and you need to estimate how much of its value belongs to that repository itself, versus its dependencies.

<QUESTION>
How much of {repo}'s value comes from itself, versus its dependencies? E.g.

1. 0.2 – The project is largely a fork or wrapper of something else; it does less original work relative to the work in its dependencies. Examples: Brave (a fork of Chromium), Ollama (a wrapper of llama.cpp).
2. 0.5 – The project is heavily dependent on its dependencies but also has substantial original work. Example: An Ethereum wallet.
3. 0.8 – The project is mostly original work and depends only on generic libraries; it could likely have been built without those dependencies if necessary.
</QUESTION>

This exact question will be asked to the Deepfunding jury. **Your job is to predict how the Deepfunding jury will answer this question, and answer as close as possible to what the jury will.** To be exact: we will score you based on the mean-squared-error between your answer and the jury's answer.

Your answer must be a float between 0 and 1.
"""

L3_PROMPT_BASE = """
For this question, we are looking at the {parent} repository. You will need to estimate the relative importances of two dependencies of this repository -- i.e. which of their dependencies matters more for {parent}.

<QUESTION>
{repo1} and {repo2} are dependencies of {parent}. Estimate the ratio of importance of {repo2} compared to {repo1} for {parent}. E.g. if {repo2} is 10 times more important than {repo1} for {parent} then answer "10"; if {repo1} is 10 times more important than {repo2} for {parent} then answer "0.1".
</QUESTION>

This exact question will be asked to the Deepfunding jury. **Your job is to predict how the Deepfunding jury will answer this question, and answer as close as possible to what the jury will.** To be exact: we will score you based on the mean-squared-error between the log of your answer and the log of the jury's answer.

Your answer must be a positive float.
"""
class RepoHeuristics:
    """Fetch and cache heuristics about GitHub repositories."""

    def __init__(self, github_token=None, cache_ttl=3600):
        """
        Initialize the heuristics fetcher.

        Args:
            github_token: GitHub API token (optional but recommended to avoid rate limits)
            cache_ttl: Time in seconds to cache repository data
        """
        self.github_token = github_token or os.environ.get("GITHUB_TOKEN")
        self.headers = (
            {"Authorization": f"token {self.github_token}"} if self.github_token else {}
        )
        self.cache_ttl = cache_ttl
        self.repo_cache = {}

    @staticmethod
    def parse_github_url(repo_url):
        """Extract owner and repo name from a GitHub URL."""
        if not repo_url or "github.com" not in repo_url:
            return None, None

        parsed = urlparse(repo_url)
        path_parts = parsed.path.strip("/").split("/")

        if len(path_parts) < 2:
            return None, None

        return path_parts[0], path_parts[1]

    @lru_cache(maxsize=128)
    def get_repo_info(self, repo_url):
        """
        Fetch detailed information about a repository.

        Args:
            repo_url: Full URL to the GitHub repository

        Returns:
            Dictionary with repository information or None on failure
        """
        owner, repo = self.parse_github_url(repo_url)
        if not owner or not repo:
            return None

        cache_key = f"{owner}/{repo}"
        if cache_key in self.repo_cache:
            cached_data, timestamp = self.repo_cache[cache_key]
            if time.time() - timestamp < self.cache_ttl:
                return cached_data

        try:
            repo_api_url = f"https://api.github.com/repos/{owner}/{repo}"
            response = requests.get(repo_api_url, headers=self.headers)
            response.raise_for_status()
            repo_data = response.json()

            contributors_url = f"{repo_api_url}/contributors?per_page=5"
            contributors_resp = requests.get(contributors_url, headers=self.headers)
            contributors_resp.raise_for_status()
            contributors = contributors_resp.json()

            languages_url = f"{repo_api_url}/languages"
            languages_resp = requests.get(languages_url, headers=self.headers)
            languages_resp.raise_for_status()
            languages = languages_resp.json()

            repo_info = {
                "name": repo_data.get("name"),
                "full_name": repo_data.get("full_name"),
                "description": repo_data.get("description"),
                "stars": repo_data.get("stargazers_count", 0),
                "forks": repo_data.get("forks_count", 0),
                "watchers": repo_data.get("watchers_count", 0),
                "open_issues": repo_data.get("open_issues_count", 0),
                "created_at": repo_data.get("created_at"),
                "updated_at": repo_data.get("updated_at"),
                "contributors": [c.get("login") for c in contributors[:5]],
                "languages": languages,
                "homepage": repo_data.get("homepage"),
                "license": repo_data.get("license", {}).get("name")
                if repo_data.get("license")
                else None,
                "topics": repo_data.get("topics", []),
                "size": repo_data.get("size", 0),
            }

            self.repo_cache[cache_key] = (repo_info, time.time())

            return repo_info

        except Exception as e:
            print(f"Error fetching data for {repo_url}: {e}")
            return None

    def format_repo_heuristics(self, repo_url):
        """Format repository information as a readable string."""
        repo_info = self.get_repo_info(repo_url)
        if not repo_info:
            return f"Repository {repo_url} information not available."

        total_bytes = sum(repo_info["languages"].values()) if repo_info["languages"] else 1
        language_percentages = {
            lang: f"{count / total_bytes * 100:.1f}%"
            for lang, count in repo_info["languages"].items()
        }

        created_date = (
            datetime.strptime(repo_info["created_at"], "%Y-%m-%dT%H:%M:%SZ")
            if repo_info["created_at"]
            else None
        )
        age = (datetime.now() - created_date).days // 30 if created_date else "unknown"

        main_languages = ", ".join(
            f"{lang} ({pct})" for lang, pct in list(language_percentages.items())[:3]
        )

        info_string = f"""[{repo_info['full_name']}]:
- Description: {repo_info['description'] or 'No description'}
- Stars: {repo_info['stars']}, Forks: {repo_info['forks']}
- Age: {age} months, Last updated: {repo_info['updated_at'].split('T')[0] if repo_info['updated_at'] else 'Unknown'}
- Main languages: {main_languages}
- Top contributors: {', '.join(repo_info['contributors']) if repo_info['contributors'] else 'Unknown'}
- Topics: {', '.join(repo_info['topics']) if repo_info['topics'] else 'None'}
"""
        return info_string

    def compare_repos(self, repo1_url, repo2_url):
        """Compare two repositories and return a formatted comparison string."""
        repo1_info = self.get_repo_info(repo1_url)
        repo2_info = self.get_repo_info(repo2_url)

        if not repo1_info or not repo2_info:
            return "Comparison information not available for one or both repositories."

        # Clamp star counts to at least 1 to avoid division by zero.
        star_ratio = max(1, repo2_info["stars"]) / max(1, repo1_info["stars"])
        if star_ratio > 1.5:
            star_comparison = f"{repo2_info['full_name']} has {star_ratio:.1f}x more stars than {repo1_info['full_name']}"
        elif star_ratio < 0.67:
            star_comparison = f"{repo1_info['full_name']} has {(1 / star_ratio):.1f}x more stars than {repo2_info['full_name']}"
        else:
            star_comparison = f"Both repositories have similar star counts ({repo1_info['stars']} vs {repo2_info['stars']})"

        try:
            repo1_date = (
                datetime.strptime(repo1_info["updated_at"], "%Y-%m-%dT%H:%M:%SZ")
                if repo1_info["updated_at"]
                else None
            )
            repo2_date = (
                datetime.strptime(repo2_info["updated_at"], "%Y-%m-%dT%H:%M:%SZ")
                if repo2_info["updated_at"]
                else None
            )

            if repo1_date and repo2_date:
                date_diff = abs((repo1_date - repo2_date).days)
                if date_diff > 30:
                    more_recent = f"{repo1_info['full_name'] if repo1_date > repo2_date else repo2_info['full_name']} has been updated more recently"
                else:
                    more_recent = "Both repositories have been updated recently"
            else:
                more_recent = "Update information not available for comparison"
        except Exception:
            more_recent = "Error comparing dates"

        comparison = f"""
1. {repo1_info['full_name']} vs 2. {repo2_info['full_name']}
- Star comparison: {star_comparison}
- Activity: {more_recent}
- Age: {repo1_info['created_at'].split('T')[0] if repo1_info['created_at'] else 'Unknown'} vs {repo2_info['created_at'].split('T')[0] if repo2_info['created_at'] else 'Unknown'}
- Languages:
  • {repo1_info['full_name']}: {', '.join(list(repo1_info['languages'].keys())[:3]) if repo1_info['languages'] else 'Unknown'}
  • {repo2_info['full_name']}: {', '.join(list(repo2_info['languages'].keys())[:3]) if repo2_info['languages'] else 'Unknown'}
"""
        return comparison


repo_heuristics = RepoHeuristics()
def _format_repo_name(repo: str):
    """Format a repository name for display."""
    return (
        repo.replace("https://", "")
        .replace("http://", "")
        .replace("www.", "")
        .replace("github.com", "")
        .strip("/")
    )


def format_l1_prompt(repo1, repo2):
    prompt = PROMPT_INTRO + L1_PROMPT_BASE.format(
        repo1=_format_repo_name(repo1), repo2=_format_repo_name(repo2)
    )

    if INCLUDE_REPO_HEURISTICS:
        try:
            comparison = repo_heuristics.compare_repos(repo1, repo2)
            prompt += "Repository Comparison:\n" + comparison
        except Exception as e:
            print(
                f"Warning: Could not fetch repository comparison for {repo1} and {repo2}: {e}"
            )

    prompt += LONG_JUROR_LIST if USE_LONG_JUROR_LIST else SHORT_JUROR_LIST
    prompt += (
        TRAIN_SET_EXAMPLES_PROMPT + "\n\n" + L1_TRAIN_SET_EXAMPLES
        if L1_TRAIN_SET_EXAMPLES
        else ""
    )
    return prompt


def format_l2_prompt(repo):
    prompt = PROMPT_INTRO + L2_PROMPT_BASE.format(repo=_format_repo_name(repo))

    if INCLUDE_REPO_HEURISTICS:
        try:
            repo_info = repo_heuristics.format_repo_heuristics(repo)
            prompt += "Repository Information:\n" + repo_info
        except Exception as e:
            print(f"Warning: Could not fetch repository information for {repo}: {e}")

    prompt += LONG_JUROR_LIST if USE_LONG_JUROR_LIST else SHORT_JUROR_LIST
    prompt += (
        TRAIN_SET_EXAMPLES_PROMPT + "\n\n" + L2_TRAIN_SET_EXAMPLES
        if L2_TRAIN_SET_EXAMPLES
        else ""
    )
    return prompt


def format_l3_prompt(repo1, repo2, parent):
    prompt = PROMPT_INTRO + L3_PROMPT_BASE.format(
        repo1=_format_repo_name(repo1),
        repo2=_format_repo_name(repo2),
        parent=_format_repo_name(parent),
    )

    if INCLUDE_REPO_HEURISTICS:
        try:
            parent_info = repo_heuristics.format_repo_heuristics(parent)
            prompt += "Parent Repository Information:\n" + parent_info

            comparison = repo_heuristics.compare_repos(repo1, repo2)
            prompt += "\nDependency Comparison:\n" + comparison
        except Exception as e:
            print(f"Warning: Could not fetch repository information: {e}")

    prompt += LONG_JUROR_LIST if USE_LONG_JUROR_LIST else SHORT_JUROR_LIST
    prompt += (
        TRAIN_SET_EXAMPLES_PROMPT + "\n\n" + L3_TRAIN_SET_EXAMPLES
        if L3_TRAIN_SET_EXAMPLES
        else ""
    )
    return prompt
def load_csv(filepath):
    """Load CSV file into a pandas DataFrame."""
    try:
        df = pd.read_csv(filepath, skipinitialspace=True)
        return df
    except Exception as e:
        raise Exception(f"Error loading CSV file: {e}")


def extract_important_repos(df):
    """Extract the list of important repositories from rows with ethereum as parent."""
    important_repos = list(df[df["parent"] == "ethereum"]["repo"])
    return important_repos


def validate_important_repos(df, important_repos):
    """Perform validations on the list of important repositories."""
    if len(important_repos) != 35:
        raise ValueError(
            f"Expected 35 important repos, but found {len(important_repos)}"
        )

    ethereum_children = set(df[df["parent"] == "ethereum"]["repo"].unique())
    originality_children = set(df[df["parent"] == "originality"]["repo"].unique())

    if ethereum_children != originality_children:
        diff = ethereum_children.symmetric_difference(originality_children)
        raise ValueError(f"Mismatch between ethereum and originality lists: {diff}")

    middle_sections = df[~df["parent"].isin(["ethereum", "originality"])]
    middle_parents = set(middle_sections["parent"].unique())

    if not middle_parents.issubset(set(important_repos)):
        invalid_parents = middle_parents - set(important_repos)
        raise ValueError(
            f"Found items in middle column that are not in important_repos: {invalid_parents}"
        )

    return True


def calculate_dependency_weights(df, important_repos):
    """
    Calculate total weights of dependencies for each important repo
    and validate they sum to 1.
    """
    dependency_weights = {}
    dependency_counts = {}

    for repo in important_repos:
        deps = df[
            (df["parent"] == repo) & ~df["repo"].isin(["ethereum", "originality"])
        ]

        total_weight = deps["weight"].sum()
        dependency_weights[repo] = total_weight
        dependency_counts[repo] = len(deps)

        if len(deps) > 0 and not np.isclose(total_weight, 1.0, atol=1e-6):
            raise ValueError(f"Weights for {repo} sum to {total_weight}, not 1.0")

    return dependency_weights, dependency_counts


def calculate_combinations(n):
    """Calculate n choose 2, which is n*(n-1)/2."""
    return n * (n - 1) / 2
def get_repo_classification_weights(df, important_repos):
    """Get the ethereum and originality weights for each important repo."""
    ethereum_weights = dict(
        zip(
            df[df["parent"] == "ethereum"]["repo"],
            df[df["parent"] == "ethereum"]["weight"],
        )
    )
    originality_weights = dict(
        zip(
            df[df["parent"] == "originality"]["repo"],
            df[df["parent"] == "originality"]["weight"],
        )
    )
    return ethereum_weights, originality_weights


def compile_output_csv(
    important_repos,
    dependency_weights,
    dependency_counts,
    ethereum_weights,
    originality_weights,
    output_file,
):
    """Create the output CSV file with the required columns."""
    combinations = {}
    total_combinations = 0

    for repo in important_repos:
        count = dependency_counts.get(repo, 0)
        comb = calculate_combinations(count)
        combinations[repo] = comb
        total_combinations += comb

    with open(output_file, "w", newline="") as csvfile:
        fieldnames = [
            "important_repo",
            "sum_dep_weights",
            "num_deps",
            "num_deps_combinations",
            "originality",
            "ethereum",
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for repo in important_repos:
            writer.writerow(
                {
                    "important_repo": repo,
                    "sum_dep_weights": dependency_weights.get(repo, 0),
                    "num_deps": dependency_counts.get(repo, 0),
                    "num_deps_combinations": combinations.get(repo, 0),
                    "originality": originality_weights.get(repo, 0),
                    "ethereum": ethereum_weights.get(repo, 0),
                }
            )

    return total_combinations
def generate_questions(df, important_repos, questions_file):
    """
    Generate three lists of questions based on the data and write them
    to a JSONL file:

    - Level 1: Questions about consecutive pairs of important repos
    - Level 2: Questions about each important repo's originality
    - Level 3: Questions about consecutive pairs of dependencies for each important repo

    Args:
        df: DataFrame containing the CSV data
        important_repos: List of important repositories
        questions_file: Path to the output JSONL file

    Returns:
        Tuple of (level1_count, level2_count, level3_count) indicating
        the number of questions generated
    """
    with open(questions_file, "w") as jsonl_file:
        level1_count = _generate_and_write_level1_questions(important_repos, jsonl_file)
        level2_count = _generate_and_write_level2_questions(important_repos, jsonl_file)
        level3_count = _generate_and_write_level3_questions(
            df, important_repos, jsonl_file
        )

    return level1_count, level2_count, level3_count


def _generate_and_write_level1_questions(important_repos, jsonl_file):
    """
    Generate and write questions for consecutive pairs of important repos.

    Args:
        important_repos: List of important repositories
        jsonl_file: File handle for the output JSONL file

    Returns:
        Number of questions generated
    """
    count = 0

    for i in range(len(important_repos)):
        repo1 = important_repos[i]
        repo2 = important_repos[(i + 1) % len(important_repos)]  # Wrap around to start

        content = format_l1_prompt(repo1, repo2)

        question = {
            "level": 1,
            "repo1": repo1,
            "repo2": repo2,
            "parent": "ethereum",
            "content": content,
        }

        json.dump(question, jsonl_file)
        jsonl_file.write("\n")
        count += 1

    return count


def _generate_and_write_level2_questions(important_repos, jsonl_file):
    """
    Generate and write questions about each important repo's originality.

    Args:
        important_repos: List of important repositories
        jsonl_file: File handle for the output JSONL file

    Returns:
        Number of questions generated
    """
    count = 0

    for repo in important_repos:
        content = format_l2_prompt(repo)

        question = {
            "level": 2,
            "repo": repo,
            "parent": "originality",
            "content": content,
        }

        json.dump(question, jsonl_file)
        jsonl_file.write("\n")
        count += 1

    return count


def _generate_and_write_level3_questions(df, important_repos, jsonl_file):
    """
    Generate and write questions about consecutive pairs of dependencies
    for each important repo.

    Args:
        df: DataFrame containing the CSV data
        important_repos: List of important repositories
        jsonl_file: File handle for the output JSONL file

    Returns:
        Number of questions generated
    """
    count = 0

    middle_rows = df[~df["parent"].isin(["ethereum", "originality"])]

    for parent_repo in important_repos:
        dependencies = middle_rows[middle_rows["parent"] == parent_repo][
            "repo"
        ].tolist()

        if len(dependencies) > 1:  # Only if there are at least 2 dependencies
            for i in range(len(dependencies)):
                repo1 = dependencies[i]
                repo2 = dependencies[
                    (i + 1) % len(dependencies)
                ]  # Wrap around to start

                content = format_l3_prompt(repo1=repo1, repo2=repo2, parent=parent_repo)

                question = {
                    "level": 3,
                    "repo1": repo1,
                    "repo2": repo2,
                    "parent": parent_repo,
                    "content": content,
                }

                json.dump(question, jsonl_file)
                jsonl_file.write("\n")
                count += 1

    return count
def main(input_file, output_file, questions_file=None):
    """Main function to process the CSV file and optionally generate questions."""
    try:
        df = load_csv(input_file)
        important_repos = extract_important_repos(df)
        validate_important_repos(df, important_repos)

        dependency_weights, dependency_counts = calculate_dependency_weights(
            df, important_repos
        )
        ethereum_weights, originality_weights = get_repo_classification_weights(
            df, important_repos
        )

        total_combinations = compile_output_csv(
            important_repos,
            dependency_weights,
            dependency_counts,
            ethereum_weights,
            originality_weights,
            output_file,
        )

        print(f"Successfully processed {input_file} and created {output_file}")
        print(
            f"Found {len(important_repos)} important repositories with valid weight distributions."
        )
        print(
            f"Total number of dependency pairs (num_deps_combinations): {total_combinations}"
        )

        if questions_file:
            level1_count, level2_count, level3_count = generate_questions(
                df, important_repos, questions_file
            )

            print(f"\nGenerated and wrote to {questions_file}:")
            print(f"- {level1_count} level 1 questions (important repo comparisons)")
            print(f"- {level2_count} level 2 questions (repo originality)")
            print(f"- {level3_count} level 3 questions (dependency comparisons)")
            print(f"Total: {level1_count + level2_count + level3_count} questions")

    except Exception as e:
        print(f"Error: {e}")
        raise


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Process repo dependency CSV file.")
    parser.add_argument("input_file", help="Path to the input CSV file")
    parser.add_argument(
        "--output_file",
        default="repo_analysis.csv",
        help="Path to the output CSV file (default: repo_analysis.csv)",
    )
    parser.add_argument(
        "--questions_file", help="Path to output JSONL file for generated questions"
    )

    args = parser.parse_args()
    main(args.input_file, args.output_file, args.questions_file)
#+end_src
0.7. appendix: sample questions
Level 1
#+begin_src markdown
This question is for the `Deepfunding competition`, a distillation prediction market for determining the relative importances of different dependencies in the Ethereum ecosystem.

For this question, you will need to estimate the relative importances of two direct dependencies of Ethereum:

<QUESTION> web3/web3.js and prysmaticlabs/prysm are dependencies of Ethereum. Estimate the ratio of importance of prysmaticlabs/prysm to web3/web3.js. E.g. if prysmaticlabs/prysm is 10 times more important than web3/web3.js then answer "10"; if web3/web3.js is 10 times more important than prysmaticlabs/prysm then answer "0.1". </QUESTION>

This exact question will be asked to the Deepfunding jury. **Your job is to predict how the Deepfunding jury will answer this question, and answer as close as possible to what the jury will.** To be exact: we will score you based on the mean-squared-error between the log of your answer and the log of the jury's answer.

Your answer must be a positive float.

Repository Comparison:
1. web3/web3.js vs 2. prysmaticlabs/prysm
- Star comparison: web3/web3.js has 5.5x more stars than prysmaticlabs/prysm
- Activity: Both repositories have been updated recently
- Age: 2014-09-30 vs 2018-01-11
- Languages:
  • web3/web3.js: TypeScript, JavaScript, Shell
  • prysmaticlabs/prysm: Go, Starlark, Shell

The jurors are expected to be experts in the Ethereum ecosystem. Some known names include: Jason, Toni Wahrstatter, Ladislaus, Vitalik Buterin, DC Builder, Vectorized and Marius Van Der.
#+end_src
Level 2
#+begin_src markdown
This question is for the `Deepfunding competition`, a distillation prediction market for determining the relative importances of different dependencies in the Ethereum ecosystem.
For this question, you will be given a repository and you need to estimate how much of its value belongs to that repository itself, versus its dependencies.
<QUESTION> How much of vyperlang/vyper’s value comes from itself, versus its dependencies? E.g.
1. 0.2 – The project is largely a fork or wrapper of something else; it does less original work relative to the work in its dependencies. Examples: Brave (a fork of Chromium), Ollama (a wrapper of llama.cpp).
2. 0.5 – The project is heavily dependent on its dependencies but also has substantial original work. Example: An Ethereum wallet.
3. 0.8 – The project is mostly original work and depends only on generic libraries; it could likely have been built without those dependencies if necessary.
</QUESTION>
This exact question will be asked to the Deepfunding jury. Your job is to predict how the Deepfunding jury will answer this question, and answer as close as possible to what the jury will. To be exact: we will score you based on the mean-squared-error between your answer and the jury’s answer.
Your answer must be a float between 0 and 1.
Repository Information: [vyperlang/vyper]:
- Description: Pythonic Smart Contract Language for the EVM
- Stars: 4995, Forks: 837
- Age: 101 months, Last updated: 2025-03-17
- Main languages: Python (99.8%), Makefile (0.1%), Batchfile (0.1%)
- Top contributors: jacqueswww, charles-cooper, iamdefinitelyahuman, fubuloubu, DavidKnott
- Topics: ethereum, ethereum-dapp, language, python, vyper
The jurors are expected to be experts in the Ethereum ecosystem. Some known names include: Jason, Toni Wahrstatter, Ladislaus, Vitalik Buterin, DC Builder, Vectorized and Marius Van Der.
#+end_src
Level 3
#+begin_src markdown
This question is for the `Deepfunding competition`, a distillation prediction market for determining the relative importances of different dependencies in the Ethereum ecosystem.

For this question, we are looking at the prysmaticlabs/prysm repository. You will need to estimate the relative importances of two dependencies of this repository -- i.e. which of their dependencies matters more for prysmaticlabs/prysm.

<QUESTION> coreos/go-systemd and herumi/bls-eth-go-binary are dependencies of prysmaticlabs/prysm. Estimate the ratio of importance of herumi/bls-eth-go-binary compared to coreos/go-systemd for prysmaticlabs/prysm. E.g. if herumi/bls-eth-go-binary is 10 times more important than coreos/go-systemd for prysmaticlabs/prysm then answer "10"; if coreos/go-systemd is 10 times more important than herumi/bls-eth-go-binary then answer "0.1". </QUESTION>

This exact question will be asked to the Deepfunding jury. **Your job is to predict how the Deepfunding jury will answer this question, and answer as close as possible to what the jury will.** To be exact: we will score you based on the mean-squared-error between the log of your answer and the log of the jury's answer.

Your answer must be a positive float.

Parent Repository Information:
[prysmaticlabs/prysm]:
- Description: Go implementation of Ethereum proof of stake
- Stars: 3546, Forks: 1090
- Age: 87 months, Last updated: 2025-03-17
- Main languages: Go (93.6%), Starlark (5.5%), Shell (0.5%)
- Top contributors: terencechain, prestonvanloon, rauljordan, nisdas, rkapka
- Topics: ethereum

Dependency Comparison:
1. coreos/go-systemd vs 2. herumi/bls-eth-go-binary
- Star comparison: coreos/go-systemd has 36.6x more stars than herumi/bls-eth-go-binary
- Activity: Both repositories have been updated recently
- Age: 2013-09-13 vs 2019-10-19
- Languages:
  • coreos/go-systemd: Go, Shell
  • herumi/bls-eth-go-binary: Go, C, C++

The jurors are expected to be experts in the Ethereum ecosystem. Some known names include: Jason, Toni Wahrstatter, Ladislaus, Vitalik Buterin, DC Builder, Vectorized and Marius Van Der.
#+end_src
0.8. appendix: helper csv
just the result of a summarization script on their sample submission file — the main importance is in the num_deps (\(n_A\) for each \(A\)) and num_deps_combinations (\(n_A(n_A-1)/2\)) columns.
important_repo,sum_dep_weights,num_deps,num_deps_combinations,originality,ethereum
https://github.com/prysmaticlabs/prysm,0.9999999999999936,245,29890.0,0.8650592053492349,0.0294117647058823
https://github.com/ethereum/fe,0.9999999999999927,301,45150.0,0.9443691638516738,0.0294117647058823
https://github.com/ethereum/remix-project,0.9999999999999305,2277,2591226.0,0.7100532015645752,0.0294117647058823
https://github.com/eth-infinitism/account-abstraction,0.999999999999991,863,371953.0,0.4286456194839025,0.0294117647058823
https://github.com/wevm/viem,0.9999999999999376,725,262450.0,0.1450951555623104,0.0294117647058823
https://github.com/nethereum/nethereum,0.9999999999999997,57,1596.0,0.4018369295764015,0.0294117647058823
https://github.com/ethers-io/ethers.js,0.9999999999999998,138,9453.0,0.0095957100544701,0.0294117647058823
https://github.com/chainsafe/lodestar,0.9999999999999116,1516,1148370.0,0.8811032731861512,0.0294117647058823
https://github.com/ethereum-lists/chains,0.9999999999999997,6,15.0,0.6236088131630412,0.0294117647058823
https://github.com/sigp/lighthouse,0.9999999999999987,464,107416.0,0.9133744057295108,0.0294117647058823
https://github.com/ethereum/py-evm,1.0,11,55.0,0.317338474800942,0.0294117647058823
https://github.com/hyperledger/besu,0.0,0,0.0,0.774361090611847,0.0294117647058823
https://github.com/erigontech/erigon,0.9999999999999813,253,31878.0,0.932572749866548,0.0294117647058823
https://github.com/vyperlang/titanoboa,0.9999999999999989,27,351.0,0.1964900800509441,0.0294117647058823
https://github.com/alloy-rs/alloy,0.9999999999999994,19,171.0,0.3690905969681286,0.0294117647058823
https://github.com/ethereumjs/ethereumjs-monorepo,0.9999999999999718,828,342378.0,0.8417883513048304,0.0294117647058823
https://github.com/foundry-rs/foundry,0.9999999999999529,482,115921.0,0.6458766968885356,0.0294117647058823
https://github.com/safe-global/safe-smart-account,0.9999999999999712,538,144453.0,0.6011871268121423,0.0294117647058823
https://github.com/consensys/teku,0.9999999999999998,137,9316.0,0.4685428282978935,0.0294117647058823
https://github.com/grandinetech/grandine,0.9999999999999785,438,95703.0,0.9124435469744914,0.0294117647058823
https://github.com/ethereum/sourcify,0.9999999999999243,908,411778.0,0.4089762481898589,0.0294117647058823
https://github.com/ethereum/solidity,1.0,3,3.0,0.1090974834405934,0.0294117647058823
https://github.com/status-im/nimbus-eth2,0.9999999999999982,104,5356.0,0.417548394392558,0.0294117647058823
https://github.com/openzeppelin/openzeppelin-contracts,0.9999999999999536,562,157641.0,0.3373326583791293,0.0294117647058823
https://github.com/ethereum/web3.py,0.9999999999999996,13,78.0,0.8039938317729571,0.0294117647058823
https://github.com/nethermindeth/nethermind,0.0,0,0.0,0.4171925064865261,0.0294117647058823
https://github.com/apeworx/ape,0.9999999999999994,38,703.0,0.3110356991327095,0.0294117647058823
https://github.com/a16z/helios,0.999999999999944,628,196878.0,0.6740220469622176,0.0294117647058823
https://github.com/paradigmxyz/reth,0.9999999999999601,470,110215.0,0.3837737278955573,0.0294117647058823
https://github.com/scaffold-eth/scaffold-eth-2,0.9999999999999284,859,368511.0,0.688816951981115,0.0294117647058823
https://github.com/vyperlang/vyper,1.0,10,45.0,0.9323242887762672,0.0294117647058823
https://github.com/hyperledger-web3j/web3j,0.0,0,0.0,0.2430654500988898,0.0294117647058823
https://github.com/ethereum/go-ethereum,0.9999999999999986,116,6670.0,0.8467503069554304,0.0294117647058823
https://github.com/nomicfoundation/hardhat,0.9999999999999869,1891,1786995.0,0.5435417307291117,0.0294117647058823
Summing num_deps_combinations:

Total number of dependency pairs (num_deps_combinations): 8352618.0
\[\frac{34\cdot(34-1)}2+34+8352618=8353774\]
Footnotes:

[1] For some reason, only 4381 such nodes and 10075 edges are present in the visualization graph.

[2] I checked this is the case with the sample submission.

[3] i.e. it is the ground truth for the logits of your weights; we take the MSE of your logits against this.

[4] I'm not really sure of the exact format this data is stored in. The competition instructions state it is stored as `https://github.com/b1,https://github.com/b2,advantage_b_over_a`, but this doesn't make sense, as it must also include \(A\), since multiple projects \(A\) can have the same dependencies \(B_1,B_2\). In general I don't know where to find the train and public test datasets mentioned in the competition instructions.

[5] BLEG: I'm not sure how this is currently being done, so leaving it abstract — is it just the average weighted by past peer scores earned?