Submission of Entries to the Deep Funding Mini Contest

Hello model builders,

Consider this thread as your home for sharing all things related to your submissions in the mini contest part 1 and part 2.

Your write-up here will determine prize distributions.

We encourage you to be visual in your submissions to show the weights given to the models, share your Jupyter notebooks or the code used in your submissions, explain differences in performance of the same model in part 1 vs part 2 (you can also submit to only one contest if you like), point out datasets that may be useful to other participants, and include any other information you deem valuable and want the judges to consider.

The format of submissions is open ended and free for you to express yourself the way you like. You should make early posts with preliminary results, since being early helps other participants and is viewed favorably by judges.

If you make strides in your model predictions, you can edit your earlier post with the new information or make another post and link to it from your first post. The totality of content you share in this thread will be considered by the judges.

Good luck predictoooors

7 Likes

Baseline predictions to assess information gain from external data and ML methods

Our submissions were not intended to be top scorers, but rather to serve as a baseline for assessing how much data leakage there is in part 1 and how much information is in the training set without accessing external data or using any ML methods.

For both parts, our methods were designed to meet the transitivity requirement (e.g. A<B and B<C must imply A<C) by using a proxy “score” assigned to each project rather than calculating the pairwise comparisons directly. That is, the algorithms calculate a single numeric score for each project, then we use the ratio between two scores to calculate the respective weights between two projects.

Part 1 (MSE = 0.0266)

Code: deepfunding-mini-contest/baseline_submission.ipynb at baseline · BonAppResearch/deepfunding-mini-contest · GitHub

Algorithm:

  1. Sort the projects by the number of occurrences in the data.
  2. Starting from the project with the most comparisons (start_project), we assign a score of 1, then we pick the next most compared project that has a connection with this start_project as the next node in the path.
  3. Continue this process while removing visited nodes at each turn, until the path ends (there is no unvisited node connected to the current node).
  4. Normalize the scores of nodes on this path by doing an L2 normalization.
  5. Pick the project with the next most comparisons as start_project, and repeat steps 2 to 4.
  6. Then do an average of the scores obtained by each node (project).
  7. Use the averaged score to calculate the weights of each comparison in the testing data (a code sketch follows this list).
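Below is a minimal sketch (not the actual notebook) of how this path-based scoring could be implemented, assuming a dataframe train with columns project_a, project_b, and weight_a, and assuming scores are propagated along the path via the observed pairwise weight ratio, which the steps above do not spell out.

import numpy as np
import pandas as pd
from collections import defaultdict, Counter

def path_scores(train: pd.DataFrame) -> dict:
    # Step 1: count how often each project occurs in the comparisons.
    counts = Counter(train["project_a"])
    counts.update(train["project_b"])

    # Pairwise weight lookup and "has been compared with" graph.
    pair_w, neighbours = {}, defaultdict(set)
    for a, b, w in zip(train["project_a"], train["project_b"], train["weight_a"]):
        pair_w[(a, b)], pair_w[(b, a)] = w, 1.0 - w
        neighbours[a].add(b)
        neighbours[b].add(a)

    all_scores = defaultdict(list)
    for start, _ in counts.most_common():            # step 5: next start_project
        visited = {start}
        path, scores, node = [start], [1.0], start   # step 2: start score = 1
        while True:                                  # step 3: extend until stuck
            candidates = [n for n in neighbours[node] if n not in visited]
            if not candidates:
                break
            nxt = max(candidates, key=lambda n: counts[n])
            w_node = pair_w[(node, nxt)]             # weight of `node` vs `nxt`
            # Assumed propagation rule: keep the observed pairwise weight ratio.
            scores.append(scores[-1] * (1.0 - w_node) / max(w_node, 1e-6))
            path.append(nxt)
            visited.add(nxt)
            node = nxt
        normed = np.array(scores) / np.linalg.norm(scores)   # step 4: L2 normalization
        for project, s in zip(path, normed):
            all_scores[project].append(s)
    # Step 6: average the scores obtained for each project.
    return {p: float(np.mean(v)) for p, v in all_scores.items()}

def predict_weight_a(scores: dict, a: str, b: str) -> float:
    # Step 7: the predicted weight is the ratio between the two proxy scores.
    sa, sb = scores.get(a, 1e-6), scores.get(b, 1e-6)
    return sa / (sa + sb)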

The decent score obtained by this non-ML method shows that there was data leakage from the test set, since all projects in the testing data are also present in the training data.

Part 2 (MSE = 0.1057)

Code: GitHub - BonAppResearch/deepfunding-mini-contest

Since the funding amount is added as part of the data, there is no need to assign a “score”. Instead, the funded amount was used directly as a proxy “score”.

Algorithm:

  1. The funded amount of each project is calculated by multiplying its weight_{a,b} with total_amount_usd.
  2. A project has a different funded amount from each funder and in each quarter. A weighted average is taken by weighting each funder at each round by its proportion of the total funding present in the data (note: this is not the true total funding, but an estimate that does not use external data).

An alternative method of simply summing all funded amounts (i.e. equal weights) was also explored, with MSE = 0.1068.

  3. For pairs of projects that are both present in the testing data AND the training data (disregarding funder/quarter), predictions are made by taking the ratio between the weighted-average funded amounts.
  4. For projects that are not present in the training data, a dummy ratio of 0.5 is assigned.

Since there is no data leakage between the training and testing data in Part 2, lower performance is expected. However, the relative success of this non-ML method shows that past funded amounts are a key predictor of future funded amounts.

4 Likes

Deepfunding-Agent

We believe AI Agents can make better funding decisions than humans:
AI Agents collect more data than humans.
AI Agents provide better reasoning for their decisions.

GitHub: GitHub - SHAO-Jiaqi757/deepfunding

This project aims to serve as an AI Agent framework for Deepfunding.
Everyone is welcome to build their own agent and contribute to this project.

How It Works

  1. Metrics Collector
  • Gathers repository metrics from OSO
  • Fetches README content
  • Searches online for additional information
  • Sends the metrics to the Analyzer Agents
  2. Multi Analyzer Agents
  • Each agent reads the metrics and provides a structured analysis including weights, reasoning, and confidence.
  • Project Analyzer: Evaluates technical aspects and project fundamentals
  • Funding Strategist: Focuses on funding history and resource allocation
  • Community Advocate: Analyzes community engagement and ecosystem impact
  3. Validator: Checks that each agent’s analysis is comprehensive and well-justified. If not, it is sent back to the analyzer agent for revision.
  4. Consensus: Combines the results from all agents and calculates the final weights (a sketch follows this list).
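A minimal sketch of how the consensus step could combine the analyzers' outputs, assuming each agent returns a weight, a confidence, and its reasoning (the field names are illustrative, not the repository's actual schema):

from typing import Dict, List

def consensus(analyses: List[Dict]) -> float:
    # Confidence-weighted average of the agents' weight estimates for project A.
    total_confidence = sum(a["confidence"] for a in analyses)
    if total_confidence == 0:
        return 0.5  # fall back to an uninformative split
    weight_a = sum(a["weight_a"] * a["confidence"] for a in analyses) / total_confidence
    return min(max(weight_a, 0.0), 1.0)

# Example with the three analyzer roles (numbers are made up):
print(consensus([
    {"weight_a": 0.40, "confidence": 0.8, "reasoning": "Project Analyzer"},
    {"weight_a": 0.35, "confidence": 0.6, "reasoning": "Funding Strategist"},
    {"weight_a": 0.42, "confidence": 0.7, "reasoning": "Community Advocate"},
]))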

Results

For now, we only have a small set of results for Hugging Face Part 1, so no MSE is available.

We noticed that the agents’ results are quite different from the humans’.

Case Study

Take a look at this extreme example. All analyzer agents gave the opposite result from the humans: they think teku should get more funding than prettier-plugin-solidity.

| Repository | Agent Score | Human Score |
|---|---|---|
| prettier-solidity/prettier-plugin-solidity | 0.38 | 0.67 |
| consensys/teku | 0.62 | 0.33 |
3 Likes

I submitted under my last name, Niemerg. The code for my submission can be found at: GitHub - aniemerg/mini-contest

My approach started with a simple question: could an LLM extract meaningful signals from project documentation?

I explored using GPT-4o to extract features from project documentation for predicting the relative funding preferences between open source projects. Rather than extensively engineering features or deeply optimizing the approach, I wanted to see what insights we could gain from having GPT-4o analyze project READMEs.

The feature extraction process used GPT-4o to analyze GitHub README files to determine potential features. This resulted in a total of 41 feature dimensions. I then had the model evaluate each README across those 41 dimensions.

The features include both explicit signals directly visible in the documentation (like API design and testing practices) and implicit signals that might indicate project health (like enterprise readiness and community engagement). This automated approach generated a rich set of features covering technical complexity, documentation quality, community health, project maturity, security practices, and more.

The prediction pipeline combined these extracted features with XGBoost for prediction. Initial analysis revealed an interesting challenge: the training data showed a strong U-shaped distribution, with many values near 0 and 1, but our initial predictions showed a more bell-shaped distribution that failed to capture these extremes.

This observation led to implementing logit and arctanh transformations on the target values before training, which significantly improved our ability to capture the U-shaped nature of the preferences. The enhanced model achieved a training MSE of 0.0310 and test MSE of 0.0517.
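A hedged sketch of the target transformation (logit shown; arctanh is analogous), assuming XGBoost and placeholder data standing in for the 41 extracted README features:

import numpy as np
import xgboost as xgb

EPS = 1e-4  # keep targets away from exactly 0 or 1 before transforming

def to_logit(y):
    y = np.clip(y, EPS, 1 - EPS)
    return np.log(y / (1 - y))

def from_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(400, 41)), rng.normal(size=(100, 41))  # placeholder features
y_train, y_test = rng.uniform(size=400), rng.uniform(size=100)            # placeholder weight_a

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
model.fit(X_train, to_logit(y_train))        # train in logit space
preds = from_logit(model.predict(X_test))    # map predictions back to [0, 1]
mse = float(np.mean((preds - y_test) ** 2))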

Analysis of feature importance revealed interesting insights about what drives funding decisions. The most predictive features were community size (0.108), corporate backing (0.069), backward compatibility (0.046), setup complexity (0.045), and learning curve (0.044). This suggests funders heavily weight community metrics and enterprise-readiness indicators in their decisions. The prominence of backward compatibility and setup complexity hints at a preference for stable, accessible projects.

This approach could be complementary to methods using repository metrics like stars and forks or other data sources. While I focused on a straightforward implementation, there’s room for feature engineering refinement and model parameter optimization. My hope is that others in the community might find these GPT-4o extracted features useful in combination with their own approaches.

2 Likes

Predictive Funding Challenge

These are my solutions for both challenges. The principles I tried to keep in mind are:

  • Keep the data sources and transformations minimal
  • Keep the model simple and fast
  • Don’t use funding data¹

¹ I initially tried to avoid using any funding data to “specialize” on projects that might be “new” in a round, and realized the models also work well for veteran projects. Also, using funding data for the HuggingFace competition leaks data as the target is averaging data across previous funding rounds.


Data Pipelines and Models are Open Source and available at GitHub.

davidgasquez/predictive-funding-challenge

Part 1

Given the dataset, I focused on getting each project’s latest data from GitHub and using only that and some simple features on top to make predictions.

Features like size, watchers, forks, issues, stars turned out to be important and correlated well with the target. Engineering features like ratios and dates also helped quite a bit. This was interesting to see given some of the charts I was generating during the EDA.

There is one trick, though, that made the final MSE much better: mirroring the data. Switching project_a with project_b and appending the results to the original dataset essentially doubled the amount of training data. This is what the target feature distribution looks like after mirroring.
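A minimal sketch of the mirroring trick, assuming the training frame has columns project_a, project_b, and weight_a (any per-project feature columns would be swapped the same way):

import pandas as pd

def mirror(train: pd.DataFrame) -> pd.DataFrame:
    flipped = train.rename(columns={"project_a": "project_b",
                                    "project_b": "project_a"}).copy()
    flipped["weight_a"] = 1.0 - flipped["weight_a"]  # the target flips with the pair
    return pd.concat([train, flipped], ignore_index=True)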

I tried more features; some of them did help (like README embeddings), but they slowed down training too much. It was important to keep training fast to iterate quickly on the model hyperparameters.

For training, I initially went with a LightGBM model. To decide weights, I used a grid search on top of a 5-fold cross validation. Once the model was fitted, I could get the feature importance and see which features were most important and iterate on them fast (the entire fit and predict process takes less than 3 minutes).

More importantly, improving the local MSE also improved the leaderboard score, which is a good sign that the model is working well. Now, I could focus on optimizing the model for this specific dataset, doing a better hyperparameter search or model stacking.

Limitations

Given the shape of the datasets and what they represent, the leaderboard score is not a good indicator of the model performance in real life. The models are useful to derive feature importance and explore the space, but have some important issues. There has already been some discussion about this in the Telegram group, but I wanted to share some concerns.

  1. By getting only the latest GitHub data, we are leaking information from the future.

  2. The weight_a aggregates weights across rounds, so using any kind of funding data (or things like Pond train dataset weights) leaks lots of information.

  3. The same projects appear in both training and test sets, so models will overfit to them if the project_a and project_b features are used (this is similar to the approach @cathie took, also called Target Encoding).

Model Details

The final submission uses an average of the best LightGBM model submissions with different hyperparameters. The model achieves an MSE of 0.0106 on 5-fold cross validation locally, which translates to a leaderboard MSE score of 0.0114.

2025-01-30: Added some post-processing to the weight predictions to update the target variable distribution to match the training distribution. This smaller model makes the leakage even stronger and gets a score of ~0.007. Training this model with the Pond dataset would make the leakage even stronger and impact the score even more.

Feature Engineering

  1. Mirroring the dataset. Duplicate the training samples for free (and profit)!
  2. GitHub: Direct metrics like stars, watchers, forks, issues, and size.
  3. Ratios: Comparing metrics between project pairs (e.g., stars_ratio, forks_ratio)
  4. Interactions: Combined metrics like stars*forks to capture engagement
  5. Temporal Features: Repository age and update frequency. Didn’t want to spend much time there, as they leak quite a bit of information.

Part 2

For the second part, I took a similar approach with some key differences given the nature of the data. The final model is a multi-model ensemble of LightGBM regressors with TabNet and CatBoost models. All of them trained with time-series aware cross validation to ensure we’re not looking into the future when making predictions.

Key Findings

  • Project identifiers (URLs) are the most important features, suggesting strong project-specific patterns
  • The temporal features (year_quarter, year) are also significant, indicating clear time-based trends
  • Organization names carry substantial predictive power
  • Funder information helps but is less important than other features. That said, I generated averages at the funder level to help the models (average generosity, funded projects, total stars, …)
  • The model heavily relies on project identifiers, which might not generalize well to completely new projects. For this case, having some kind of embeddings for the project code would help.
  • Adding more features like Criticality Score didn’t do much.
  • There is a data leaking issue when using total_amount_usd variable¹.
  • Post-processing wise, you can push predictions toward 0/1 while maintaining a + b = 1 to better fit the distribution, or train a beta distribution and transform the predictions through the inverse CDF.

¹ The total_amount_usd variable wouldn’t be known at prediction time in a real scenario. That means that you can do some target encoding with project_a / project_b / organization and get around ~0.04 MSE (third place) without even using any external data (no gh stars, forks, issues, etc.).

4 Likes

Explicitly increasing model diversity to improve ensemble quality: training models with correlation penalties

Part 1

An explicitly mentioned goal of DeepFunding is to develop models that work together well in an ensemble. However, in a prediction contest, chasing leaderboard ranking is the natural goal for individual model developers, yet this goal could run counter to creating an ensemble of diverse models.

My strategy was to intentionally seek to increase model diversity in the ensemble by submitting a model that is 1) explicitly trained to be diverse, and 2) explicitly not trying for a top leaderboard ranking.

I worked off of davidgasquez’s model and repository, which is the #1 model on part 1’s leaderboard as of 1/20/25. I used the same data featurization approach (excluding target-encoding features, which were previously noted to cause overfitting), and re-trained a proficient model according to davidgasquez’s code. This model represents the #1 model*. I then trained a series of new models with a modified loss to explicitly encourage them to make different errors from the #1 model. The loss is mse + alpha * error_correlation, where error_correlation is the Pearson correlation between the #1 model’s errors and the new model’s errors. By varying alpha, we trade off between increasing model diversity relative to the #1 model and performance (MSE). I then intentionally submitted a higher-MSE but highly diverse model.
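A hedged sketch of the correlation-penalized loss (the actual submission trains gradient-boosted models; a linear model with scipy keeps the illustration short, and the data here is synthetic):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=300)
baseline_errors = 0.1 * rng.normal(size=300)   # stand-in for the #1 model's errors

def loss(w, alpha=1.0):
    errors = X @ w - y
    corr = np.corrcoef(errors, baseline_errors)[0, 1]  # Pearson correlation of errors
    return np.mean(errors ** 2) + alpha * corr         # mse + alpha * error_correlation

# Larger alpha trades MSE for lower error correlation with the baseline model.
diverse_weights = minimize(loss, x0=np.zeros(10), args=(1.0,)).x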

This process/experiment yielded some empirical insights: it is possible to significantly increase model diversity without substantially reducing training performance.

Overall, this work suggests that correlation penalties have promise for increasing model diversity. Notably, I believe that correlation penalties (this method) are very rarely used in the setting of an ML prediction contest, because in most contests, submission code/etc are closed until after the contest ends. For future work, this approach may be extended to multiple submitted models.

Code is open and reproducible: GitHub - maxwshen/predictive-funding-challenge

Other contributions

I wrote a blog post On diversity and many-model ensembling: AI government & AI-augmented public goods funding | argmax(blog) analyzing the scoring and ensembling approach here: GitHub - deepfunding/scoring: The Deep Funding scoring mechanism. I highlight some properties which may inhibit DeepFunding from scaling to larger projects in the future, specifically in the case where there are more AI models than human validation datapoints. In this scenario, solving for model weights via MSE ensembling is an underdetermined least-squares problem. There is no unique solution, which can result in bikeshedding over why one set of model weights was picked over another. Furthermore, this can result in potentially poor performance to unseen projects, despite the model ensemble perfectly fitting (i.e., overfitting) the validation data used to decide the ensemble weights. In this blog post, I provide an extensive analysis on this and other topics, and propose a simple alternative solution, which can be evaluated in parallel or retrospectively without requiring any protocol change to DeepFunding.

3 Likes

I used data queried from OSO and the HistGradientBoostingRegressor to derive predictions for the variable weight_a. The code for this analysis can be found in my GitHub repository. (GitHub - hara-desu/DeepFundingMiniContest: Code to submit for DeepFunding mini contest)

For the HuggingFace competition, I utilized the dataset provided in Eval Science’s mini tutorial, merging it with additional features obtained from the OSO database. In the Pond competition, I worked with two datasets: one containing 32 features from OSO and another with 22 features. I also calculated ratios and differences for numerical features where appropriate.

Data Processing and Predictions

Some of the resulting predictions across all datasets fell outside the range of 0 to 1. Consequently, I capped predictions exceeding 1 at 1 and set negative predictions to 0.

I observed that excluding a project’s GitHub URL and removing features with high Spearman correlation coefficients reduced overfitting and improved the Mean Squared Error (MSE) on the test sets for both the HuggingFace and Pond datasets.
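A minimal sketch of the model plus the clipping step, assuming the OSO-derived features are already prepared (placeholder data below):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(500, 22)), rng.normal(size=(100, 22))  # placeholder features
y_train = rng.uniform(size=500)                                           # placeholder weight_a

model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.05)
model.fit(X_train, y_train)
preds = np.clip(model.predict(X_test), 0.0, 1.0)   # cap at 1 and floor negatives at 0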

Results Summary

  1. MSE, R-squared and Adjusted R-squared
  • HuggingFace:
    • Training Data
      • MSE: 0.0127
      • R-squared: 0.89184
      • Adjusted R-squared: 0.88057
    • Test Dataset MSE: 0.014158
  • Pond Dataset (32 Features):
    • Training data
      • MSE: 0.03114
      • R-squared: 0.79508
      • Adjusted R-squared: 0.7939
    • Test Dataset MSE: 0.0627
  • Pond Dataset (22 Features):
    • Training data
      • MSE: 0.0343
      • R-squared: 0.77424
      • Adjusted R-squared: 0.77337
    • Test Dataset MSE: 0.0518

The lowest MSE for Pond data on the test set (0.0518) was achieved with the 22-feature dataset, although its training results were inferior to those of the 32-feature dataset. This suggests that models trained on Pond data may be prone to overfitting, as reducing the feature count resulted in a poorer training MSE but a better test MSE.

  2. Feature Importance Comparison Across Datasets

Feature importances were assessed using the permutation_importance function from scikit-learn (4.2. Permutation feature importance — scikit-learn 1.5.2 documentation), which evaluates how shuffling a feature’s values affects model performance.

  • In both training and testing data from HuggingFace, the model placed significant weight on maintainer_a and maintainer_b, indicating that the project owner plays a crucial role in weight allocation decisions. However, if the projects included in the train and test datasets were different, the maintainers would also likely be different, and this introduces the overfitting (or data leakage) problem that was observed by others. If this is the case, then the bar plots below illustrate it well.

  • The 32-feature Pond dataset did not include maintainer features; instead, the model primarily focused on star_count across both datasets, along with other significant features such as contributor count difference, repository count, total funding amount in USD, and year of repository creation.

  • In the 22-feature dataset, while contributor count difference, repository count, and total funding amount remained significant, watcher_count emerged as the most important feature instead of star count. This highlights that user interest in a repository is a significant factor in funding allocation.

  3. Correlation Among Features

There are notable similarities in correlated features across the three datasets. Features with an absolute Spearman correlation coefficient greater than 0.8:

  • Star Count & Fork Count
  • Webpage & Twitter
  • Grant Pools for Project A & Grant Pools for Project B
    • Evident given the small number of grant pools
  • Developer Count & Contributor Count

Future improvements to the model and datasets would involve adding features to increase R-squared while also improving test MSE.

Deep Funding Analytics Challenge :bar_chart:

This repository contains the solution to the Deep Funding Analytics Challenge. The approach leverages advanced machine learning techniques, integrates dependency graph analysis, and is designed for scalability and future enhancements, such as real-time data integration using Airbyte.

GitHub Repository: felixphilipk/Deep_Funding_Analytics_Challenge


Problem Overview

The challenge involves predicting funding allocations between pairs of open-source software repositories. The data provided includes:

  • Repositories: A list of open-source software repositories with various attributes.
  • Pairwise Comparisons: Data indicating which repository received more funding in pairwise comparisons.
  • Dependency Graph: Information about dependencies between repositories.

Solution Approach

1. Pairwise Comparison Model Using the Bradley-Terry Framework

We implemented a probabilistic Bradley-Terry model to handle pairwise comparisons between repositories. This model is ideal for ranking and predicting outcomes based on pairwise data.

Repository Strength Parameters (\(\beta_i\)):

Each repository is assigned a parameter representing its “strength” or propensity to receive funding.

Probability Calculation:

The probability that repository i receives funding over repository j is calculated as:

\[
P(i \text{ receives funding over } j) = \frac{\beta_i}{\beta_i + \beta_j}
\]

Log-Transformation for Optimization:

To facilitate optimization and improve numerical stability, we use the logarithm of the strength parameters:

\[
\theta_i = \log(\beta_i)
\]

Substituting into the probability equation, we obtain:

\[
P(i \text{ receives funding over } j) = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}
\]


2. Optimization with the CMA-ES Algorithm

To estimate the parameters (\(\theta_i\)), we employed the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) optimizer, which is effective for non-linear, non-convex optimization problems.

Objective Function:

We minimize the Mean Squared Error (MSE) between the observed probabilities in the training data and the model’s predicted probabilities:

\[
\text{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( P_{\text{model}}(k) - P_{\text{observed}}(k) \right)^2
\]

Regularization:

To prevent overfitting and improve generalization, an L2 regularization term is added:

\[
\text{Regularization} = \lambda \sum_{i} \theta_i^2
\]

Total Objective Function:

Combining the MSE and the regularization term gives:

\[
\text{Objective} = \text{MSE} + \text{Regularization}
\]
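A hedged sketch of this objective with the cma package (synthetic pairs stand in for the training comparisons; the dependency-graph penalty from the next section would be added to the same objective):

import numpy as np
import cma

n_repos = 50
rng = np.random.default_rng(0)
pairs = [(int(i), int(j), float(rng.uniform()))
         for i, j in rng.integers(0, n_repos, size=(200, 2)) if i != j]  # placeholder comparisons

def objective(theta, lam=1e-6):
    mse = 0.0
    for i, j, p_obs in pairs:
        p_model = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))  # Bradley-Terry probability
        mse += (p_model - p_obs) ** 2
    return mse / len(pairs) + lam * np.sum(theta ** 2)           # MSE + L2 regularization

es = cma.CMAEvolutionStrategy(np.zeros(n_repos), 0.5, {"maxfevals": 20000})
es.optimize(objective)
beta = np.exp(es.result.xbest)   # back to strength parameters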


3. Incorporating the Dependency Graph

The dependency graph is integrated into the model to further enhance predictions.

Penalty Terms for Dependencies:

For each edge in the dependency graph, a penalty is added to the objective function based on the difference in \(\theta\) values of the source (\(i\)) and target (\(j\)) repositories:

\[
\text{Penalty}_{\text{edge}} = \left( (\theta_j - \theta_i) - \log(\text{edge\_weight} + \varepsilon) \right)^2
\]

Here, \(\varepsilon\) is a small constant introduced to avoid taking the logarithm of zero.

Default Edge Weights:

For edges without specified weights, a meaningful default value is assigned so that all dependencies contribute to the model.


4. Hyperparameter Tuning and Model Refinement

Hyperparameters are carefully tuned to optimize model performance.

  • Regularization Parameter (\(\lambda\)):
    Values such as \(1 \times 10^{-6}\) and \(1 \times 10^{-5}\) were experimented with to balance overfitting and underfitting. The optimal value was chosen based on the validation MSE.

  • Optimizer Settings:

    • Max Evaluations: Adjusted to ensure convergence without unnecessary computation.
    • Population Size: Determined based on the number of repositories to enhance optimizer performance.
    • Sigma Values: Initialized relative to the \(\theta\) values to effectively guide the optimization process.
  • Validation:
    MSE is monitored on a validation set and cross-validation techniques are used to assess model generalization.


5. Prediction and Output Generation

Computing Strength Parameters:

After optimization, the strength parameters for each repository are calculated as follows:

\[
\beta_i = e^{\theta_i}
\]

Making Predictions:

For each pair in the test data, the prediction is made as follows:

\[
P(\text{project\_a receives funding over project\_b}) = \frac{\beta_{\text{project\_a}}}{\beta_{\text{project\_a}} + \beta_{\text{project\_b}}}
\]

Output:

An output is generated that contains the predicted probabilities, ensuring compatibility and facilitating ease of analysis.


Innovations and Advantages of the Solution

Advanced Machine Learning Techniques

  • Probabilistic Modeling:
    The use of the Bradley-Terry model captures complex funding allocation dynamics.

  • Effective Optimization:
    The CMA-ES optimizer enables efficient optimization in challenging, high-dimensional spaces.

Integration of the Dependency Graph

  • Enhanced Accuracy:
    By considering repository dependencies, the model better reflects real-world funding influences.

  • Structural Awareness:
    The model accounts for key dependencies, potentially highlighting foundational projects.

Future-Ready with Real-Time Data Integration

  • Airbyte Integration Plans:
    Real-time data ingestion from various sources will be used to constantly update repository metrics.

  • Continuous Improvement:
    Incorporating the latest trends ensures that the model remains current and adaptive.

1 Like

Deep Funding Predictive Challenge Mini-Contest

I submitted my results for the competition as follows:

  • Part 1: Result named allen0915 on Hugging Face Space
  • Part 2: Result named Allen Chu on the Pond platform

Public Leaderboard Scores:

  • Part 1 (MSE): 0.0194
  • Part 2 (MSE): 0.0594

Reference:


Approach and Methodology

Data Collection and Feature Engineering

The provided datasets (dataset.csv and test.csv) contained limited information: two GitHub project links and their relative funding weights (weight_a + weight_b = 1). These features alone were insufficient for building a robust machine learning model.

To address this, I utilized the GitHub API to extract comprehensive repository metadata. The raw data was preprocessed and refined into structured features. Below are the key features engineered through this process:

  • Repository Attributes:

    • is_fork
    • fork_count
    • star_count
    • watcher_count
    • language
    • license_spdx_id
    • repo_created_at
    • repo_updated_at
    • repo_exist_days (calculated as the difference between the current date and creation date)
  • Commit Information:

    • first_commit_time
    • last_commit_time
    • commit_count
    • work_days (difference between the first and last commit times)
    • days_with_commits_count (total days with commits)
  • Community Engagement:

    • contributors_to_repo_count

Additionally, I employed Vision-Language Models (VLM) and web-crawling techniques to augment the dataset further. Using Large Language Models (LLMs), I created a set of qualitative metrics to evaluate repository quality. The evaluation criteria and prompts were as follows:

{
    "Readme_score": "Rate the quality of the README on a scale of 1 to 5, with 5 being the best and 1 being the worst. Decimals are allowed.",
    "technical_innovation": "Rate the technical innovation of the repository on a scale of 1 to 5, with 5 being highly innovative and 1 being not innovative.",
    "community_engagement": "Rate the community engagement on a scale of 1 to 5, with 5 being highly engaged and 1 being poor engagement.",
    "accessibility": "Rate the friendliness to new contributors on a scale of 1 to 5, with 5 being very friendly and 1 being not friendly."
}

These features were paired to create input datasets where the information for project_a and project_b was used to predict the target label weight_a.

Solution and Model Development
For model training, I experimented with various algorithms, ranging from non-tree-based methods (Linear Regression, SVM) to tree-based ensemble methods (Random Forest, XGBoost). Among these, XGBoost outperformed the others in MSE score.

Feature Selection
To refine the model further, I analyzed feature importance scores generated by XGBoost. Below is an example of the feature importance plot:

By iteratively testing different thresholds for feature selection, I identified the top 20 most important features that minimized the MSE on the testing data. These features were then used as inputs for the final model.
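A hedged sketch of this threshold search, assuming a prepared feature frame (placeholder data below) rather than the actual engineered features:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(600, 40)),
                 columns=[f"feat_{i}" for i in range(40)])
y = rng.uniform(size=600)                      # placeholder weight_a targets

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
full = xgb.XGBRegressor(n_estimators=300).fit(X_tr, y_tr)
ranking = pd.Series(full.feature_importances_, index=X.columns).sort_values(ascending=False)

best_mse, best_k = np.inf, None
for k in (10, 20, 30, 40):                     # try different feature-count thresholds
    cols = ranking.index[:k]
    model = xgb.XGBRegressor(n_estimators=300).fit(X_tr[cols], y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te[cols]))
    if mse < best_mse:
        best_mse, best_k = mse, k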

Future work

  • Model Optimization: Fine-tuning XGBoost hyperparameters to achieve even lower MSE scores.
  • Feature Expansion: Exploring additional GitHub repository features and testing combinations to improve predictive accuracy.
  • LLM Utilization: Further integrating LLMs for qualitative evaluation of repositories. As some research papers suggest, LLMs can effectively analyze projects by considering multiple complex factors beyond numeric metrics like star_count and fork_count.
  • Refactoring the code for easier reuse!

This competition has been an exciting opportunity to experiment with innovative approaches for predicting funding weights and enhancing model performance.

1 Like

Approach

Other submissions excelled at creating multi-feature models optimized against funding data. Some explicitly sought novelty by incorporating correlation penalties or adopting agent-based approaches. To further increase model diversity, this submission takes a principled rather than a purely statistical approach. Below are the principles guiding this submission:

  1. The repository dependency tree is the foundational data layer.
  2. AI will accelerate the rate of code generation.
  3. Funding should minimize rent-seeking behavior.

Principle 1: Dependency Tree as the Foundation

This submission’s algorithm relies solely on the dependency graph, excluding external factors such as previous funding, stars, social media presence, or other metadata. This ensures the model remains focused on the structural relationships between repositories.

Principle 2: AI-Driven Code Generation and Refactoring

This submission assumes that AI will not only speed up code implementation but also simplify refactoring and dependency reorganization, reducing the need for direct human labor. Over time, dependencies may become less about saving labor costs and more about ensuring data availability and trusted verification of code changes. Consequently, funding models should prioritize adaptability to AI-driven changes rather than replicating historical funding patterns. This submission keeps the model simple enough for humans to reason about how repositories might adapt to capture funding.

Principle 3: Minimizing Rent-Seeking

This submission considers not just the current state of the dependency tree but also changes in its state. New funding should only reward new updates, ensuring that repositories are incentivized to actively contribute rather than passively benefit from past work.

Weight Distribution Algorithm

  1. Calculate Scores:

    • For each project, compute the score as: Score = (Edge_Count + 1) * (Release_Count_Since_Last_Funding + 1)
    • The last 90 days from the submission date were used to generate the data for this submission (see the sketch after this list).
  2. Calculate Total Score:

    • Sum the scores of all projects to get the Total Score.
  3. Determine Weights:

    • For each project, calculate its weight as: Project Score / Total Score
    • This ensures weights are proportional to each project’s contribution to the total score.
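A minimal sketch of the weight distribution, with made-up edge and release counts:

projects = {
    "repo_a": {"edge_count": 12, "releases_since_last_funding": 3},  # made-up numbers
    "repo_b": {"edge_count": 4, "releases_since_last_funding": 0},
    "repo_c": {"edge_count": 7, "releases_since_last_funding": 1},
}

scores = {name: (p["edge_count"] + 1) * (p["releases_since_last_funding"] + 1)
          for name, p in projects.items()}
total_score = sum(scores.values())
weights = {name: score / total_score for name, score in scores.items()}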

Performance Metrics

  • Mean Square Error in Part 1 Hugging Face Mini Contest: 0.17497461981091622
  • Mean Square Error in Part 2 Crypto Pond Mini Contest: 0.1741

Submission Repository

Link to Radicle Repository


Potential Gaming and Mitigation

The most obvious way to game this algorithm is for a repository to create unnecessary releases. To mitigate this, the algorithm could be refined to only count releases merged into the donor/judge’s selected root nodes and their dependencies. However, two repositories (nodes) could still collude, with one node merging unnecessary releases. To address this, a hierarchical distribution of funds could be implemented, starting from the donor/judge’s selected root nodes. Future work could improve data availability around merged release counts and explore the game-theoretic implications of this principled model.


Conclusion

If the game theory around principled funding algorithms interests you, please reach out to collaborate on future rounds of this contest. If the number of merged releases seems like a valuable feature for your multimodal or agent-based model, I’d also like to collaborate. Currently, there is no straightforward way to query how often a repository updates its dependencies within a given timeframe, but this is an area worth exploring further.

1 Like

Deep Funding Machine Prediction and Cartography with the Omniacs.DAO

Our submission to the Deep Funding Mini-Contest leverages a high-level feature embedding of the funded repositories, in combination with standard best practices for modelling tabular data, presented with a novel “Funding Map” for added insight into the funding performance of the projects of interest.

  • Part 1: Submission under “Omniacs.DAO” on Hugging Face with an MSE of 0.0145
  • Part 2: Submission under “Omniacs.DAO” on CryptoPond with an MSE of 0.0564
  • Code and Data: GitHub - OmniacsDAO/deepfundSubmission

Executive Summary

  • We used a combination of GitHub repo features, in addition to a Nomic embedding representation of the repo README files, as inputs into a grid-search-optimized gradient boosting machine to achieve a top 10 placement on both part 1 and part 2 of the mini-contest.
  • We then applied linear and non-linear dimensionality reduction techniques to the embeddings to generate “Funding Maps” that show not only groups of similar repositories but also their funding amounts, highlighting both under- and over-performing projects.
  • As a contribution to the Deep Funding community, we open sourced all of our post processed data for easier replication of our modeling efforts. Our code has also been published as well.

Approach

In a quick “cookbook” format, our approach to developing models first consisted of collecting both training datasets from the HuggingFace and CryptoPond platforms. We then supplemented the data with information from the Open Source Observer via BigQuery, as well as procuring the project README.md files via the GitHub API and extracting a vectorization of the text using the nomic-embed-text:v1.5 embedding model. We then utilized a grid search to find the optimal hyperparameters for a standard gradient boosted model. This approach resulted in top 10 placements on both leaderboards at the time of writing.

Deep Funding Cartography

To take our results one step further, we leveraged manifold learning as a dimension reduction technique to create what we are calling a “Funding Map” of the projects included in the training set, laid out by the similarity of their README.md files. These README files, when vectorized, can serve as quantitative proxies for similarity once combined with a distance metric like Euclidean distance. Before creating the map, however, we first had to derive a proxy ranking of the funding performance of the repos. This was done by using a simple sum of the predicted weights from our model.
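A rough sketch of the mapping step (the actual pipeline may differ in tooling; the data below is placeholder): reduce the README embeddings to two dimensions and color each repository by its funding-rank proxy.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(150, 768))     # placeholder for the Nomic README embeddings
funding_proxy = rng.uniform(size=150)        # e.g. sum of predicted weights per repo

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
is_top = funding_proxy >= np.quantile(funding_proxy, 0.8)      # top 20% -> green
is_bottom = funding_proxy <= np.quantile(funding_proxy, 0.2)   # bottom 20% -> red
colors = np.where(is_top, "green", np.where(is_bottom, "red", "grey"))
# coords[:, 0], coords[:, 1] and colors can then be fed to any scatter plot.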

This allowed us to apply a simple coloring scheme where the top 20% are labeled green while the bottom 20% are labeled red. The resulting “Funding Map” is below.

One really cool insight highlighted from the map is that related repos don’t necessarily receive the same amount of funding.

In the section of the map above, the Coinbase Wallet SDK is highlighted as a green project, while the WalletConnect monorepo, designed as an “Open protocol for connecting Wallets to Dapps”, severely underperforms. We can see this not only from the raw scores…

… but also the head to head match-ups in the training data.

We took some extra time to bring our analysis to life by creating an interactive version of our Funding Map so others can explore the donation space we derived.


In conclusion, we think our modeling efforts and statistical visualizations get us all one step closer to more deeply understanding public goods funding and how we can better allocate capital to what matters in the space!

1 Like

Part 1
Measuring an Open Source Project’s Impact
I focused on measuring the impact of each project, using the V-Index.

V-Index is a measure of the open source project’s impact in the ecosystem.
(How to measure the impact of your open source project | Opensource.com)

This was borrowed from the h-index used in academia, but applied to measure the impact of open source projects through their dependency network.

The V-Index is N, where N is the number of first-order dependencies that each have at least N second-order dependencies.

For example, if a project has 5 first-order dependencies, and among those:

  • Dependency A has 4 second-order dependencies

  • Dependency B has 3 second-order dependencies

  • Dependency C has 3 second-order dependencies

  • Dependency D has 2 second-order dependencies

  • Dependency E has 1 second-order dependency

Then the V-Index would be 3, since there are 3 dependencies (A, B, C) that have at least 3 second-order dependencies.

To calculate the V-Index, we need:

  • the count of first-order dependencies of an open source project
  • the count of second-order dependencies of each of those first-order dependencies

Ideally we would get the dependents information from GitHub Insights, but it is too large to collect. I used data from the OSO dataset instead, so it might not reflect the full real-world picture.
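A minimal sketch of the V-Index calculation, assuming we already have, for one project, the list of second-order dependency counts of its first-order dependencies:

def v_index(second_order_counts):
    # Largest N such that N first-order dependencies each have >= N second-order dependencies.
    counts = sorted(second_order_counts, reverse=True)
    v = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            v = i
        else:
            break
    return v

print(v_index([4, 3, 3, 2, 1]))   # the example above -> 3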

Here is my implementation: GitHub - Jake-Song/deepfunding-prediction

I started this work based on davidgasquez’s model and repository.

  • The v-index was calculated for each project and incorporated into the train/test datasets.
  • Upon analyzing feature importance, the log values showed no significant impact on the results.
  • However, the ratio values demonstrated strong relevance in predicting outcomes.
  • I filtered to the actual values and ratios of stars, forks, and issues, and added the value and ratio of the V-Index.

The V-Index seems to work well with the target.

But there are some issues:

  1. The V-Index can’t measure at a granular level. For example, if
  • Dependency A has 1 second-order dependency

  • Dependency B has 0 second-order dependencies

  • Dependency C has 0 second-order dependencies

  • then the V-Index is 1. But if dependency B or C also had 1, the V-Index would still be 1, and if both had 1, the V-Index would still be 1. It needs a way to distinguish these cases.

  2. There are cases where a big project has a low V-Index.
  • Big projects (go-ethereum, revm, solidity, etc.) aren’t likely to be dependencies of other projects.

Future work

  • Modify the V-Index mechanism to measure at a more granular level
  • Extract features about the impact of big projects

Part 2
Embeddings + features weighted by time
I treated the embeddings as the qualitative part and the GitHub activity numeric features as the quantitative part.

  1. How to get embeddings
  • I asked an LLM to give me overviews of the projects along these categories:
    • By Function or Domain
    • By Licensing Model
    • By Governance & Sponsorship
    • By Project Maturity & Lifecycle
    • By Community Size & Activity
    • By Technical Stack or Ecosystem
  • Get embeddings of the LLM response
  • Models: gpt-4o-mini, text-embedding-3-small
  2. Weighting by time (see the sketch after this list)
  • Add weighted features
  • Set indices to quarters
  • The 1st index is 2016-04, …, the last index is 2024-10, covering both train and test
  • Weights range from 0.5 to 1
  • Match indices to weights (the 1st index has weight 0.5 and the last index has weight 1)
  3. Training
  • Train on the embeddings and weighted features together
  • Used LightGBM; it’s efficient and fast
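A minimal sketch of the quarter weighting, assuming quarters run from 2016-04 to 2024-10 and weights rise linearly from 0.5 (oldest) to 1.0 (latest):

import numpy as np
import pandas as pd

quarters = pd.period_range("2016Q2", "2024Q4", freq="Q")    # 2016-04 ... 2024-10
weights = np.linspace(0.5, 1.0, len(quarters))
quarter_weight = dict(zip(quarters.astype(str), weights))

# A quarter's numeric activity features can then be scaled by quarter_weight["2024Q4"], etc.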

My implementation: GitHub - Jake-Song/deepfunding-prediction

Enhancing Funding Predictions with Bradley-Terry Models

Summary

This is an interim and partial response to part 1 of the mini-contest. The table compares three implementations of the Bradley-Terry model.

  • v1: A baseline model using Ridge regression without funding ecosystem or clustering features. It achieves an MSE of 0.083 but lacks transitive consistency.
  • v2: Adds funding ecosystem encoding, significantly improving performance (MSE = 0.025) by capturing ecosystem-specific funding patterns. However, it still lacks transitive consistency.
  • v3: Introduces K-Means clustering on GitHub metrics to group similar projects, aiming to improve transitive consistency and further reduce MSE. Results are work-in-progress (WIP), but clustering is expected to enforce logical consistency in predictions by smoothing strength scores within clusters.
| Model | Bradley-Terry Framework | Funding Ecosystem Encoding | Cluster Membership | Transitive Consistency | MSE |
|---|---|---|---|---|---|
| v1 | With Ridge regression | - | - | No | 0.083 |
| v2 | With Ridge regression | Impact of funding ecosystem on weight distribution | - | No | 0.025 |
| v3 | With Ridge regression | Impact of funding ecosystem on weight distribution | K-Means Clustering on Github Metrics | WIP | WIP |

Why Bradley-Terry Model is a Good Fit for This Problem

Built for ranking and preference modeling, it is well-suited for problems involving relative comparisons, such as funding allocation. It is designed explicitly for pairwise data, making it ideal for comparing two projects at a time, which aligns with the structure of the dataset. It assigns a “strength score” to each project, reflecting its funding attractiveness, which provides clear insights into what drives funding decisions.

Hypothesis on how the Funding Ecosystem impacts weight distribution

Using the OSS Funding repository by OSO, you can classify 95% of the projects in the training dataset as securing all their funding from Open Collective. The remaining projects are primarily supported by ecosystems like Gitcoin Grants and Optimism RetroPGF.

In this article, I outline how funding ecosystems like Gitcoin Grants (quadratic funding) and OpenCollective (direct donations) demonstrate distinct reward distribution curves.

A quick sampling of the power-law distributions across GG20, GG22, and Optimism RetroPGF rounds (graph on the left) and quarterly Open Collective donations over the last 4 years (graph on the right) reveals the following:

  • After excluding justifiable exceptions, Gitcoin Grants and Optimism RetroPGF consistently display similar power-law characteristics, with the top 20% of projects capturing 60% or more of total funding.
  • In contrast, Open Collective follows an even more extreme power-law distribution compared to Gitcoin and Optimism, as evident from the log-scale Y-axis (also the reason why I could not accommodate all three ecosystems in a single visualization)

Click here for an interactive utility I built to review reward distribution curves of past funding rounds in these three ecosystems.

So what? This indicates that,

Even with all other factors being equal between a pair of projects, their relative weights can vary across ecosystems due to differences in reward curves shaped by contributor dynamics and the inherent biases of each ecosystem’s design.

The table provides a conceptual view of the feature matrix and target vector y used in the Bradley-Terry framework for the v2 model. Each row represents a pairwise comparison between projects, with features indicating project involvement (1 for the selected project, -1 for its competitor, and 0 for all others). The ecosystem column encodes additional context, such as whether the projects belong to a specific funding ecosystem (e.g., 1 for OpenCollective, 0 for all others). The target vector y captures the proportion of weight allocated to the first project in the pair, reflecting its relative strength or preference within the ecosystem.

| lighthouse | grandine | walletconnect-monorepo | ethers.js | go-minisign | ecosystem | y |
|---|---|---|---|---|---|---|
| 1 | -1 | 0 | 0 | 0 | 0 | 0.5252782983 |
| 0 | 0 | 1 | -1 | 0 | 1 | 0.3317168721 |
| 0 | 0 | 1 | 0 | -1 | 1 | 0.00188679245 |
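A hedged sketch of how this design matrix and a Ridge fit could look, using the three example rows above (the actual implementation may differ):

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

train = pd.DataFrame({
    "project_a": ["lighthouse", "walletconnect-monorepo", "walletconnect-monorepo"],
    "project_b": ["grandine", "ethers.js", "go-minisign"],
    "ecosystem": [0, 1, 1],
    "weight_a": [0.5252782983, 0.3317168721, 0.00188679245],
})

projects = sorted(set(train["project_a"]) | set(train["project_b"]))
idx = {p: i for i, p in enumerate(projects)}

X = np.zeros((len(train), len(projects) + 1))
for r, (a, b, eco) in enumerate(zip(train["project_a"], train["project_b"], train["ecosystem"])):
    X[r, idx[a]] = 1          # selected project
    X[r, idx[b]] = -1         # its competitor
    X[r, -1] = eco            # funding-ecosystem encoding
y = train["weight_a"].to_numpy()

model = Ridge(alpha=1.0).fit(X, y)
strengths = dict(zip(projects, model.coef_[:-1]))   # per-project "strength scores"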

Next Steps - Why Clustering Can Help with Transitive Consistency

Both v1 and v2 suffer from transitive inconsistency, where predictions may contradict each other (e.g., A > B > C > A). To address this, the next step is to use clustering algorithms. Here’s why clustering could help:

  • Projects in the same cluster are likely to have similar funding patterns, reducing inconsistencies in pairwise comparisons.
  • Projects in higher-ranked clusters are consistently predicted to outperform those in lower-ranked clusters, enforcing transitivity.

Besides, clusters act as latent features that generalize across similar projects, improving the model’s ability to predict funding outcomes for unseen projects within the same cluster - this can come in handy for Deep Funding in the main contest and for scaling it beyond the current scope.

PS: This post will be updated with the latest findings and a refactored code repo implementing v3 as spec’d above.

My repository submission for the Pond and HF deepfunding competitions can be found at: GitHub - theokha/pond-funding-comp

Data Preprocessing for Better Model Generalization

Data enrichment

The initial train and test datasets provided in the Hugging Face overview section contain very limited information. Using only this information as features is not enough to feed the model.

By gathering external information like repository metrics and social media engagement (the latter not yet implemented as of now), we can enrich the dataset features and likely enhance model performance, but we have to choose carefully which features we want to use.

I gathered all training and test data, including OSS funding (dependency-graph/datasets/oso/oss_funding.parquet at main · deepfunding/dependency-graph · GitHub), to get all repository links.

The Pond, HF, and OSS data yielded a repo link list containing 166 unique links. Using the GitHub API, I gathered all the metrics (some of them inspired by davidgasquez’s work) and used beautifulsoup4 to get additional data like used-by counts and separate closed/open issue and pull request counts.
After that, I simply joined that metrics data with each train/test dataset in both the Pond and HF competitions.
*Using the separated closed/open metrics lowered the training MSE from 0.01326 to 0.01278. When I submitted on the HF platform, it lowered the MSE from 0.0171 to 0.0141.


Data oversampling

For the HF competition, I also used the training datasets from Pond and OSS funding (with a lower ratio) and the HF training data with a higher ratio. One interesting method is implementing SMOTE and ADASYN to handle imbalanced data (Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering | Journal of Big Data | Full Text).

Edit :

Dealing with skewed data

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. In simpler terms, it describes whether the distribution of data is symmetrical or not. See the image below.

Using the size and stars metrics of project_a, we can see that both metrics are positively skewed.

So how do we handle this?
We can address this problem with a logarithmic transformation. This method is particularly suited for right-skewed data, effectively minimizing large-scale differences by taking the natural log of all data points, resulting in a more symmetrical distribution. This compression of the data range makes it more amenable to further statistical analysis.
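A minimal sketch of the transformation (log1p handles zero values safely; the numbers are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({"stars_a": [5, 120, 30000], "size_a": [90, 4500, 250000]})  # placeholder values
for col in ["stars_a", "size_a"]:
    df[f"log_{col}"] = np.log1p(df[col])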

Summarize Readme file with BART

Aside from GitHub metrics, the README file can be useful for capturing the context or description of a repository. However, README files can often be lengthy and detailed, which can make them overwhelming for quick reference, so they need summarization. I chose BART for this task.
BART (Bidirectional and Auto-Regressive Transformers) is a transformer-based model that has shown strong performance in various natural language processing tasks, including abstractive text summarization. Summarization result:

When it comes to features, the repo’s summary should be handled differently than numerical and categorical data. I need to convert the summarized text into TF-IDF vectors before passing them to the regression model.

Utilizing the repo’s README summary lowered the MSE from 0.00703 to 0.00616.
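A hedged sketch of this step with the Hugging Face transformers summarization pipeline and scikit-learn’s TF-IDF (the model choice and parameters here are illustrative):

from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
readmes = ["Long README text for project A ...", "Long README text for project B ..."]
summaries = [summarizer(text, max_length=60, min_length=10)[0]["summary_text"]
             for text in readmes]

tfidf = TfidfVectorizer(max_features=200)
summary_features = tfidf.fit_transform(summaries)  # sparse matrix, joined with the numeric features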

Future works

After reading some research articles, I found that Özçevik and Altay developed a GitHub repository metrics data collection tool named MetricHunter. It uses SourceMonitor to gather various data and stores less common GitHub metrics, e.g., lines, statements, complexity, and block depth (Özçevik and Altay 2023).


FYI, using this app requires sufficient time and storage, as it downloads each repository to local storage, calculates the metrics, then deletes the repository data and creates a CSV report.

Liao et al. (2023) propose a GCN-based repository recommendation system, using both user and repository behaviour, that can capture unseen relations or patterns between repositories. But I think it is too complex and needs huge computational resources, as it is graph-based with high dimensionality. See the image below.

Sources:

Liao, Zhifang, Shuyuan Cao, Bin Li, Shengzong Liu, Yan Zhang, and Song Yu. 2023. “Graph Convolutional Network-Based Repository Recommendation System.” Computer Modeling in Engineering & Sciences 137(1): 175–96. doi:10.32604/cmes.2023.027287.

Özçevik, Yusuf, and Osman Altay. 2023. “MetricHunter: A Software Metric Dataset Generator Utilizing SourceMonitor upon Public GitHub Repositories.” SoftwareX 23: 101499. doi:10.1016/j.softx.2023.101499.

Wang, Tianlei, Shaowei Wang, and Tse-Hsun (Peter) Chen. 2023. “Study the Correlation between the Readme File of GitHub Projects and Their Popularity.” Journal of Systems and Software 205: 111806. doi:10.1016/j.jss.2023.111806.

Cross commit features

Instead of writing a comprehensive overview of the model, it would be more useful for the community to do a deep dive on a family of features that aim to capture the essence of software dependencies: cause-effect relationships between code commits. The idea is to look at historical activity to extract information about relationships between projects.

The notebooks used to compute these features are at GitHub - HonorCodeDAO/deepfunding_rep: Efforts to support the deepfunding.org initiative, and we will continue to refine the strategies outlined below.

Why do this?

Put simply, if Project B depends on Project A, we would expect it to be more likely that B will have more code update activity after updates from A than before. This effect could be due to a number of causes, including maintenance, API changes, or new features. In short, there are aspects of dependency present in the coincidental activity that would not be found looking at each project separately. It is difficult to think of any more correlated indicator, other than the code itself. (Keep in mind that any measure will be expected to have a great deal of noise.)

How to Measure?

There are several possible approaches to extract these features. I’ve settled on two main ones:

  1. Pre/Post Ratio. Of all commits within some given timeframe, how many times does A precede B? This indicator counts up those A-commit, B-commit pairs and takes the ratio PPR = a_before_b / (a_before_b + b_before_a).
  2. Conditional within time frame. Instead of looking at all pairs, we ask the question: For each commit of B, how many occurred within X days of one from A? This metric aims to detect how likely it is that B’s activity comes after some activity of A, but not counting how much of A. So the ratio is based on percentages:

CWT = count_a_within_t_of_b / count_a

We can also do the same for B, and take another ratio between the two.

Furthermore, we can split the pre/post ratio into two parts: counting the pairs within months and comparing the times of each, and counting the pairs between consecutive months. Here, we have to make decisions about which time horizon we care about, and also be careful that the join doesn’t become too big. Since we have at least 100K commits over all projects, it is infeasible to merge the whole thing to itself so we need to make shortcuts. The join between months is much easier because we can count each project’s activity over each month. This will end up missing some cases, but the hope is that we can capture most of the tendency. For the “within” period, we’ve chosen 7 days, but this could be changed to any interval within the month.
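A hedged sketch of the within-month pre/post ratio on a toy commit table (the notebooks do this at scale with the shortcuts described above):

import pandas as pd

commits = pd.DataFrame({   # toy commit history
    "project": ["A", "B", "A", "B", "B"],
    "timestamp": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2024-01-10", "2024-01-12", "2024-01-20"]),
})

def ppr_within_month(commits: pd.DataFrame, a: str, b: str) -> float:
    a_before_b = b_before_a = 0
    for _, month in commits.groupby(commits["timestamp"].dt.to_period("M")):
        ta = month.loc[month["project"] == a, "timestamp"]
        tb = month.loc[month["project"] == b, "timestamp"]
        a_before_b += sum((t < tb).sum() for t in ta)   # A-commit, B-commit pairs with A first
        b_before_a += sum((t < ta).sum() for t in tb)
    total = a_before_b + b_before_a
    return a_before_b / total if total else 0.5

print(ppr_within_month(commits, "A", "B"))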

What is the result?

While this submission is not optimized in terms of the best available features, we can examine the correlation of each metric with the provided weights. We do see positive effects, with a big caveat. Because the OSO DB is incomplete when it comes to coverage of the contest projects, over half of the pairs are missing data for at least one of the projects. Therefore, this indicator will be much weaker until we can resolve the issue.

For the pairs we do have, the correlations for Pond Data with weight_a are as follows:

PPR(within months): 0.15
PPR(between months): 0.21
Count commit within T of activity: 0.20

In addition, these features are largely uncorrelated with the individual feature ratios like stars or funding, which should help with model building.

These graphs show that there is definitely some signal showing up, although the large amount of missing data weakens the effect overall. We will continue the quest for the missing activity history and to move this effort forward.

Approach: Text Regression Using Transformers With One Line Summary of Projects

Our approach is a middle ground between agent-based methods and traditional feature-based approaches. We utilized text regression with transformers to analyze Git project logs and predict funding amounts.

Data Collection & Preprocessing

  1. Fetching Git Logs: We collected Git logs for each project.
  2. Summarization with Gemini: Due to its extended context length, we used Gemini to summarize the activity in the logs. This resulted in a concise, one-line summary of what each project does.
    • For well-known projects, an LLM can generate a summary directly.
  • Example:
    • “Prettier plugin for Solidity formatting, supporting the latest Prettier & Node.js with continuous improvements.”
  3. Final Training Dataset: The processed dataset was structured as follows:
    • Input Format: “Project A: description of project a, Project B: description of project b”
    • Label: weight_a

To augment the dataset, we doubled the training data by mirroring project pairs (i.e., swapping Project A and Project B).

Model Selection & Experiments

We experimented with multiple transformer models:

  • BERT (Best performing in our case)
  • RoBERTa
  • Longformer

Performance:

  • Using BERT, we achieved an MSE of 0.0206 on Hugging Face.
  • Alternative input formats were tested:
    1. Adding project metadata: “Project A: description of project a (star count: x, fork count: y), Project B: description of project b (star count: x, fork count: y)”
      • This did not significantly improve the MSE.
    2. Using a more elaborate description with Longformer:
      • MSE increased to 0.1, indicating longer descriptions may not be beneficial.
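A hedged sketch of the text-pair regression setup with the Hugging Face Trainer (a single made-up example; the actual training configuration may differ):

import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

texts = ["Project A: a Solidity formatter plugin. Project B: an Ethereum consensus client."]
labels = [0.67]                      # placeholder weight_a values

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

ds = Dataset.from_dict({"text": texts, "label": [float(l) for l in labels]})
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length",
                                max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-weight-a", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()
preds = trainer.predict(ds).predictions.squeeze()   # predicted weight_a values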

Example of a Detailed Description Used with Longformer

[Description]
A Solidity code formatter for Prettier.

[Milestones]

  • Standalone Prettier Support: Enables independent testing and broader compatibility.
  • Prettier v3 Support: Compatibility added for Prettier v3 alongside v2. Required logic adjustments in comment printing with backward compatibility tests.
  • ECMAScript Modules Migration: Project refactored to use ECMAScript modules.
  • Dropped Node 14 and 16 Support, Added Node 20: Improved performance and maintainability.
  • User-Defined Operators Support: Enhanced parsing and formatting capabilities.
  • Numerous Formatting Improvements: Enhancements in array indentation, return statements, function spacing, enums, structs, and more.

[Patterns]

  • Regular Dependency Updates: Indicates active maintenance and frequent version changes.
  • Focus on Compatibility: Continuous support for newer Prettier and Node versions.
  • Community Contributions: Contributions from external developers indicate strong community engagement.
  • Refactoring and Code Quality Improvements: Use of ESLint, testing, and code coverage tools demonstrates a commitment to quality.
  • Technical Debt Indicators: Frequent bug fixes point to complex parsing and formatting logic.

Conclusion

Our transformer-based text regression approach appears to predict relative funding weights effectively from Git log summaries. BERT performed best, achieving an MSE of 0.0206, while alternative input formats and longer descriptions did not improve results. Future work could explore fine-tuning BERT further or testing other summarization techniques to enhance dataset quality.

1 Like

Data-Centric AI Approach for Deep Funding Mini-Contest
Objective
Our goal was to predict relative funding among Ethereum open-source repositories using a data-centric AI approach. Instead of focusing solely on model tuning, we emphasized feature engineering and dataset enrichment by integrating data from GitHub, Open Source Observer, and GitIngest.
Methodology

  1. Data Collection & Preprocessing
    • Gathered repository metadata, funding history, and community engagement metrics.
    • Standardized timestamps and calculated time-based features (e.g., repo age, update frequency).
  2. Feature Engineering
    We extracted meaningful indicators to enhance model predictions:
    • Growth & Activity: Star and fork growth rates, release frequency, recent funding trends.
    • Community Health: PR merge ratios, issue close rates, and discussion engagement.
    • Project Maturity: Documentation completeness, project size, and governance indicators.
    • Ecosystem Impact: Dependency influence, language market share, and entropy of repository topics.
    • Scoring Metrics: Composite scores for popularity and sustainability, balancing activity, governance, and ecosystem impact.
  3. Model Training & Evaluation
    We trained several regression models and evaluated them using Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):

    Model                      MSE      RMSE
    Gradient Boosting (best)   0.0532   0.2306
    SVR                        0.0573   0.2394
    Random Forest              0.0581   0.2409
    Ridge Regression           0.1166   0.3414
    Linear Regression          0.1166   0.3414
    Lasso / ElasticNet         0.1512   0.3888

    Gradient Boosting achieved the best performance, but the overall improvement compared to our previous results with raw features was minimal (an MSE improvement of only 0.004 on the POND validation dataset).
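As a reference, a minimal sketch of this kind of model comparison in scikit-learn, assuming an engineered feature matrix X and target vector y (weight_a) already exist (illustrative only, not the exact pipeline):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge, LinearRegression, Lasso, ElasticNet

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Ridge Regression": Ridge(),
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name:20s} MSE={mse:.4f} RMSE={np.sqrt(mse):.4f}")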
Conclusion

While our feature engineering and dataset refinement introduced a structured and scalable approach, the actual impact on prediction accuracy was marginal. This suggests that:

  1. Feature engineering alone may not significantly enhance funding predictions, as the dataset already contains strong signals.
  2. The quality and consistency of the original funding data play a bigger role in model performance than additional transformations.
  3. Further improvements may require alternative modeling techniques or additional external data sources.

Despite the limited gains, this data-centric methodology ensures consistency, interpretability, and scalability for the upcoming Deep Funding contest. :rocket:

Building a Robust Model for Predicting Funding Allocations

In tackling the challenge of predicting funding allocations for Ethereum projects, I developed a structured pipeline that integrates graph-based learning, anomaly detection, external data sources, and advanced machine learning techniques. This approach ensures a highly predictive and mathematically consistent model while maintaining robustness against noise and inconsistencies.

Data Processing and Feature Engineering

The foundation of this model lies in effective data preprocessing. To enhance numerical stability, I applied logarithmic transformations and encoded temporal aspects using sine and cosine representations. These transformations help preserve cyclic patterns while preventing distortions caused by raw categorical encoding.
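A minimal sketch of these two transformations, assuming a DataFrame df with a funding amount column total_amount_usd and a quarter number 1-4 (the column names are assumptions):

import numpy as np

df["log_amount"] = np.log1p(df["total_amount_usd"])       # compress heavy-tailed amounts
df["quarter_sin"] = np.sin(2 * np.pi * df["quarter"] / 4)  # cyclic encoding so that Q4
df["quarter_cos"] = np.cos(2 * np.pi * df["quarter"] / 4)  # sits next to Q1, not far away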

A crucial component of this pipeline is the construction of a dependency graph, where projects are represented as nodes and funding relationships as weighted edges. Using this graph, I extracted various network centrality metrics such as PageRank, eigenvector centrality, betweenness, and closeness. These features capture the relative influence of projects within the funding network, allowing the model to account for systemic dependencies in funding distributions.
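A minimal sketch of how such graph features could be built with networkx, assuming the graph is constructed from the pairwise comparison rows (project_a, project_b, weight_a); this illustrates the idea rather than the exact construction:

import networkx as nx

G = nx.Graph()
for _, row in df.iterrows():
    # Projects as nodes, pairwise funding comparisons as weighted edges.
    G.add_edge(row["project_a"], row["project_b"], weight=row["weight_a"])

centrality = {
    "pagerank": nx.pagerank(G, weight="weight"),
    "eigenvector": nx.eigenvector_centrality(G, weight="weight", max_iter=1000),
    "betweenness": nx.betweenness_centrality(G, weight="weight"),
    "closeness": nx.closeness_centrality(G),
}

# Map each centrality metric back onto both sides of every pair.
for name, scores in centrality.items():
    df[f"{name}_a"] = df["project_a"].map(scores)
    df[f"{name}_b"] = df["project_b"].map(scores)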

Additionally, I integrated external GitHub repository data, which provides insights into the development activity of each project. By incorporating metrics like stars, forks, issues, and commits, I ensured that project traction is factored into funding predictions.

To further refine the dataset, I implemented an anomaly detection mechanism using the Local Outlier Factor (LOF). This step identifies and mitigates irregularities in funding distributions that might arise from one-off anomalies or inconsistencies in the historical data.
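A minimal sketch of this filtering step with scikit-learn's LocalOutlierFactor, assuming a numeric feature matrix X_num and target y aligned by row (n_neighbors and contamination are assumed values):

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
inlier_mask = lof.fit_predict(X_num) == 1   # LOF marks local outliers with -1

X_clean, y_clean = X_num[inlier_mask], y[inlier_mask]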

Model Optimization and Training

Once feature engineering was complete, I constructed a machine learning pipeline that leverages state-of-the-art regression models, including LightGBM, XGBoost, and Gradient Boosting Regressor. To achieve optimal hyperparameter tuning, I employed Optuna, an efficient optimization framework that systematically searches for the best model configurations.
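A minimal sketch of an Optuna study for one of the boosters (LightGBM shown here; the search ranges and trial count are assumptions, and X_train, y_train, X_val, y_val are assumed to exist):

import lightgbm as lgb
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):
    params = {
        "objective": "regression",
        "metric": "mse",
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 10, 200),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "verbose": -1,
    }
    train_set = lgb.Dataset(X_train, y_train)
    valid_set = lgb.Dataset(X_val, y_val, reference=train_set)
    booster = lgb.train(params, train_set, valid_sets=[valid_set],
                        num_boost_round=2000,
                        callbacks=[lgb.early_stopping(100, verbose=False)])
    return mean_squared_error(y_val, booster.predict(X_val))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)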

Feature scaling and imputation were handled using RobustScaler and SimpleImputer, ensuring stability against outliers and missing values. I split the dataset into training and validation sets to prevent overfitting and ensure generalizability.

The ensemble approach combines the strengths of multiple gradient-boosting models, leading to superior performance by reducing variance and capturing diverse patterns in the data. The final predictions were obtained by averaging the outputs of the trained models, further improving stability and accuracy.

Ensuring Mathematical Consistency

One critical challenge in funding prediction is ensuring non-negativity and transitive consistency. To address this, I designed a corrective mechanism that detects negative predictions and retrains a targeted correction model. This step ensures that funding values remain within a valid range while preserving learned relationships from historical data.

Conclusion

By integrating graph-based insights, anomaly detection, external data sources, and a carefully tuned ensemble model, I created a robust and interpretable solution for predicting Ethereum project funding. This approach not only enhances predictive accuracy but also ensures the model remains mathematically coherent and adaptable to real-world funding dynamics. The resulting framework is well-equipped to assist in fair and efficient funding allocation decisions within the Ethereum ecosystem.

Approach to Predicting Relative Funding for Open-Source Repositories

Problem Statement

The goal of this project is to develop a machine learning model capable of accurately predicting the relative funding allocation for open-source repositories based on various input features. The primary objective is to minimize the Mean Squared Error (MSE) of the predictions, with a target of achieving an MSE score close to 0.01.

Dataset Overview

The dataset consists of training and test files containing project-related metadata, including project identifiers, funding details, and categorical attributes such as funders and quarters. The target variable is weight_a, which represents the proportion of funding allocated to a specific project.

Data Preprocessing and Feature Engineering

To ensure high model performance, extensive data preprocessing and feature engineering were performed:

  1. Handling Missing Values: Missing values in the target variable were replaced with the mean of the available values, and categorical features with missing values were assigned a placeholder category ('unknown').
  2. Encoding Categorical Variables: One-hot encoding was used for categorical features like funder, project_a, and project_b to ensure they are correctly processed by machine learning models.
  3. Log Transformation: Log transformation was applied to skewed numerical features to reduce the impact of outliers and improve model generalization.
  4. Feature Scaling: A StandardScaler was applied to normalize numerical features, ensuring that distance-based steps such as KNN imputation behave consistently across features.
  5. KNN Imputation: Missing values in numerical columns were imputed using K-Nearest Neighbors (KNN) imputation to capture potential relationships between features.
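A minimal sketch of these preprocessing steps combined into a single scikit-learn pipeline (the numeric column total_amount_usd and the DataFrame name train_df are assumptions):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import KNNImputer

categorical = ["funder", "project_a", "project_b"]
numeric = ["total_amount_usd"]

preprocess = ColumnTransformer([
    # One-hot encode the categorical identifiers.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Impute, log-transform, and scale the skewed numeric features.
    ("num", make_pipeline(KNNImputer(n_neighbors=5),
                          FunctionTransformer(np.log1p),
                          StandardScaler()), numeric),
])

X = preprocess.fit_transform(train_df[categorical + numeric])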

Model Selection and Training

Three gradient-boosting models were evaluated and fine-tuned:

  1. XGBoost: Known for its efficiency and ability to handle missing values internally.
  2. LightGBM: Optimized for speed and performance with large datasets.
  3. CatBoost: Particularly effective with categorical data, requiring minimal preprocessing.

Hyperparameter tuning was performed using Bayesian Optimization (BayesSearchCV), with the following key parameters optimized:

  • n_estimators: Number of boosting rounds (100-1000).
  • learning_rate: Step size for updating weights (0.001-0.2).
  • max_depth: Depth of decision trees (3-15).

A 5-fold Stratified Cross-Validation strategy was used to ensure that the model’s performance was stable across different data splits.
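A minimal sketch of the Bayesian search for one of the models (XGBoost shown; n_iter and the scoring choice are assumptions, and X, y are the preprocessed features and target):

from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBRegressor

search = BayesSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    {
        "n_estimators": Integer(100, 1000),
        "learning_rate": Real(0.001, 0.2, prior="log-uniform"),
        "max_depth": Integer(3, 15),
    },
    n_iter=30,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)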

Model Ensembling for Improved Accuracy

To further reduce the MSE, an ensemble approach was employed:

  1. Averaging Predictions: The outputs of the three models were combined using a weighted averaging approach to reduce variance and bias.
  2. Stacking Ensemble: The predictions from individual models were used as features for a meta-model, further refining the final predictions.
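A minimal sketch of both ensembling steps, assuming three fitted base models xgb_m, lgbm_m, and cat_m plus existing train/test splits (all names and the averaging weights are illustrative):

import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge

# 1. Weighted averaging of the three base models' predictions.
weights = [0.4, 0.3, 0.3]   # assumption: weights would be tuned on a validation set
avg_pred = np.average(
    [xgb_m.predict(X_test), lgbm_m.predict(X_test), cat_m.predict(X_test)],
    axis=0, weights=weights)

# 2. Stacking: base-model predictions become features for a meta-model.
stack = StackingRegressor(
    estimators=[("xgb", xgb_m), ("lgbm", lgbm_m), ("cat", cat_m)],
    final_estimator=Ridge())
stack.fit(X_train, y_train)
stack_pred = stack.predict(X_test)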

Results and Performance Evaluation

Through iterative improvements, including better feature engineering, optimized hyperparameters, and ensemble learning, the final model achieved a significantly lower MSE compared to initial models. The refined approach brought the MSE closer to 0.01, demonstrating the effectiveness of the enhancements.

Future Work and Optimizations

While the current model is optimized for accuracy, further improvements could include:

  • Feature Selection Using SHAP Values: To identify the most impactful features and reduce model complexity.
  • Time-Series Analysis: If historical funding data is available, using a time-series approach could further improve predictions.
  • Deep Learning Methods: Exploring neural networks, particularly attention-based architectures, for complex feature interactions.

By implementing these enhancements, the model can be further optimized to achieve even higher accuracy in funding prediction tasks.

1 Like

Approach to Predicting Relative Funding for Open-Source Repositories

github - Funding-Ethereum/Another copy of ETHEREUM.ipynb at main · dapslegend/Funding-Ethereum · GitHub

We tried various supervised learning approaches for this because we thought someone had achieved 0.02 on CryptoPond through supervised learning, so we kept pushing, which is probably why I appeared twice on the leaderboard:
#22 Ayodapo “D4PS” Adesiyan (0.0597) and #24 D4PS (0.0605), using the same model but different LightGBM tuning parameters.

Part 1

Given the dataset, the only thing I extracted from GitHub was star counts. I did notice improvements in the model with GitHub stars, namely a significantly lower MSE. I started the competition late and didn’t have enough time to try other GitHub features like size, watchers, forks, and issues; they probably would have lowered the MSE even further.

import requests

def get_github_stars(owner, repo):
    # Query the GitHub REST API for a repo and return its star count (0 if missing).
    url = f"https://api.github.com/repos/{owner}/{repo}"
    response = requests.get(url)
    data = response.json()
    return data.get("stargazers_count", 0)

for index, row in dataset.iterrows():
    # extract_owner_repo is defined elsewhere in the notebook.
    owner_a, repo_a = extract_owner_repo(row['project_a'])
    owner_b, repo_b = extract_owner_repo(row['project_b'])

    dataset.at[index, 'stars_a'] = get_github_stars(owner_a, repo_a)
    dataset.at[index, 'stars_b'] = get_github_stars(owner_b, repo_b)

LIGHTGBM PARAMETER TUNING

In my other notebook we tried using grid search to find the best parameters, but we realized manual tuning worked better. A reduced learning rate with increased boosting rounds was far better than a learning rate of 0.1 with fewer boosting rounds. Increasing min_data_in_leaf also improved the model by preventing overfitting.

PREDICTION

LightGBM predicted the weights for each project using only the trained features in features_to_keep; all other features were dropped. More GitHub feature engineering would be required to improve the model and reach a lower MSE.

We also tried to explicitly enforce transitivity in the predictions, but we didn’t depend on that result: since the model works by predicting a weight for each project, transitivity is respected in its predictions even without directly enforcing it.

IMPORTANT FEATURES

features_to_keep = ['project_a', 'project_b', 'total_amount_usd', 'stars_a', 'stars_b']

test['weight_a_pred'] = model.predict(test[features_to_keep])

LightGBM PARAMS

params = {
    'objective': 'regression',
    'metric': 'mse',
    'boosting_type': 'gbdt',
    'learning_rate': 0.02,     # Reduced learning rate
    'num_leaves': 31,          # Adjusted num_leaves
    'max_depth': 5,            # Limit tree depth
    'min_data_in_leaf': 150,   # Increased min_data_in_leaf to reduce overfitting
    'feature_fraction': 1,     # Use all features (no column subsampling)
    'bagging_fraction': 1,     # Use all rows (no bagging)
    'lambda_l1': 0.01,         # L1 regularization
    'lambda_l2': 0.01,         # L2 regularization
    'verbose': -5
}

TRAIN THE MODEL

model = lgb.train(params,
                  train_data,
                  valid_sets=[test_data],
                  num_boost_round=20000,  # Increased boosting rounds
                  callbacks=[lgb.early_stopping(stopping_rounds=1000)])