On-chain Subjective Oracles: Ranking Systems

APR 28 2023

Ranking Systems in Web2

Web2 ranking systems consist of some combination of closed-source algorithms and private data, which undermines trust in the system due to the perverse incentives of the middlemen.

Any ranking system consists of the following components:

  • Subjective item to be rated (ex: services / goods / performances)
  • Samples / observations (ex: star-rating / UI actions / purchases)
  • Rater identity (ex: verified purchaser)
  • Rating model (ex: uniform distribution)
  • Model fitting algorithm (ex: gradient descent)

Web2 ranking systems are rarely closed on all axes, but are never open on all. An open ranking system makes all intermediary data and algorithms associated with deriving the final ranking signal available to all participants.

Marketplaces like Amazon or DoorDash publicly show the samples (reviews) and the resulting average, but require you to trust the marketplace not to censor or manipulate results. They do not provide information about the identity of participants or the incentives involved. Users must trust the marketplace to remove voters with exogenous interests (paid reviews) and those mounting Sybil attacks (many accounts representing a single user). Critical users of these marketplaces will know they fall short, without recourse. Often users must fall back on subjective ranking, reading text reviews in an attempt to derive their own signal.

Ranking systems involving people, such as gig-economy services, corporate human resources systems, and academic grading, suffer from grade inflation and gradually decay to a bimodal distribution. Often humans don't want to bear the cost associated with honesty; in the case of an HR system, this can be due to participants seeking to avoid retribution from peers. Whatever the reason, five-star systems that collapse to 1 star (terrible) or 5 stars (not terrible) lose a tremendous amount of dynamic range and limit the efficiency that more reliable polling could provide.

Due to these failures, we cannot apply existing ranking systems as an oracle for social consensus when the stakes are high.

If we start from the principle of trustlessness, can we create a ranking system that does not run into the pitfalls of web2? The goal of this project is to explore a ranking system that can be deployed by a group of participants who do not necessarily trust one another to come to consensus over the ranking of an arbitrary set of subjective works via public sampling and public modeling. A fitting place to deploy a system meeting these goals would be a censorship resistant verifiable compute layer such as a blockchain.

Ranking Systems in Web3

While it's still early to dream of the use cases here, there are some areas of the web3 landscape that are in clear need of a primitive of this nature today.

Some examples follow.

Trustless Web3 marketplaces. In the absence of legal recourse, how can participants trust one another?

DAO work distribution. DAOs depend on binary votes put forward by individuals in the community and lack any algorithmic granularity. Most decisions are "fund or not fund", "deploy or not deploy". We are promised a future where much of the bureaucracy and operational cruft is mitigated by algorithmic rails, yet today decentralized organizations operate by social consensus. If we were able to come to consensus on the subjective rank of a DAO's work output, decisions and resource allocations could be made more granular than binary.

Public goods funding. I would venture that the majority of instructions executed by computers globally come from free open source software. Yet most important open source software is maintained and supported altruistically, or by groups with high margins (big tech). Donations towards these efforts are typically executed trustfully: pay out one party, and they distribute unilaterally. If commits (code improvements) could be truthfully ranked by the participants themselves, public goods funding could automatically be allocated to those who participated most. A market could develop around open source software.

Joint creation of IP or art. If contributions in the form of blog posts or artistic samples can be ranked by the community itself, they can enshrine a distribution of contribution directly in the protocol for use with other economic incentives.

Scoping the System

Binary ranking

The majority of ranking / survey systems today depend on samples from participants in the form of an absolute score. There are many reasons humans are bad at this, from the cost of truthfulness to an inability to fit a uniform distribution from memory. For subjective interactions such as scoring direct reports in an HR system, how can a manager be expected to boil down the multivariate complexity of human performance to a single score? Humans are more willing to provide critical feedback textually, without an absolute score attached. But textual feedback adds another level of subjective indirection and does not get us closer to the goal of a numeric, uniform, and confident ranking system.

On the contrary, a binary sampling system might. It's much easier to rank two items sampled randomly from a set pairwise than it is to give each an absolute score sequentially. This is discussed in greater detail in the section on sampling below, but it is the basis for many of the other decisions and avenues of research.
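
As a minimal sketch of the interaction, assuming hypothetical helper names (`sample_pair`, `record_ranking`) rather than any defined protocol API, a single binary sampling round might look like:

```python
import random

def sample_pair(units):
    """Draw two distinct units of work uniformly at random for comparison."""
    return random.sample(units, 2)

def record_ranking(winner, loser, samples):
    """Append one pairwise preference: winner ranked above loser."""
    samples.append((winner, loser))

# A rater is shown a random pair and commits the one they prefer.
units = ["performance_a", "performance_b", "performance_c"]
samples = []
u_i, u_j = sample_pair(units)
record_ranking(u_i, u_j, samples)  # the rater preferred u_i over u_j
```

The rater's entire burden per round is one binary choice, which is what makes the sample cheap to provide honestly.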

Defining The System

A ranking system can be parameterized by:

  • Set of items to be ranked ("Units of work"): U
  • Samples: S
  • Sampling method: g(U_i, U_j) \to S_n
  • Probability model: F(\theta | S)
  • Recursive model fitting algorithm: \theta_{n+1} = train(\theta_n)
  • Set of raters providing samples: R
  • Set of workers providing units of work: C
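
To make the parameterization concrete, these components can be collected in a small container; the names here are illustrative placeholders, not part of any defined protocol:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class RankingSystem:
    units: List[str]                      # U: units of work to be ranked
    raters: List[str]                     # R: identities providing samples
    contributors: List[str]               # C: identities providing units of work
    # S: collected samples, here as (winner, loser) pairs
    samples: List[Tuple[str, str]] = field(default_factory=list)
    # g: sampling method mapping a pair (U_i, U_j) to a sample S_n
    sampling_method: Optional[Callable[[str, str], Tuple[str, str]]] = None
    # theta: model parameters fit from S (e.g. one latent strength per unit)
    theta: Dict[str, float] = field(default_factory=dict)
```

Each field corresponds to one bullet above; a concrete deployment would pin down the types (e.g. samples as K-tuples for Plackett-Luce).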

I'll briefly explore my current thinking and the design space for each of these items, but note all of these are open research areas.

Set of "Units of work"

The ranking system is defined by R raters responsible for committing samples, which consist of their preferences over the dataset of U "units of work" created by contributors C. Examples of these units of work could be "jazz performances", "code contributions" or "forum posts". The only requirement is that they're directly subjectively comparable. Additionally, there should be a general prompt along the lines of "Which provides the most (value / joy) to you?" to which the raters R provide a sample. An ideal set of (U, g, R, C) has high consensus about the ranking of a randomly selected pair (U_i, U_j); the threshold for "high" is an open question.

Even in the case of low consensus, the methodology suggested below provides both a model fit over the samples and the geometric margin between a given set of samples and the final model. Using this property, honest users can adjust their interactions with the system economically or socially. A clear failure mode here would be incentivizing ranking towards the perceived mean rather than personal preference.

Sampling Rankings and Modeling

It's important to define the interaction of a participant providing ranking / preference samples. As mentioned above, humans are bad at providing contentious ratings on an absolute scale. If we can break down the process of "rating" each unit of work into individual pairwise "rankings", we may be able to avoid many of the problems associated with subjectivity and multivariate comparisons.

While it's hard to score a ballet performance on an absolute scale, it is likely easy to rank two performances against each other.

Binary pairwise rankings can be represented by a Bradley-Terry probability model. The extension allowing K items per ranking sample is known as Plackett-Luce. These models are well studied and have known algorithms for fitting to a set of samples.
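
Under Bradley-Terry, each unit U_i carries a latent strength s_i, and the probability a rater prefers U_i over U_j is s_i / (s_i + s_j). A minimal sketch of the model and its log-likelihood over collected samples:

```python
import math

def bt_prob(strength_i: float, strength_j: float) -> float:
    """Bradley-Terry: probability that item i is preferred over item j."""
    return strength_i / (strength_i + strength_j)

def log_likelihood(strengths: dict, samples: list) -> float:
    """Log-likelihood of observed (winner, loser) samples under the model."""
    return sum(math.log(bt_prob(strengths[w], strengths[l])) for w, l in samples)

strengths = {"a": 2.0, "b": 1.0}
# With strengths 2 vs 1, "a" is preferred with probability 2/3.
p = bt_prob(strengths["a"], strengths["b"])
```

Fitting means choosing the strengths that maximize this log-likelihood; the strengths themselves then induce the final ranking.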

Of course this is not the only sample collection strategy, but is chosen for its simplicity within a complex system.

Fitting The Model

Once samples S have been collected, we can verifiably fit the model on-chain to create a ranking over all of the U units of work. These model fitting algorithms are generally iterative and quite expensive (think gradient descent). We can move some portion of the compute off-chain and submit succinct zk-proofs of model parameter updates. If storing the S samples on-chain is too expensive, we can move them off-chain and only post a commitment to this data on-chain. Succinct proofs can compute the commitment, and the verifier can ensure the correct data was used during each model update.
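
For Bradley-Terry specifically, a standard fitting procedure is the iterative minorization-maximization (MM) update (Zermelo's algorithm) rather than raw gradient descent; each step is deterministic given the samples, which is exactly the property a succinct proof of a parameter update would attest to. A sketch of one step:

```python
from collections import defaultdict

def bt_fit_step(strengths: dict, samples: list) -> dict:
    """One minorization-maximization (MM) update for Bradley-Terry.
    strengths maps item -> current latent strength; samples is a list
    of (winner, loser) pairs. Returns the updated, normalized strengths."""
    wins = defaultdict(int)
    pairs = defaultdict(int)  # number of comparisons per unordered pair
    for w, l in samples:
        wins[w] += 1
        pairs[frozenset((w, l))] += 1
    new = {}
    for i, s_i in strengths.items():
        # denominator: sum over opponents j of n_ij / (s_i + s_j)
        denom = sum(
            n / (s_i + strengths[j])
            for pair, n in pairs.items() if i in pair
            for j in pair - {i}
        )
        new[i] = wins[i] / denom if denom > 0 else s_i
    # Normalize so strengths sum to 1 (the model is scale-invariant).
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}
```

Iterating this step converges to the maximum-likelihood strengths; posting one step per proof keeps each on-chain verification small.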

A secondary consideration must be made when fitting the model: if the model is continuously updated with new samples, the gradient descent algorithm becomes stochastic. The convergence properties of stochastic gradient descent are, for most models, not as strong as those of batch descent. It is possible that an adversarial attacker could intentionally order their updates to move the weights into a local minimum that better suits their needs. Without other design decisions finalized it's hard to say how large this problem is, but it should be carefully considered during formalization.

Identity: Worker / Rater Incentives

In many uses of a ranking protocol, there is a need to incentivize the ranking group. If there are U units of work, ideally you'd have U^2 - U rankings (this figure is a rule of thumb among Bradley-Terry practitioners, much like the "30 matches" heuristic for Elo). This can quickly grow to a non-trivial amount of required rating. If the protocol does incentivize sample collection, extra safeguards have to be made for the case of lazy voting: participants ranking randomly or by looking up existing votes.
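
The U^2 - U figure is just the number of ordered pairs of distinct items, so the rating workload grows quadratically in the number of units:

```python
def required_rankings(num_units: int) -> int:
    """Ordered pairs (i, j) with i != j: U^2 - U."""
    return num_units ** 2 - num_units

# 10 units already imply 90 rankings; 100 units imply 9,900.
```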

In the case that there is overlap between the raters R and the work contributors C, the protocol must be conscious of selfish voting: participants ranking the subset containing their own work favorably despite their truthful subjective opinion.

Depending on the combination of design decisions made for the incentives of the system, these problems can be mitigated in a few different ways.

  • Privacy: By hiding existing samples before model fitting, lookup-based lazy voting is prevented.
  • Deanonymization: If we can ensure the sets R and C are independent, we need not worry about selfish voting.
  • Slashing: If raters are required to put up some stake in advance, we can slash when their samples fall outside some number of standard deviations from the mean.
  • Social: If the group of participants knows and trusts one another, you may simply depend on social consensus and secondary reputation to deter poor voting.
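
The slashing rule could be sketched as a post-fit check: score each rater by how often the fitted model agrees with their samples, then flag raters whose agreement falls too far below the group mean. The function names and the deviation threshold here are placeholders, not a worked-out mechanism:

```python
import statistics

def agreement(strengths: dict, rater_samples: list) -> float:
    """Fraction of a rater's (winner, loser) samples the fitted model agrees with."""
    hits = sum(1 for w, l in rater_samples if strengths[w] > strengths[l])
    return hits / len(rater_samples)

def flag_for_slashing(strengths: dict, samples_by_rater: dict, num_std: float = 2.0) -> list:
    """Flag raters whose agreement is more than num_std standard deviations
    below the mean agreement across all raters."""
    scores = {r: agreement(strengths, s) for r, s in samples_by_rater.items()}
    mean = statistics.mean(scores.values())
    std = statistics.pstdev(scores.values())
    return [r for r, sc in scores.items() if sc < mean - num_std * std]
```

Note the failure mode flagged earlier still applies: a rule like this also pressures honest raters towards the perceived mean, so the threshold needs to leave room for genuine disagreement.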

It's probably clear that most implementations of a trustless ranking system would fail with a high proportion of adversarial actors within the rater group R or the work contributor group C. This research optimistically focuses on environments where one can assume exogenous motivators for cooperation. Examples: a group of academics attempting to decide research paper importance, or open source code contributors determining the value of code contributions.

At the very least, each contributor will be able to determine the geometric margin between their own samples and the jointly fit model. If this margin is large despite honest contribution, they can conclude the model and group of raters have been sullied in some capacity and take out-of-protocol actions.

Dream: With a sufficiently large group C contributing roughly uniformly to create U units of work, rated by raters R which are a subset of the contributors C, we can create a trusted and reliable system for ranking U.

Amendment: Stake-weighted

While I'm not sure how to fit this within the existing models yet, one could imagine the contribution of a sample S_j to fitting the model F(\theta | S) being weighted by the reputation or stake of the rater R_i who provided S_j. The result would be a stake-weighted reputation system, which might have better properties against adversaries thanks to incentive compatibility.
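
One way this could be formalized (an assumption, not a worked-out design) is to weight each sample's term in the Bradley-Terry log-likelihood by its rater's stake, so high-stake raters pull the fit harder:

```python
import math

def weighted_log_likelihood(strengths: dict, samples: list, stakes: dict) -> float:
    """Bradley-Terry log-likelihood where each (winner, loser, rater) sample
    contributes in proportion to the rater's stake. Stake units are arbitrary;
    only relative stake matters."""
    total = 0.0
    for winner, loser, rater in samples:
        p = strengths[winner] / (strengths[winner] + strengths[loser])
        total += stakes[rater] * math.log(p)
    return total
```

Maximizing this weighted objective instead of the unweighted one would recover the stake-weighted fit; whether the resulting estimator keeps the usual convergence guarantees is part of the open question above.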

Conclusions

Directions this could be taken

  • Library: on-chain Bradley-Terry / Plackett-Luce sampling and model fitting (both fully on-chain and via succinct zk-proof)
  • General Solidity library with hot-swappable modules for sampling, sample storage, the model, model fitting, and prediction. The goal would be to kick-start an ecosystem of on-chain ranking / reputation systems with common interfaces and modules.
  • Experiments and understanding of the human time complexity of a given ranking system and how it corresponds to quality across different units of work (Git commits vs art)