Fintech · June 3, 2026 · 8 Min Read

    Can We Trust AI With Our Money?

    From M-Shwari to graph neural networks for fraud: a technical and governance look at where algorithmic finance earns trust — and where it fails.

    In Kenya, the first well-documented algorithmic credit scorecard did not run on credit-bureau data, because there wasn't any worth the name. When Safaricom and Commercial Bank of Africa launched M-Shwari in 2013, the model underneath it ate something else entirely: call-detail records, airtime top-up cadence, mobile-money velocity, SIM-registration data cross-referenced against the government's biometric population registry. A person with no bank account, no payslip, and no collateral could be underwritten in seconds by a scorecard that had quietly decided their handset behavior was a better default predictor than anything a loan officer could have asked them in a branch. By the end of 2014 it had disbursed 20.6 million loans to 2.8 million borrowers.

    That is the actual shape of the question "can we trust AI with our money." It is not a philosophical worry about handing the keys to a machine. The machine already has the keys, it has had them for over a decade, and across Lagos, Cairo, Nairobi, Jakarta, and Dhaka it is approving and declining millions of people a day on signals most of them have never seen. The real question is sharper and more uncomfortable: trust it to do what, measured how, with what recourse when it is wrong? That is an engineering question and a governance question before it is an ethical one — and the markets we are talking about are now answering it in production.

    What the model actually does that a human cannot

    Start with why anyone reached for the algorithm in the first place, because the inclusion story is usually told sentimentally and it deserves to be told technically.

    The binding constraint in emerging-market lending is the thin file. Traditional scorecards — logistic regression on bureau tradelines, debt-to-income, utilization — are useless when 50-70% of adults have no bureau footprint at all. There is nothing to regress on. The breakthrough was not "AI is smarter"; it was that machine-learning models can ingest high-dimensional, sparse, non-traditional feature sets and still rank-order risk. Gradient-boosted decision trees — XGBoost, LightGBM, CatBoost — became the workhorse precisely because they handle hundreds of weak, correlated, partially-missing alternative features without the analyst hand-specifying every interaction.
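
    To make the missing-value point concrete, here is a toy gradient-boosting sketch in plain Python. It is not how XGBoost, LightGBM, or CatBoost are actually implemented. It uses depth-one trees (stumps) fit to the negative log-loss gradient. Each split also learns which side a missing value should fall on, an idea the XGBoost paper describes as sparsity-aware split finding with a learned default direction. The two features (months of mobile-money history, airtime top-ups per month) and the eight applicants are hypothetical. Stumps cannot capture interactions; real libraries grow deeper trees for that.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing feature, not a zero


@dataclass
class Stump:
    """A depth-one tree that also remembers where missing values should go."""
    feature: int
    threshold: float
    missing_left: bool  # learned default direction for missing values
    left_value: float
    right_value: float

    def goes_left(self, row: Row) -> bool:
        x = row[self.feature]
        return self.missing_left if x is None else x <= self.threshold

    def predict(self, row: Row) -> float:
        return self.left_value if self.goes_left(row) else self.right_value


def fit_stump(rows: List[Row], targets: List[float]) -> Stump:
    """Least-squares stump; every threshold is tried with missing values sent left and right."""
    mean = sum(targets) / len(targets)
    best_sse, best = math.inf, Stump(0, math.inf, True, mean, mean)  # constant fallback
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows if r[f] is not None}):
            for missing_left in (True, False):
                probe = Stump(f, t, missing_left, 0.0, 0.0)
                left = [g for r, g in zip(rows, targets) if probe.goes_left(r)]
                right = [g for r, g in zip(rows, targets) if not probe.goes_left(r)]
                if not left or not right:
                    continue
                lv, rv = sum(left) / len(left), sum(right) / len(right)
                sse = sum((g - lv) ** 2 for g in left) + sum((g - rv) ** 2 for g in right)
                if sse < best_sse:
                    best_sse, best = sse, Stump(f, t, missing_left, lv, rv)
    return best


Model = Tuple[float, float, List[Stump]]


def fit_gbdt(rows: List[Row], labels: List[int], rounds: int = 20, lr: float = 0.3) -> Model:
    """Gradient boosting for log-loss; assumes both classes are present."""
    base_rate = sum(labels) / len(labels)
    f0 = math.log(base_rate / (1 - base_rate))  # start from the base-rate log-odds
    scores = [f0] * len(rows)
    stumps: List[Stump] = []
    for _ in range(rounds):
        # The negative gradient of log-loss with respect to the log-odds is y - p.
        residuals = [y - 1 / (1 + math.exp(-s)) for y, s in zip(labels, scores)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        scores = [s + lr * stump.predict(r) for s, r in zip(scores, rows)]
    return f0, lr, stumps


def predict_pd(model: Model, row: Row) -> float:
    """Probability of default for one applicant."""
    f0, lr, stumps = model
    score = f0 + lr * sum(st.predict(row) for st in stumps)
    return 1 / (1 + math.exp(-score))


# Hypothetical applicants: [months of mobile-money history, airtime top-ups per month].
# None means the applicant has no mobile-money wallet on record.
rows = [[24, 5], [18, 1], [30, 3], [12, 2],        # repaid (label 0)
        [None, 4], [None, 2], [2, 5], [None, 1]]   # defaulted (label 1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gbdt(rows, labels)
```

    With this toy data the first stump splits on mobile-money history and sends missing values to the higher-risk side. As a result, `predict_pd(model, [None, 3])` comes out above 0.5 and `predict_pd(model, [20, 3])` below it. Nobody had to write a rule saying "no wallet means riskier"; the split search learned it.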

    The lift is real and measurable. In a controlled study on the Home Credit dataset, expanding from 217 to 767 engineered features and selecting them with a model-X knockoffs framework pushed a LightGBM scorecard to an AUC around 0.79, with the alternative-data features delivering statistically significant gains (DeLong test, p < 0.05) over the bureau-only baseline. Industry telemetry tells the same story: best-in-class scorecards now clear a Gini of 0.75 against a consumer-lending average near 0.67, and one documented deployment moved Gini from 0.66 to 0.87 — a 31% jump in discriminatory power — while cutting credit losses 24% and increasing approvals to thin-file customers by 18%. That last number is the whole argument in miniature: better ranking power means you can say yes to more people and lose less money simultaneously. Those are not competing goals when the model is genuinely separating goods from bads; they are the same goal.
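
    The two metrics quoted above are linked by Gini = 2 × AUC - 1, so an AUC around 0.79 corresponds to a Gini around 0.58. The sketch below computes AUC directly as the probability that a random defaulter is scored riskier than a random repayer. It then illustrates the "say yes to more people and lose less" claim: approve the lowest-risk applicants first and find the largest book whose bad rate stays under a cap. The ten applicants, both score vectors, and the 10% cap are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random bad (label 1) is scored riskier than a random good (label 0).

    Higher score = higher risk. Ties count as half a win. O(n^2), fine for a sketch.
    """
    bads = [s for s, y in zip(scores, labels) if y == 1]
    goods = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0 for b in bads for g in goods)
    return wins / (len(bads) * len(goods))


def gini(auc_value: float) -> float:
    """Gini (accuracy ratio) of a scorecard, from its AUC."""
    return 2 * auc_value - 1


def approvals_within_bad_rate_cap(scores: Sequence[float], labels: Sequence[int], cap: float) -> int:
    """Approve lowest-risk applicants first; return the largest approved book whose bad rate <= cap."""
    ranked = sorted(zip(scores, labels), key=lambda p: (p[0], -p[1]))  # ties: bads first (pessimistic)
    best = bads = 0
    for i, (_, y) in enumerate(ranked, start=1):
        bads += y
        if bads / i <= cap:
            best = i
    return best


# Hypothetical: ten applicants, four of whom default, scored by two models.
LABELS = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
WEAK = [0.5, 0.2, 0.6, 0.3, 0.7, 0.4, 0.35, 0.8, 0.55, 0.45]
STRONG = [0.1, 0.2, 0.15, 0.3, 0.25, 0.5, 0.4, 0.8, 0.7, 0.9]

for name, s in (("weak", WEAK), ("strong", STRONG)):
    a = auc(s, LABELS)
    print(name, round(a, 3), round(gini(a), 3), approvals_within_bad_rate_cap(s, LABELS, 0.10))
```

    Under a 10% bad-rate cap, the weak scorer (AUC 0.625) can approve only 2 of the 10 applicants, while the strong one (AUC about 0.96) approves 5. Better ranking buys both more approvals and a lower loss rate.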

    This is why, according to the CBN's 2025 Fintech Report, AI in Nigerian fintech has crossed from experiment to infrastructure: roughly 37.5% of firms now run it for credit scoring and risk, and FairMoney and Carbon issue decisions in minutes based on airtime patterns, utility-payment cadence, and social-commerce transaction history. Across Africa, the modeled credit gap is on the order of $330 billion. You do not close a gap that size with more loan officers. You close it with a scorecard that can underwrite a $40 ticket profitably at near-zero marginal cost.
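
    The unit economics behind the $40-ticket claim can be written down in one line. Suppose the lender earns a flat fee if the loan is repaid and loses the principal (times a loss-given-default, LGD) if it is not. Then the break-even default probability is (fee - cost) / (fee + LGD × principal). The $3 fee, the $0.10 automated marginal cost, and the $15 per-file branch cost below are hypothetical, chosen only to show how marginal cost dominates at small ticket sizes.

```python
def expected_profit(principal: float, fee: float, p_default: float,
                    marginal_cost: float, lgd: float = 1.0) -> float:
    """Expected profit on one loan: the fee if it is repaid, minus principal x LGD if it defaults."""
    return (1 - p_default) * fee - p_default * lgd * principal - marginal_cost


def break_even_pd(principal: float, fee: float, marginal_cost: float, lgd: float = 1.0) -> float:
    """Default probability at which expected profit is exactly zero.

    Solves (1 - p) * fee - p * lgd * principal - cost = 0 for p.
    A negative result means the loan loses money even if nobody defaults.
    """
    return (fee - marginal_cost) / (fee + lgd * principal)


# Hypothetical $40 ticket with a $3 fee and full loss of principal on default.
automated = break_even_pd(40, 3, marginal_cost=0.10)  # near-zero marginal cost
branch = break_even_pd(40, 3, marginal_cost=15.0)     # hypothetical loan-officer cost per file
print(f"automated: {automated:.1%}  branch: {branch:.1%}")
```

    Under these assumptions, the automated lender can absorb a default rate of roughly 6.7% and still break even. The branch-based version loses money at any default rate, because its cost per file exceeds the fee.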

    Fraud is where the case for trust is strongest

    If credit scoring is the contested frontier, fraud detection is where AI has more or less won the argument outright, and it is worth being specific about why.

    Rule-based AML and fraud engines fail in a structurally predictable way: fraudsters learn the thresholds and operate just beneath them — structuring transactions below the $10,000 reporting line, fragmenting amounts, drifting geographies. Static if-then logic cannot model relationships, only conditions. The shift to machine learning, and specifically to graph neural networks, changed the unit of analysis from the transaction to the network. A GNN aggregates signal across multi-hop neighborhoods of accounts, so it surfaces the mule ring and the synthetic-identity cluster that no single transaction would ever trip. On the benchmark Elliptic blockchain dataset, ensemble GNNs detect over 70% of illicit transactions at a false-positive rate under 1%; an RL-GNN fusion on the IEEE-CIS data hit 0.872 AUROC with a 33% reduction in false positives versus the GNN baseline, at ~42ms inference latency — fast enough to score inline, before authorization.
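
    The difference between judging transactions and judging networks fits in a small sketch. The ledger, the 90% "near-threshold" band, and the scoring are hypothetical. The propagation step is a fixed, unlearned mean aggregation over neighbors, the operation a GNN layer wraps with learned weights and a nonlinearity. It is not a trained GNN.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Transfer = Tuple[str, str, float]  # (sender, receiver, amount)

REPORTING_THRESHOLD = 10_000
NEAR_BAND = 0.9  # hypothetical: "just beneath" = within 10% of the line


def rule_flags(txs: List[Transfer]) -> List[Transfer]:
    """Static per-transaction rule: each amount is judged in isolation."""
    return [t for t in txs if t[2] >= REPORTING_THRESHOLD]


def node_features(txs: List[Transfer]) -> Dict[str, float]:
    """Per-account count of transfers that sit just under the threshold."""
    feats: Dict[str, float] = defaultdict(float)
    for sender, receiver, amount in txs:
        near = NEAR_BAND * REPORTING_THRESHOLD <= amount < REPORTING_THRESHOLD
        for acct in (sender, receiver):
            feats[acct] += 1.0 if near else 0.0
    return dict(feats)


def propagate(txs: List[Transfer], feats: Dict[str, float],
              hops: int = 1, self_weight: float = 0.5) -> Dict[str, float]:
    """Unlearned mean-aggregation message passing over the undirected transfer graph."""
    neighbors: Dict[str, set] = defaultdict(set)
    for sender, receiver, _ in txs:
        neighbors[sender].add(receiver)
        neighbors[receiver].add(sender)
    h = dict(feats)
    for _ in range(hops):
        # Each account mixes its own signal with the average signal of its neighbors.
        h = {
            v: self_weight * h[v] + (1 - self_weight) * sum(h[u] for u in neighbors[v]) / len(neighbors[v])
            for v in h
        }
    return h


# Hypothetical ledger: nine mule accounts each push 9,500 into one collector,
# plus one ordinary purchase for contrast.
transfers: List[Transfer] = [(f"mule{i}", "collector", 9_500) for i in range(9)]
transfers.append(("alice", "shop", 120))

print(rule_flags(transfers))  # prints [] : no single transfer trips the rule
scores = propagate(transfers, node_features(transfers))
print(scores["mule0"], scores["collector"], scores["alice"])  # prints 5.0 5.0 0.0
```

    After one hop, every member of the ring carries the same elevated score, and the unrelated account stays at zero. More hops spread the signal further out along the graph.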

    That false-positive number is not a vanity metric. Every false positive is a frozen card, a declined legitimate payment, a customer-service call. Major banks deploying ML scoring report false-positive reductions of 60-90%, which is simultaneously a fraud-loss story and a customer-experience story. In Nigeria — tens of millions of digital transactions a day, with SIM-swap, account-takeover, and synthetic-ID vectors that mutate weekly — 87.5% of surveyed fintechs now run AI primarily for fraud. There is no rules engine on earth that keeps pace with that. This is the part of the ledger where "trust the AI" is not aspirational; it is the only defensible position.
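
    Why false positives dominate the customer experience follows from Bayes' rule. When fraud is rare, even a small false-positive rate means most alerts hit legitimate customers. The sketch reuses the 70% detection and 1% false-positive figures quoted above, then applies the low end of the 60 to 90% reduction range. The 0.1% fraud base rate and the 20 million daily transactions are hypothetical.

```python
def alert_precision(fraud_rate: float, tpr: float, fpr: float) -> float:
    """Share of raised alerts that are real fraud, by Bayes' rule."""
    true_alerts = fraud_rate * tpr
    false_alerts = (1 - fraud_rate) * fpr
    return true_alerts / (true_alerts + false_alerts)


def false_alerts_per_day(txns_per_day: int, fraud_rate: float, fpr: float) -> float:
    """Legitimate transactions wrongly flagged (frozen cards, declines) per day."""
    return txns_per_day * (1 - fraud_rate) * fpr


# Hypothetical inputs: 0.1% fraud base rate, 70% detection, 20 million transactions a day.
for fpr in (0.01, 0.004):  # 1% baseline, then a 60% reduction
    print(fpr, round(alert_precision(0.001, 0.70, fpr), 3),
          round(false_alerts_per_day(20_000_000, 0.001, fpr)))
```

    At a 1% false-positive rate only about 6.5% of alerts are real fraud, and roughly 200,000 legitimate payments a day are stopped. A 60% cut lifts precision to about 15% and brings daily false alerts down to roughly 80,000.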

    Now the part the pitch deck skips

    A persuasive case that cannot survive its own failure modes is propaganda. So here is the failure mode, in numbers.

    The same architecture that includes at scale can extract at scale, and it has. In Kenya, 2021 APR estimates ran to 91% on M-Shwari, 180% on Tala and Branch, and a frankly predatory 442% on KopaCash. M-Shwari borrowers self-reported digital-loan default at around 14% against 8% for other channels; by 2017 an estimated 2.7 million Kenyans had been negatively listed with credit-reference bureaus over digital loans — "blacklisting" that can mean long-term exclusion from all semi-formal credit. And the welfare evidence is genuinely mixed: the strongest RCT, Suri et al. on M-Shwari, found that crossing the eligibility threshold cut the probability of foregoing an expense by a few points — real consumption-smoothing — but the broader synthesis across Kenya, Malawi and Nigeria concludes digital credit does not systematically improve lives, while opacity around terms invites predation. A Mexican deployment showed default rates near 27%. Indonesia's OJK logged rising P2P default ratios through Q1 2025.
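
    Short-tenor fees annualize quickly. As an illustration, assume a flat 7.5% fee on a 30-day loan (the fee level commonly reported for M-Shwari, treated here as an assumption rather than a sourced figure). Simple annualization then lands at about 91%, in line with the estimate above. Rolling the loan over back to back and compounding pushes the effective rate much higher.

```python
def nominal_apr(fee_rate: float, term_days: int) -> float:
    """Simple annualization: the per-term fee times the number of terms in a 365-day year."""
    return fee_rate * 365 / term_days


def effective_annual_rate(fee_rate: float, term_days: int) -> float:
    """Compounded annualization, assuming the loan is rolled over back to back all year."""
    return (1 + fee_rate) ** (365 / term_days) - 1


fee, term = 0.075, 30  # assumed: 7.5% flat fee on a 30-day loan
print(f"nominal APR {nominal_apr(fee, term):.0%}, effective {effective_annual_rate(fee, term):.0%}")
```

    Under that assumption the nominal figure is roughly 91% and the effective annual rate about 141%. A borrower who sees only "7.5%" on the screen is looking at neither number.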

    Then there is the collections layer, which is where algorithmic lending turned genuinely dark. The standard predatory stack is well documented in a cross-country measurement study of 434 apps spanning Indonesia, Kenya, Nigeria, Pakistan and the Philippines. The app demands READ_CONTACTS, camera, and storage permissions as a precondition of disbursement. Then, on default, it weaponizes that social graph by mass-messaging the borrower's contacts with defamatory shaming. In India, fraudulent-loan-app complaints rose sharply through 2025, and at least one student's death was tied to harassment by a lending app. Kenya's competition authority logged a 28% year-on-year jump in lender complaints.
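
    One way to screen an app for the permission stack described above is to read the `uses-permission` entries in its manifest. This is sketched here as an assumption, not as the method of the study cited. Real APKs ship the manifest as binary XML, so the sketch assumes it has already been decoded to text (for example with a decoder such as apktool). Mapping "storage" to the legacy `READ_EXTERNAL_STORAGE` permission is also an assumption, and the sample manifest and package name are hypothetical.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# The permission stack described in the text (storage mapped to one legacy permission).
RISKY_PERMISSIONS = {
    "android.permission.READ_CONTACTS",
    "android.permission.CAMERA",
    "android.permission.READ_EXTERNAL_STORAGE",
}


def risky_permissions(manifest_xml: str) -> set:
    """Return the risky permissions requested by a decoded (plain-text) AndroidManifest.xml."""
    root = ET.fromstring(manifest_xml.strip())
    requested = {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}
    return requested & RISKY_PERMISSIONS


# Hypothetical manifest of a lending app.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.quickloan">
  <uses-permission android:name="android.permission.INTERNET" />
  <uses-permission android:name="android.permission.READ_CONTACTS" />
  <uses-permission android:name="android.permission.CAMERA" />
</manifest>
"""
print(sorted(risky_permissions(SAMPLE_MANIFEST)))
```

    A manifest check only shows what an app asks for, not what it does with the data. It is a cheap first filter, not evidence of harassment.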

    And underneath all of it sits the black-box problem that should worry any practitioner. A gradient-boosted ensemble over 300 features is not natively explainable. If it declines an applicant, "the model said so" is not an adverse-action reason — it is an unfalsifiable verdict. Models trained on historically biased repayment data will faithfully reproduce that bias, redlining by proxy through features correlated with geography or gender, with no human ever having chosen to discriminate. The amplifier does not care which direction you point it.
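
    A common first screen for bias by proxy is the adverse impact ratio: one group's approval rate divided by a reference group's. It is often compared against the 0.8 "four-fifths" heuristic borrowed from US employment-selection guidance. In the hypothetical pool below, the scorer never reads gender and approves on region alone. Because region and gender are correlated among applicants, the outcome is skewed anyway.

```python
from typing import Sequence

FOUR_FIFTHS = 0.8  # screening heuristic borrowed from US employment-selection guidance


def approval_rate(approved: Sequence[bool], groups: Sequence[str], group: str) -> float:
    """Fraction of applicants in `group` who were approved."""
    picks = [a for a, g in zip(approved, groups) if g == group]
    return sum(picks) / len(picks)


def adverse_impact_ratio(approved: Sequence[bool], groups: Sequence[str],
                         protected: str, reference: str) -> float:
    """Approval rate of the protected group divided by that of the reference group."""
    return approval_rate(approved, groups, protected) / approval_rate(approved, groups, reference)


# Hypothetical pool in which region and gender are correlated.
applicants = ([("north", "f")] * 8 + [("north", "m")] * 2 +
              [("south", "f")] * 2 + [("south", "m")] * 8)
# The scorer never reads gender: it approves on region alone.
approved = [region == "south" for region, _ in applicants]
genders = [gender for _, gender in applicants]

ratio = adverse_impact_ratio(approved, genders, protected="f", reference="m")
print(f"adverse impact ratio {ratio:.2f}; below four-fifths: {ratio < FOUR_FIFTHS}")
```

    Women here are approved at a quarter of the rate of men, even though no line of the scorer mentions gender. A ratio test cannot say why the gap exists. It can only say the gap is there and needs explaining.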

    Trust is an architecture, not a sentiment

    Here is where I land, and why I land on the optimistic side despite all of the above.

    Trust in a financial system has never meant believing the counterparty is virtuous. It means the system is engineered so that good behavior is enforced and bad behavior is detectable. We do not trust banks because bankers are honest; we trust them because of capital requirements, audit, deposit insurance, and a regulator with subpoena power. The question for AI is identical: is the scaffolding being built? And in 2025-2026, across exactly these markets, it demonstrably is.

    The toolkit is converging on four pillars, and they are technical, not rhetorical.

    Explainability. SHAP values and reason-code generation are moving from nice-to-have to regulatory baseline. The principle that "the algorithm decided" is no longer a legal defense has hardened across jurisdictions, which means every adverse action needs a human-legible reason.

    Consent and data minimization. Nigeria's FCCPC DEON regulations, effective July 2025, prohibit the contact-list harvesting that powered the harassment stack and mandate plain-language disclosure of all terms before disbursement. The CBN's open-banking framework classifies credit-scoring data as "high and sensitive risk", with standardized APIs and strict consent management.

    Supervision that is itself algorithmic. The CBN is moving toward SupTech: regulators running their own models to audit fintech scorecards for bias and stability in near-real-time. That is the only supervisory approach that can actually keep pace with the thing it supervises.

    Controlled experimentation. Morocco's Bank Al-Maghrib sandbox, together with Egypt's FinTech Hub and pending AI-governance bill, lets novel models be tested under observation rather than loose in the wild. Egypt's instant-payment rails (InstaPay: 1.1 billion transactions, ~EGP 2.4 trillion in H1 2025 alone) show what supervised scale looks like.
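
    For a model that is additive in log-odds, such as a logistic scorecard, SHAP values have a closed form when features are treated as independent. Each feature's contribution is weight × (applicant value - baseline mean), and the contributions sum exactly to the gap between the applicant's score and the baseline score. Reason codes are then the top features pushing risk up. Tree ensembles need an algorithm such as TreeSHAP instead. The feature names, weights, means, and reason texts below are hypothetical, and real adverse-action wording is set by each jurisdiction's rules.

```python
from typing import Dict, List

# Hypothetical logistic default model: a positive weight raises the log-odds of default.
INTERCEPT = -2.0
WEIGHTS = {
    "days_since_last_topup": 0.04,
    "mobile_money_months": -0.08,
    "failed_repayments": 0.9,
    "utility_bills_paid_12m": -0.15,
}
# Hypothetical portfolio averages, used as the SHAP baseline.
MEANS = {"days_since_last_topup": 10, "mobile_money_months": 18,
         "failed_repayments": 0.3, "utility_bills_paid_12m": 8}
REASON_TEXT = {
    "days_since_last_topup": "Long gap since last airtime top-up",
    "mobile_money_months": "Short mobile-money history",
    "failed_repayments": "Previous failed repayments",
    "utility_bills_paid_12m": "Few utility bills paid in last 12 months",
}


def log_odds(x: Dict[str, float]) -> float:
    """Model score (log-odds of default) for one applicant."""
    return INTERCEPT + sum(w * x[f] for f, w in WEIGHTS.items())


def contributions(x: Dict[str, float]) -> Dict[str, float]:
    """Linear SHAP values, assuming independent features: w_i * (x_i - mean_i)."""
    return {f: w * (x[f] - MEANS[f]) for f, w in WEIGHTS.items()}


def reason_codes(x: Dict[str, float], k: int = 2) -> List[str]:
    """Top-k features that pushed this applicant's risk above the average applicant's."""
    c = contributions(x)
    adverse = sorted((f for f in c if c[f] > 0), key=lambda f: c[f], reverse=True)
    return [REASON_TEXT[f] for f in adverse[:k]]


applicant = {"days_since_last_topup": 40, "mobile_money_months": 6,
             "failed_repayments": 2, "utility_bills_paid_12m": 8}
print(reason_codes(applicant))
# Additivity check: contributions sum to the gap between this score and the baseline score.
print(round(sum(contributions(applicant).values()), 6), round(log_odds(applicant) - log_odds(MEANS), 6))
```

    For this applicant, the two reasons returned are previous failed repayments and the long gap since the last top-up. Each can be checked against the applicant's own data, which is what makes it an adverse-action reason rather than a verdict.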

    This is the difference between the two versions of the same technology. The version that reaches the Kano trader and the Casablanca shopkeeper (explainable, consented, supervised, with a real adverse-action reason and a redress channel) is the most powerful inclusion engine financial services has ever produced. The version that hides behind an unauditable ensemble and exfiltrates a contact list is a debt trap wearing the same clothes. The underlying models can be nearly identical. Everything that distinguishes them is design and governance.

    So — can we?

    Yes, conditionally, and the condition is the entire substance of the answer: we can trust AI with our money exactly to the degree that we refuse to trust it blindly. Trust scales with auditability, not with accuracy. A 0.87-Gini scorecard you cannot explain is more dangerous than a 0.70 one you can, because when the first one is wrong, it is wrong with confidence and leaves no paper trail.

    The mature posture is not human or machine. It is the machine doing what it is unambiguously better at — ranking risk across high-dimensional sparse data, catching fraud across transaction graphs in milliseconds, underwriting at a marginal cost that finally makes the poor bankable — bounded by humans holding the questions a model cannot answer about itself: Is this fair? Can you explain this specific decline? Who does this feature set systematically exclude? What happens to the borrower we got wrong?

    We humans built these systems. We chose the objective function, the training data, the permission scopes, the collections policy. That authorship does not evaporate because the decision now happens in 42 milliseconds — it concentrates. The markets getting this right in 2026 are not the ones that trusted the AI most. They are the ones that never stopped interrogating it.