Comparing Context Retrieval Approaches for AI Code Review

Overview

At Compare The Market, we have an internal AI tool that automatically reviews merge requests. The goal is to reduce the time it takes for developers to receive feedback on their code and to increase MR throughput across the organisation. Developers get an intelligent first-pass review within minutes of opening an MR.

It works well, but we wanted it to work better. The reviewer erred on the side of caution, and we often saw false positives in bug detection because of the limitations of reviewing code changes in isolation. We wanted to give the reviewer an understanding of how a change sits within the broader system. For example, a deleted function might seem like dead code, unless you know it’s called dynamically from another service.

At this point we faced a fundamental architectural decision: How should the agent retrieve context about the codebase?

We had two main options:

GKG (GitLab Knowledge Graph): A code analysis engine that uses Tree-sitter AST parsing (via gitlab-code-parser) to build a structured knowledge graph of code entities and relationships, stored in a Kuzu graph database. This enables precise queries like “find all callers of this function” or “show me the class hierarchy”.
RAG (Retrieval-Augmented Generation): A vector similarity search approach that chunks code, creates embeddings, and retrieves semantically similar code snippets.

We chose GKG based on intuition – our hypothesis was that code review requires structural understanding of code relationships, not just semantic similarity. When reviewing a change to a function, you need to know what calls it, what it calls, and how it fits into the broader architecture. RAG excels at finding “similar” code, but similarity isn’t the same as relevance for code review.

This article validates that intuition. Through rigorous evaluation using MLflow on Databricks, we compared four approaches and found that GKG outperforms RAG on the metrics that matter most for code review quality. The data confirms our architectural decision was correct.

1. The Four Approaches

We evaluated four distinct configurations:

Baseline – the reviewer sees only the diff, with no context retrieval tools.
GKG – the reviewer can query the GitLab Knowledge Graph through its MCP tools.
RAG – the reviewer retrieves semantically similar code via vector search.
GKG+RAG – the reviewer has access to both sets of tools.

2. GKG Integration

What is GKG?

Last year, GitLab introduced a beta version of an MCP (Model Context Protocol) server called the GitLab Knowledge Graph (GKG). The service indexes the repository and builds a structured, queryable representation of the codebase. It maps dependencies onto nodes in a graph, understands function definitions and their usage, traces inheritance hierarchies, and captures cross-references between modules.

The result is a semantic map of your code – not just a list of files, but a web of relationships. Through the tools provided by the MCP server, AI agents can query this graph in real time:

“Where is this function called?”
“What classes inherit from this interface?”
“What would be affected if I changed this method signature?”
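Concretely, an MCP tool call is a JSON-RPC request. The sketch below follows the MCP tools/call method shape; the tool name get_references appears later in this article, but the argument names are our illustrative assumptions, not GKG’s actual schema.

```python
import json

# Shape of an MCP "tools/call" request the reviewer might send to the GKG
# server. The tool name `get_references` is real; the argument names below
# are illustrative assumptions, not the server's actual schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_references",
        "arguments": {"symbol": "validate_input", "file": "services/user.py"},
    },
}

print(json.dumps(request, indent=2))
```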

Our Sidecar Integration
Because GKG was still in beta and not yet available as a native GitLab CI/CD feature, we built a separate sidecar service – a lightweight Docker container that wraps the official GKG binary and runs alongside our reviewer in the CI pipeline.

The Workflow
Index – When a merge request pipeline kicks off, the sidecar container mounts the project source and indexes the full codebase, building the knowledge graph from scratch.
Serve – Once indexed, it starts the GKG MCP server on a local port, exposing a set of tool calls.
Query – Our AI reviewer connects to the MCP server and uses these tools as part of its review workflow.

How the Knowledge Graph Works
GKG builds a symbol graph — a structured representation of your codebase where nodes represent code entities (classes, functions, variables) and edges represent relationships (calls, inherits, imports).

The Indexed Project Graph
When GKG indexes a repository, it creates an interactive graph visualisation showing the entire codebase structure. Here’s what the indexed graph might look like for an example project:

Each node type represents a different code entity:

Orange (Directory) – Folder structure of the repository.
Green (File) – Individual source files.
Purple (Definition) – Classes, functions, and methods defined in the code.
Blue (Imported Symbol) – External dependencies and imports.

The edges (lines) show relationships: which files contain which definitions, which functions call other functions, and which modules import which symbols.
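A minimal in-memory sketch of this node/edge model, with invented file and symbol names (the real graph lives in a Kuzu database; this only mirrors its shape):

```python
# Toy version of the GKG node/edge model: node -> type, plus typed edges.
nodes = {
    "src/": "Directory",
    "src/user_service.py": "File",
    "UserService": "Definition",
    "UserService.get_user": "Definition",
    "requests": "ImportedSymbol",
}
edges = [
    ("src/", "contains", "src/user_service.py"),
    ("src/user_service.py", "defines", "UserService"),
    ("UserService", "defines", "UserService.get_user"),
    ("src/user_service.py", "imports", "requests"),
]

# Structural query: "which file defines UserService?"
definers = [s for s, rel, t in edges if rel == "defines" and t == "UserService"]
print(definers)  # → ['src/user_service.py']
```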

Example: Symbol Graph Structure
Consider a simple UserService class. GKG maps it as a graph showing the class, its methods, and all the files that call those methods:

Example: Querying the Graph
When the AI reviewer needs to understand the impact of a change, it queries GKG. For example, if someone modifies validate_input(), the agent asks: “Who calls this function?”

This precise information allows the reviewer to assess whether a change to validate_input() could break any of its callers, something the diff alone cannot reveal.
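A toy version of that query, assuming a caller-to-callee edge list with invented file and function names:

```python
# Toy call graph: edges point from caller to callee. Names are invented.
calls = [
    ("api/handlers.py:create_user", "validate_input"),
    ("api/handlers.py:update_user", "validate_input"),
    ("jobs/import.py:bulk_import", "validate_input"),
    ("api/handlers.py:create_user", "save_user"),
]

def callers_of(func):
    """Answer 'who calls this function?' by scanning caller→callee edges."""
    return sorted({caller for caller, callee in calls if callee == func})

print(callers_of("validate_input"))
```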

How GKG Improves Review Quality

The reviewer can now use GKG to verify the initial concerns our discovery agent identifies against the wider codebase. For example, if a change looks like it might break a contract, the agent traces the dependency chain to verify before reporting it as an issue.

By verifying initial concerns against the actual structure of the codebase, the reviewer produces more accurate, more consistent feedback, and critically, fewer false positives in bug detection.

GKG Architecture Overview

3. RAG Integration

Our RAG implementation uses LlamaIndex for intelligent code chunking and OpenAI embeddings for vector similarity search.

Phase 1: Indexing Pipeline

The indexing pipeline:

Source Files: Scan repository for code files (.py, .js, .ts, etc.).
CodeSplitter: LlamaIndex AST-aware chunking.
Code Chunks: Semantic units (functions, classes) with overlap.
Embeddings: OpenAI text-embedding-3-small.
FAISS: Vector store for fast similarity search.
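The pipeline’s shape can be sketched end to end with stand-ins: a hash-based toy embedding replaces text-embedding-3-small and a plain list replaces FAISS, so the sketch runs without any external services.

```python
import hashlib
import math

def toy_embed(text, dims=16):
    """Deterministic stand-in for a real embedding model: hash tokens into
    buckets, then L2-normalise the resulting vector."""
    vec = [0.0] * dims
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def index_chunks(chunks):
    """FAISS stand-in: a plain list of (chunk, vector) pairs."""
    return [(c, toy_embed(c)) for c in chunks]

store = index_chunks(["def save_user(user): ...", "class UserService: ..."])
print(len(store))  # → 2
```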

Phase 2: Retrieval Pipeline

LlamaIndex CodeSplitter: Graph-First Approach

The key component of our RAG implementation is the CodeSplitter from LlamaIndex. Unlike naive text chunking, CodeSplitter uses a graph-first approach: it first builds an AST (Abstract Syntax Tree) graph of the code, then uses this structural understanding to create semantically meaningful chunks.

Step 1: Build the AST Graph

Before any chunking happens, Tree-sitter parses the source code into a hierarchical graph structure. This graph represents the syntactic structure of the code – functions, classes, methods, and their relationships.

Step 2: Traverse Graph to Create Chunks

Once the AST graph is built, CodeSplitter traverses it to identify semantic boundaries. Instead of blindly cutting every 40 lines the way a naive fixed-window chunker would, it finds natural break points:

Function boundaries: Each function becomes a chunk (or multiple if large).
Class boundaries: Class definitions with their methods.
Logical groupings: Related code stays together.
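The idea can be reproduced with Python’s standard ast module: a minimal chunker that cuts at top-level function and class boundaries rather than at fixed line counts. (CodeSplitter itself uses Tree-sitter and supports many languages; this sketch is Python-only.)

```python
import ast

source = """\
import os

def load(path):
    return os.path.exists(path)

class Cache:
    def get(self, key):
        return None
"""

def ast_chunks(src):
    """Chunk source at top-level function/class boundaries using the AST."""
    tree = ast.parse(src)
    lines = src.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            # lineno/end_lineno give each definition's exact span (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

for chunk in ast_chunks(source):
    print(chunk.split("\n")[0])
```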

Why RAG May Struggle with Code Review

The Fundamental Limitation: RAG relies on semantic similarity in vector space, which works well for natural language but has inherent limitations for code:

1. No Symbol Resolution
RAG cannot distinguish between a function definition and a function call with the same name. It treats def process_data() and process_data() as semantically similar without understanding the relationship.

2. No Reference Tracking
When reviewing a change to a function, RAG cannot reliably find all callers of that function. It may return semantically similar but unrelated code.

3. Chunk Boundary Issues
Even with AST-aware chunking, important context may be split across chunks. A function’s signature might be in one chunk while its implementation is in another.

4. Precision vs Recall Trade-off
RAG optimises for semantic similarity, which may surface “related” code that isn’t actually relevant to the review. This can overwhelm the LLM with noise.
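The first two limitations can be illustrated with a crude token-overlap similarity standing in for embedding similarity. For a query containing the definition of process_data, both a call site and a differently named function score high, and nothing in the scores says which candidate is the definition:

```python
def tokens(s):
    """Crude code tokeniser: strip punctuation, split on whitespace."""
    return set(s.replace("(", " ").replace(")", " ").replace(":", " ").split())

def similarity(a, b):
    """Jaccard overlap, a stand-in for embedding cosine similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

query = "def process_data(records):"
corpus = [
    "process_data(records)",           # a call site, not the definition
    "def process_user_data(records):", # a different function entirely
]
scores = {snippet: round(similarity(query, snippet), 2) for snippet in corpus}
print(scores)
```

Both candidates look “similar”, yet neither relationship (call site vs unrelated definition) is visible to the retriever.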

4. Evaluation Setup

The Challenge: Measuring Quality Without a Single Correct Answer

Evaluating a generative AI reviewer is fundamentally different from evaluating a classifier or a search engine. A code review has no canonical correct output – two expert engineers reviewing the same merge request will write different, equally valid feedback. This means standard accuracy metrics don’t apply, and simple text comparison is meaningless.

The challenge compounds when you consider that quality is multidimensional. A review might identify the right risks but overstate severity. It might be perfectly calibrated in scoring but miss the one critical issue. It might surface the correct concern but point to the wrong line of code. Each of these failure modes matters independently, and optimising for one can actively degrade another. This is what makes rigorous AI evaluation hard: the outputs are open-ended, non-deterministic, and semantically rich.

Ground Truth: A Curated Reference Dataset of 79 MRs

To evaluate reliably, we need a fixed reference point. We constructed a golden dataset of 79 real merge requests from our internal repositories, a structured benchmark where each entry carries expert-annotated ground truth across multiple dimensions:

Expected summary points – the key changes and their implications.
Expected issues – risks, bugs, and concerns that a thorough reviewer should surface.
Expected inline comments – specific defects, tied to exact file locations and code lines.
Expected score range – the range of severity scores a calibrated human reviewer would assign.

Constructing this dataset required deliberate effort. The MRs represent real production complexity, covering different codebases, change sizes, and types, rather than easy cases favouring any approach. Each entry was annotated with a consistent rubric defining a thorough, well-calibrated review: identifying the right issues, flagging the correct lines, and assigning a severity score reflecting genuine impact.

Scoring: Core Quality Dimensions

All four approaches (Baseline, GKG, RAG, GKG+RAG) were evaluated on the same 79 entries. Each generated review was assessed across five core quality dimensions:

Coverage – How many expected summary points, issues, and inline comments appear in the output? A review missing a critical bug scores low here regardless of addressing other issues.

Precision – Of everything the reviewer surfaced, how much was actually warranted? This penalises over-generation: a review flagging 20 issues when 5 were expected may have good coverage but poor precision, indicating a noisy, unreliable reviewer.

Inline comment location accuracy – Code review is inherently spatial. It’s not sufficient to identify a problem in prose; the comment needs to be attached to the correct file and anchored to the relevant code change. We verified this separately from semantic correctness.

Score calibration – The reviewer assigns a severity score from 0–10. We measured whether that score fell within the range a human would consider reasonable. This matters for developer trust: a reviewer that consistently over- or under-penalises loses credibility quickly.

Structural validity – a schema check ensuring required fields are present and well-formed (e.g., non-empty summary and score, inline comments include required fields). All approaches consistently passed this check in our runs.
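In sketch form, with set-valued expectations per MR (in the real pipeline, matching is semantic via the LLM judge rather than exact string equality, and the issue names below are invented):

```python
# Toy per-MR metrics: coverage, precision, and score calibration.
expected_issues = {"race condition in cache", "missing input validation", "n+1 query"}
found_issues = {"missing input validation", "n+1 query", "inconsistent naming"}

matched = expected_issues & found_issues
coverage = len(matched) / len(expected_issues)   # how much of the rubric was found
precision = len(matched) / len(found_issues)     # how much of the output was warranted

expected_score_range = (4, 7)
reviewer_score = 6
score_in_range = expected_score_range[0] <= reviewer_score <= expected_score_range[1]

print(round(coverage, 2), round(precision, 2), score_in_range)
```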

LLM-as-Judge: Semantic Evaluation at Scale

For nuanced criteria, such as whether a generated summary “covers the same point” as a reference, text matching fails. “The function signature changed” and “the method contract was modified” are semantically equivalent but lexically distant.

We used an LLM as an automated judge. Given the reference expectations and generated output, the judge counted how many expected points were covered, evaluating intent and meaning, not phrasing. This LLM-as-judge technique is standard in generative AI evaluation and enables semantic scoring at the scale of thousands of data points without manual review for each result.
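The judge prompt might be shaped like the sketch below. The wording is our illustration; the actual prompt and judge model are internal details not shown in this article.

```python
def build_judge_prompt(expected_points, generated_summary):
    """Assemble an LLM-as-judge prompt comparing expectations to output.
    The phrasing here is a hypothetical example, not the production prompt."""
    bullet_list = "\n".join(f"- {p}" for p in expected_points)
    return (
        "You are judging a code review summary.\n"
        f"Expected points:\n{bullet_list}\n\n"
        f"Generated summary:\n{generated_summary}\n\n"
        "For each expected point, answer COVERED or MISSED based on meaning, "
        "not exact wording. Reply with one line per point."
    )

prompt = build_judge_prompt(
    ["the function signature changed"],
    "The method contract was modified.",
)
print(prompt)
```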

Metrics Evaluated

All metrics are evaluated against a golden dataset – a curated set of expected outputs for each MR. This means results are heavily dependent on the quality and completeness of the golden set expectations.

5. Outcomes & Conclusions

Radar Overview

Summary Table – Mean per Group

Key Findings

Finding 1: GKG Outperforms RAG for Code Review

GKG consistently outperformed RAG across coverage metrics:

Inline Comments Coverage: GKG 0.696 vs RAG 0.577 (+21%).
Issue Coverage: GKG 0.929 vs RAG 0.926 (marginal).
Summary Coverage: GKG 0.681 vs RAG 0.664 (+3%).

GKG’s AST-based symbol resolution enables precise identification of callers, references, and definitions — exactly what’s needed for thorough code review.

Finding 2: RAG Performs Worse Than BASELINE

Surprisingly, RAG underperformed the baseline (no context tools) on nearly all metrics:

Inline Comments Coverage: RAG 0.577 vs BASELINE 0.658 (-12%).
Summary Coverage: RAG 0.664 vs BASELINE 0.675 (-2%).
Score In Range: RAG 0.570 vs BASELINE 0.646 (-12%).
Issue Precision: RAG 0.422 vs BASELINE 0.438 (-4%).

Root causes identified:
Noise introduction: Vector similarity retrieves code that “looks similar” but isn’t relevant, distracting the model.
False positives: RAG finds process_user_data when searching for process_data.
Per-file limitation: RAG’s AST chunking is per-file only – no cross-file relationship understanding.
Distraction effect: Additional context can mislead rather than help when it’s not precisely relevant.

Finding 3: Understanding Issue Coverage vs Precision

BASELINE achieved the highest issue precision (0.438) while GKG had the lowest (0.411). However, this requires careful interpretation:

BASELINE generates fewer issues: Higher precision often means fewer total issues returned, not more accurate issues.
GKG finds more issues: Lower precision may indicate GKG is finding legitimate issues not anticipated in the golden set.
Golden set limitation: Precision penalises finding valid issues that weren’t in the expected list.

Key insight: A reviewer that generates fewer issues will naturally have higher precision (fewer chances to “miss” the golden set), but this doesn’t mean it’s more accurate — it may simply be more conservative.
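A small worked example makes the effect concrete (issue names invented): with the same matching logic, a conservative reviewer tops precision while a thorough one tops coverage.

```python
# Golden set of expected issues for one MR.
golden = {"bug A", "bug B", "bug C", "bug D"}

conservative = {"bug A", "bug B"}                  # 2 found, both in the golden set
thorough = {"bug A", "bug B", "bug C", "novel X"}  # 3 golden hits + 1 valid-but-unlisted

def precision(found):
    return len(found & golden) / len(found)

def coverage(found):
    return len(found & golden) / len(golden)

print(precision(conservative), coverage(conservative))  # → 1.0 0.5
print(precision(thorough), coverage(thorough))          # → 0.75 0.75
```

Note that “novel X” drags down the thorough reviewer’s precision even if it is a genuine finding, exactly the golden-set limitation described above.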

Finding 4: GKG Uses More Tool Calls Effectively

GKG averaged 3.27 tool calls per review vs RAG’s 1.23 calls. The additional calls translate to better coverage because:

get_references finds exact callers/callees.
repo_map provides structural overview.
Each call returns precisely relevant code, not semantic approximations.

Finding 5: GKG outperforms GKG + RAG

GKG+RAG makes fewer GKG tool calls than GKG alone (1.92 GKG + 0.44 RAG per review, versus 3.27 for GKG alone), but doesn’t show improved metrics over the individual approaches. GKG alone still performs best on summary coverage and issue coverage.

Finding 6: Results Depend Heavily on the Golden Dataset

All metrics are evaluated against a curated golden dataset of expected outputs. This introduces important limitations:

Incomplete expectations: The golden set may not capture all valid issues a reviewer could find.
Subjectivity: What constitutes a “matching” issue is determined by an LLM judge, introducing variability.
Bias towards conservative reviewers: Reviewers that generate fewer outputs will score higher on precision even if they miss valid findings.
Coverage ceiling: Coverage can only reach 100% if the reviewer finds exactly what was expected — novel valid findings don’t improve coverage.

These results should be interpreted as alignment with expectations, not absolute measures of review quality.

Why GKG Works Better for Code Review

Recommendations

For AI Code Review Tools:

1. Use AST-based tools like GKG for code context retrieval – they provide the precision needed for structural queries.
2. Avoid RAG for symbol resolution tasks – vector similarity cannot reliably distinguish definitions from calls, or find exact references.
3. Consider BASELINE for simple MRs – the diff alone is often sufficient, and adding noisy context can hurt.
4. Measure before adding context – more context isn’t always better; precision matters as much as coverage.

When RAG Might Be Better

While RAG underperformed for code review, it may still be valuable for:

Documentation search: Finding relevant README sections or comments.
Conceptual queries: “How does authentication work in this codebase?”.
Pattern discovery: Finding code that implements similar concepts (not exact symbols).
Fallback option: When GKG is unavailable or for languages without AST parser support.

Latency and Cost Trade-offs

Beyond quality metrics, production deployments must consider latency (how long each review takes) and cost (token consumption). We measured both across all 79 MRs for each approach.

Latency Analysis

Review duration varies significantly based on the approach used:

Latency Insights

BASELINE is 2x faster than augmented approaches (~44s vs ~80-99s).
GKG is faster than RAG (80.6s vs 90.8s) despite making more tool calls (3.27 vs 1.23) — local graph traversal outperforms network-based vector search.
GKG+RAG is slowest at ~99s, combining overhead of both methods with no quality benefit.
High variance exists across MRs (8s to 310s) depending on complexity and context needs.

Cost Analysis (Token Usage)

Token consumption directly impacts operational costs. We measured input and output tokens for each approach:

Cost estimates are illustrative, based on GPT-4 pricing ($0.03/1K input tokens, $0.06/1K output tokens), and intended to show relative differences between approaches rather than exact costs.
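For example, the GKG figure of $2.37/MR follows directly from the quoted numbers: ~75K input tokens, and taking 2K from the ~2-4K output range.

```python
# Illustrative per-MR cost arithmetic for GKG at GPT-4 list prices.
input_tokens, output_tokens = 75_000, 2_000
price_in, price_out = 0.03, 0.06  # $ per 1K tokens

cost = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
print(f"${cost:.2f}")  # → $2.37
```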

Cost Insights

BASELINE is 4x cheaper than augmented approaches ($0.58 vs $1.82-$2.37).
GKG is most expensive at $2.37/MR due to richer context from code graph queries.
GKG+RAG uses fewer tokens than GKG alone (57K vs 75K) — the combined approach is more selective, but this doesn’t translate to better quality.
Output tokens are similar across all approaches (~2-4K), indicating review length is consistent regardless of context method.

Cost-Quality Trade-off

When considering both quality and cost, the picture becomes clearer:

Bottom Line: If cost is a concern, use BASELINE – it’s 4x cheaper and faster with acceptable quality. If quality is paramount, use GKG – the 4x cost increase delivers measurable improvements. Avoid RAG and GKG+RAG – they cost more than baseline without delivering better results.

Conclusion

Our empirical evaluation across 79 merge requests confirms that RAG is not ideal for code review context retrieval. RAG performed worse than the baseline on nearly all metrics – including inline comments coverage, summary coverage, and score accuracy – demonstrating that adding noisy context can be counterproductive.

The structural nature of code – with its precise symbol definitions, references, and relationships – requires tools that understand code structure, not just semantic similarity. GKG’s AST-based approach provides the precision needed for effective code review, enabling the AI to accurately identify callers, understand function signatures, and trace code relationships.

From a cost-efficiency perspective, GKG justifies its 4x cost premium through measurable quality improvements. RAG, however, costs 3x more than baseline while delivering worse results – a clear anti-pattern for production deployments.

Important Caveats

These results should be interpreted with the following limitations in mind:

Golden set dependency: All metrics measure alignment with a curated set of expected outputs, not absolute review quality.
Precision interpretation: Higher precision may indicate fewer issues generated rather than more accurate issues.
LLM judge subjectivity: Semantic matching introduces variability in what counts as a “match”.
Sample size: 79 MRs provides directional guidance but may not capture all edge cases.

Key takeaway: For AI code review tools, invest in AST-aware code intelligence (like GKG) rather than relying on vector similarity search. When GKG isn’t available or cost is constrained, the baseline (diff-only) approach often outperforms RAG at a fraction of the cost.

What’s Next

This evaluation was a starting point, not an endpoint. With GKG validated as the right foundation, the focus shifts from “which approach” to “how well it works”.

Refining the reviewer with real developer signal. The evaluation is grounded in expert-annotated ground truth, but the ultimate measure of quality is whether developers find the feedback useful. The next step is closing that loop – sampling reviews across teams and gathering structured ratings from the engineers who received them. This turns the evaluation from a one-off experiment into a continuous feedback system.

GKG going GA. GKG is currently in beta, running as a sidecar that re-indexes the codebase on every pipeline run. When GitLab ships it as a native CI/CD feature with persistent indexing, the latency overhead drops significantly – making GKG viable at higher MR volumes without the current infrastructure cost. That’s when the cost story changes, and when wider rollout becomes the obvious next step.

AI as My Learning Partner: An Apprentice’s Perspective

One of the most helpful ways AI supports me is by assisting with referencing and research for academic writing. When working on written assignments or apprenticeship coursework, AI helps me quickly understand how to structure references correctly and identify credible academic sources. It saves time when creating and organising reference lists or bibliographies, ensuring that sources are consistently formatted and follow the same referencing style throughout.

By reducing the time spent organising references, searching for examples, or worrying about small technical details, AI allows me to work more efficiently. Rather than getting stuck on how to present references, I can focus my attention on understanding the subject, developing my arguments, and engaging more deeply with the content. This shift has not only improved my productivity but has also made academic writing feel less stressful, as the overall process feels clearer and more manageable from the start.


AI also supports my learning by helping to explain things clearly. If I come across a term, process, or concept I don’t fully grasp, I can use AI to get a clear and simple explanation. Often this involves breaking complex ideas down into steps or explaining them in a different way to how they were originally presented. This has been especially useful after workshops or classes, as it allows me to build understanding quickly and engage with the material more confidently.

Another key benefit is how AI helps with structure and clarity. Whether I’m planning a piece of work, organising my thoughts before starting a task, or improving something I’ve already written, AI acts as a second pair of eyes. It helps me refine what I’ve written so that my main points come across more clearly.

At Compare the Market (CtM), the internal courses and workshops on responsible and ethical use of AI have been particularly valuable. We’re encouraged to use AI thoughtfully and transparently, as a support rather than a shortcut. This includes being clear about when and how we use AI, carefully checking its outputs, and ensuring that our final work meets a high standard. This guidance has helped me build good habits early in my career and understand how AI fits into a professional environment.

Overall, AI has become a supportive tool in my apprenticeship journey. It’s helped me learn faster, feel more confident, and approach both work and study with less pressure. CtM’s guidance on responsible AI has helped me build skills that will be valuable both now and throughout my career.

AI as My Always-Available Pair Programmer: An Apprentice’s Perspective

It’s 3:45pm. Everyone’s in a meeting. I’m staring at a Spark error I’ve never seen before, and the stack trace is twenty lines of Java noise.

Six months ago, that would have meant an hour of trawling Stack Overflow, second‑guessing myself, and quietly hoping someone would come back online before end of day. Now? I open Cascade, paste the traceback, and within seconds I have a clear explanation, a likely root cause, and a suggested fix I can actually test.

That shift – from stuck to moving – is what AI tools have given me during my apprenticeship. Not a shortcut. A springboard.

A bit of context

I’m coming to the end of a Level 7 AI and Data Science apprenticeship at Compare the Market. My goal has been to transition into ML engineering – building and deploying models in production, not just training them in notebooks.

That means picking up a lot of new tooling fast: Spark, Delta Lake, feature stores, CI/CD pipelines, containerisation, API development. The list goes on.

The learning curve is real. And so, at times, is impostor syndrome.

How I actually use AI day‑to‑day

I lean on Cascade (Windsurf’s AI assistant) as my primary tool, integrated directly into my IDE. It’s not just autocomplete, it’s a genuine thinking partner. Here’s what a typical day looks like.

Context gathering across unfamiliar codebases

When I’m dropped into a repository I’ve never seen before, my first move is to point Cascade at it. It helps me map the structure, understand dependencies, and get a guided tour faster than any README ever could.

What would have taken an afternoon of careful reading can take minutes. For example, instead of hunting manually, I can ask it to identify where a dataset is produced and written (job config → transform module → Delta write), highlight where key parameters are set (partition keys, schema enforcement, environment config), and point me to the relevant tests and how they’re run in CI.

I still verify everything by reading the code, running locally, and checking behaviour against our pipelines but I get to the right neighbourhood much faster.


MCP servers for the repetitive stuff

We’ve set up MCP (Model Context Protocol) servers that connect Cascade to internal tools like Jira and Confluence. Writing tickets, updating documentation, and pulling context from existing pages (tasks that used to eat into deep‑focus time) can now be handled through a quick prompt in my editor.

It’s not glamorous but reclaiming those 10‑minute interruptions adds up.

Knowing when to switch models

Here’s something I’ve learned the hard way: not every AI is best at everything.

On one project, I was doing research and context gathering using Claude. It got me 80% of the way there, excellent at synthesis and reasoning. But when it came to a particular implementation step, it kept going in circles.

I switched to ChatGPT, reframed the problem, and it cracked it. Knowing which tool to reach for, and when, is a skill in itself.

What it’s actually done for my confidence

The biggest change isn’t productivity, it’s willingness to try.

Before, I’d hesitate before tackling something unfamiliar. Do I really understand enough to attempt this? Should I wait and ask someone tomorrow?

Now I just start. If I hit a wall, I have an always‑available collaborator that doesn’t judge, doesn’t get tired of explaining, and doesn’t mind if I ask the same question three different ways.

That safety net has made me bolder, and boldness compounds. The more you try, the more you learn. The more you learn, the less you need the safety net.

I want to be clear: AI doesn’t do the thinking for me. I still have to understand why a solution works, verify it against our codebase, write the tests, and own the outcome.

The build is mine. AI just provides the scaffolding while I’m constructing it.

Using AI responsibly at CtM

Compare the Market actively encourages AI adoption, but with guardrails, and I think that’s exactly right.

A few principles I follow:

Critical thinking first: AI output is a starting point, not a final answer. I regularly ask for edge cases and failure modes, and I sanity‑check suggestions against our code and our standards.

Data sensitivity: knowing what you can and can’t share with external AI tools is non‑negotiable. Internal data stays internal; I redact and paraphrase when I need help with a problem.

MCP as a model for responsible integration: rather than ad‑hoc AI usage, connecting tools like Jira and Confluence through MCP servers means AI operates within defined, auditable workflows. It’s structured, it’s intentional and it scales.


To anyone on the fence

If you’re early in your career, whether that’s an apprenticeship, a career change, or your first engineering role, AI tools are worth investing time in.

Not because they’ll do your job for you, but because they’ll help you learn faster, get unstuck sooner and build confidence in your own abilities.

The technology will keep evolving. But the core skill, knowing how to ask the right questions, evaluate the answers, and apply them thoughtfully, that’s timeless.

Start with a problem you’re stuck on. Open the chat. Ask.

You might be surprised how quickly you start moving.

We’re committed to tackling the cost of living crisis in the UK

We’ve all seen it in the news… the cost of living is rising across the UK and the impact on young people is worrying. We’ve teamed up with our new charity partner MyBnk — who help young people to take control of their finances — to better understand what the financial world means for young adults right now.

Here’s what they’re thinking…

With fuel costs set to hit all-time highs, the coming months are almost certainly going to be a challenge for many people. And while all sectors of society are going to be affected by rising living costs, young people will be particularly hard hit.

Almost 50% of young people have what we call ‘low financial resilience’, which means they are unable to absorb the shock of a random bill or unexpected expense. In addition to this, 84% of young people have received no financial education at all, an increase of 17% since the pandemic. This means that a whole generation of 18-year-olds are entering independent living without all the information and skills they need to successfully live alone. Combine all of this with an unprecedented global living costs crisis and we have the potential for a disaster amongst a large proportion of our society.

At MyBnk, we’re working hard to reverse this trend and build a society of well-informed, financially aware young people who are able to navigate all the systems they need to stay safe and well, no matter what their other challenges might be. We believe that a well-informed young person makes better financial choices, better understands their responsibilities and generally engages more with their money issues instead of burying their head in the sand.

We bring the world of money to life — through engaging materials, expert trainers and programmes of work tailored to all age ranges, we start to encourage people to talk about their money — breaking the taboo that you shouldn’t speak about money with strangers and therefore modelling that asking for help from creditors, or other advisory services, is normal and ok.

We make sure our work is relevant to the learners at the life stage they’re at, and that we’re clear and understood by young people when exploring complex topics such as taxation or the pros and cons of borrowing.

But we all have a collective responsibility to understand the challenges young people are facing as they begin to enter independence. Opening money conversations with young people means that they are more likely to ask for help when they need it — whether it’s because they don’t understand what they’re hearing in the news, or because they have made a money mistake and need some support.

Together with Compare the Market, we will start to break down barriers for young people, helping them to understand the current news and what a cost of living and energy crisis may mean for them.

The media is full of negative stories about how much today’s situation will hurt pockets up and down the country — and whilst this is true, we believe that early intervention, along with the right approach, can help our learners to be more confident about their situation and the challenges they face, preparing them for the turbulent road ahead as we face economic uncertainty.

So, what can you do?

– Book money management programmes for your young people. Our young adult money workshops, The Money House and Money Works, focus heavily on bills and living costs to help 16–25 year olds transition into independence. Available virtually and in person. You can request sessions here.

– Signpost young people to our free resources: We have a dedicated website section aimed at 16–25 year olds. We’ve also just launched season two of our online webcast, focused on the energy crisis.

– Start the conversation! Talking money is essential to break the taboo around the topic and encourage young people to seek help for financial issues. Tackling the topic at home or with the young people you work with is a great way to build their confidence and encourage further conversations.

Compare the Market are partnering with MyBnk to make great financial decision making a breeze for the next generation. The partnership will reach 70,000 young people across the UK with expert-led financial education.

Take a look at this video to find out more about our work together.

Young people receiving financial education

Meerkat Your Skills

Large organisations’ reliance on data and analytics is becoming increasingly visible in the workplace, and the skillsets required to analyse data through innovative methods are vitally important. Unfortunately, many charities and social programmes don’t have access to these kinds of skillsets.

comparethemarket.com (CtM) is acutely aware of this, and many of our data analysts take part in Meerkat Your Skills, part of our CSR programme aimed at using our skills to support good causes. The CtM Data team has many capabilities, including engineering, business intelligence, analytics, insights, optimisation, and data science. These teams work together to provide data solutions to the business. CtM prides itself on using data to make better decisions, and modern technologies such as Power BI and Databricks are entrenched in the business.

Our most recent event was held in December, where different teams from around the business including Data, Tech and Marketing supported a couple of charities and shared their expertise. CtM teams made the training and upskilling for each charity bespoke — to ensure it was relevant and actionable.

The Data team engaged with a charity called The Kite Trust, which supports the wellbeing of LGBTQ+ young people within Cambridgeshire and Peterborough. The ambassadors scoped out the requirements and completed project briefs. We learned that The Kite Trust needed help with understanding their geographic footprint and where there was opportunity for more group sessions.

The workday was successful and consisted of an initial meeting with the Data team volunteers and members of The Kite Trust to ensure the volunteers had enough context and understanding of the charity. After getting a good grasp on the data available and the project briefs, the volunteers set off working on their projects.


The team of analysts used mapping algorithms and spatial analytics to determine where access to community support groups was limited. These insights, generated with tools such as Databricks, would have been difficult to reach without a team of analysts. The Kite Trust will use these actionable recommendations to decide where support groups need to be enhanced or expanded.
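As a rough illustration of the kind of coverage analysis involved, here is a minimal sketch that flags areas whose nearest support group lies beyond a travel threshold. All place names, coordinates and the 20 km threshold are hypothetical, and the real work used richer tooling on Databricks:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical area centroids and existing support-group locations.
areas = {
    "Cambridge": (52.2053, 0.1218),
    "Peterborough": (52.5695, -0.2405),
    "Wisbech": (52.6660, 0.1604),
}
groups = [("Cambridge hub", 52.2050, 0.1220), ("Peterborough hub", 52.5700, -0.2400)]

def underserved(areas, groups, threshold_km=20):
    """Return areas whose nearest group is further away than the threshold."""
    flagged = []
    for name, (lat, lon) in areas.items():
        nearest = min(haversine_km(lat, lon, glat, glon) for _, glat, glon in groups)
        if nearest > threshold_km:
            flagged.append((name, round(nearest, 1)))
    return flagged

print(underserved(areas, groups))
```

With these made-up coordinates, only the area with no nearby hub is flagged, which is exactly the kind of output a charity can act on when deciding where to expand.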

It’s not news that companies engage with community outreach programmes. Holding events such as Meerkat Your Skills is very rewarding and leads to positive gains for both the charities and the employees. This was a great opportunity to enhance the teams’ skills: working on a new dataset, completing different types of projects, and collaborating with members of various teams within Data. We enjoyed the experience, learned from the charity, and are excited to do this again in the future.

Implementing a new data science and analytics platform Part 2

Recap.

In part 1, I introduced our journey towards the implementation of a data science and analytics platform. I explained that a data-driven company needs to consider many aspects, from hiring good talent to investing in a new data platform. We also went through a non-exhaustive list of requirements for a good data platform, which we used to shortlist two solutions for a POC: Databricks and AWS SageMaker.

Databricks is a software platform that helps its customers unify their analytics across the business, data science, and data engineering. It provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Find more details here.

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, then deploy them directly into production-ready hosted environments. Find more details here. SageMaker was already available internally, as CTM uses AWS as its cloud provider.


The POC ran for a month, during which we assessed the functionality of both solutions and validated them against each other, as well as against the current environment where relevant.

Methodology

The POC was divided into two main parts:

Architecture & DevOps assessment.

End-to-end testing.

Architecture & DevOps assessment

In this part, the focus was on the platform deployment and administration. We created an isolated AWS account, identical to the main account we use for our daily tasks. We then went ahead with the deployment of Databricks, which we found straightforward. The tests were evaluated against the following categories:

Deployment: How easy it is to deploy Databricks within AWS.

Administration: What features are available to the platform admin, and how effective they are.

Tools & Features: Whether the available tools cover all our daily tasks.

Performance: Query performance, job performance.

Integration with external services.

End-to-end testing

This is where each solution was tested in much finer detail, by developing and productionising a machine learning model with Databricks and SageMaker.

Since Databricks casts a wider net than just machine learning applications, we arranged a two-day hackathon that involved various teams within the Data function, working through a scripted task list predefined by representatives of each team (Insights, Analytics, Data Science, etc.).

This part was evaluated using a scorecard that rolled up into various categories such as:


Productivity & Workspace: ease of use, platform performance, stability of the environment.

Collaboration: Collaboration with other users, sharing results and dashboards.

Analytics: Data manipulation, visualisation, and data export.

Data Science: machine learning lifecycle management.

Note: the list above is not exhaustive, just a high-level overview.

Each team member was expected to score various tasks under each category. The scores were then discussed, to understand the reasoning behind them, and averaged where relevant to get an idea of which solution the team preferred (Databricks, SageMaker or the current way of working).
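Aggregating a scorecard like this can be as simple as averaging each member’s marks per solution and category. The sketch below uses hypothetical tasks and scores purely to illustrate the idea, not our actual results:

```python
from collections import defaultdict

# Hypothetical scorecard rows: (solution, category, task, per-member scores 1-5).
scores = [
    ("Databricks", "Collaboration", "share a notebook", [5, 4, 5]),
    ("Databricks", "Analytics", "export a dashboard", [4, 4, 3]),
    ("SageMaker", "Collaboration", "share a notebook", [3, 3, 4]),
    ("SageMaker", "Analytics", "export a dashboard", [3, 2, 3]),
]

def average_by_solution_and_category(scores):
    """Roll individual task scores up into an average per (solution, category)."""
    buckets = defaultdict(list)
    for solution, category, _task, marks in scores:
        buckets[(solution, category)].extend(marks)
    return {key: round(sum(v) / len(v), 2) for key, v in buckets.items()}

for (solution, category), avg in sorted(average_by_solution_and_category(scores).items()):
    print(f"{solution:10s} {category:13s} {avg}")
```

Averages smooth out individual outliers, but the discussion step matters just as much: a low score can reflect a missing feature or simply an unfamiliar workflow, and only talking it through tells you which.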

Implementing a new data science and analytics platform

How do you choose a strong solution for your business? Which platforms are best? There’s lots to consider:

Be more data driven! This is a sentence we hear more and more. The boom in big data technologies has opened the doors to possibilities we could never have imagined before, and the affordability of these solutions makes advanced analytics and data science available to all.

However, building a strong data science and analytics foundation requires many aspects to be taken into consideration and investments to be made:

Hire new talent.

Review the internal technology stack and potentially invest in new technologies.

Put in place a proper governance around data related activities.

Compare the Market started this journey many years ago, by hiring data specialists (data engineers, data analysts, data scientists, etc.) and implementing new infrastructure (initially Hadoop).

This has led to the rapid growth of our data activities, with many positive results (massive processing of unstructured data, machine learning at scale).

However, a few years ago we decided to build on this by implementing a unified data science and analytics platform, as this was easier to maintain, more cost-effective and more flexible for the work we were doing.

This was a long project, but after 10 months of work, it is now complete!

Over the course of a series of blog posts, I’ll share our learnings from the implementation.

I’ll cover:

Initial decisions.

Proof Of Concept.

Architecture design.

Implementation & Onboarding.

Note: What we are sharing in this blog is not THE way to implement a data science and analytics platform, but a solution that was fitting our context.

The problem

Although the business had invested in many data tools, there was no enterprise platform in place for delivering advanced analytics and data science. Given the growing size of the team and the number of incoming projects, we decided to look for a solution that would enable collaboration between team members and allow us to take on new projects.


A few of our requirements were:

Scalability: Not just in terms of kit, but also the scalability of the team to take on more projects, including upskilling, onboarding and collaborating.

End-to-end functionality: Ability to tackle various tasks on the platform, through standardised methods/kit, without depending on external resources.

Collaboration: Facilitate joint work on a project.

Skills gap: A platform providing skill stacks able to cover the major roles (from data analytics to data science to machine learning engineering).

PII / Sensitive Data controls: Meet the data governance and security requirements.

As a result, we reviewed and engaged several vendors to explore market offerings and find a suitable partner to help us deliver the new platform.

We did extensive research on the vendors listed in Gartner’s latest Magic Quadrant for Data Science and ML Platforms, and added some others we had interacted with during the last couple of years.

Some of the leaders in the magic quadrant were ruled out due to:

Op model: Proprietary software/licensing with high pricing, vendor lock-in and/or skew to on-premises infrastructure that would be inflexible to changes in our internal infrastructure.

Performance on key capabilities such as collaboration, advanced analytics & MLOps, and scalability.

Finally, we landed on two options for a POC:

AWS SageMaker: An already-available solution, as CTM uses AWS as its cloud provider.

Databricks: For the list of features it provides.

The POC ran for one month, during which we developed and tested the functionality available on each platform. This is the topic of the next post in the series, where we will see how we organised the POC to ensure an objective result and make an informed decision on the way forward.