
How to use GraphRAG for Economic Data Analysis (Tutorial)

By Data Sense Team

This tutorial details how to build a GraphRAG (Graph-based Retrieval-Augmented Generation) system for economic data analysis, focusing on combining World Bank data with unstructured reports.

Introduction

In today’s data-driven world, economic analysts are inundated with information in many forms, which makes it challenging to extract insights that are scattered across structured databases and unstructured documents. While the World Bank’s World Development Indicators (WDI) provide rich quantitative data, the context and explanations for economic trends often lie within IMF reports, OECD analyses, and policy papers. Traditional Retrieval-Augmented Generation (RAG) systems struggle to connect these disparate information sources effectively.

GraphRAG, or Ontology-based RAG (OBR), improves on this approach by using knowledge graphs to create explicit relationships between structured data points and unstructured textual information. This integration enables sophisticated economic analysis that would otherwise be difficult and time-consuming with conventional methods.

Why GraphRAG for Economic Analysis?

Limitations of Traditional RAG

Traditional RAG systems face several challenges when dealing with economic data:

  1. Relationship Blindness: Questions like “What economic policies contributed to Brazil’s GDP growth in 2020, and how do they compare to Argentina’s approach?” require understanding complex relationships between countries, policies, and indicators that traditional RAG cannot easily traverse.
  2. Context Fragmentation: Economic indicators in isolation provide limited insight. Understanding why inflation spiked in a particular country requires connecting quantitative data with policy decisions, external shocks, and historical context found in reports.
  3. Multi-hop Reasoning: Analyzing regional economic patterns or policy spillover effects requires connecting multiple data points and documents that may not be explicitly linked in traditional systems.

Think of it as creating a comprehensive map of economic relationships. Traditional methods are like having separate city maps for different neighborhoods, which is useful individually but lacking the connections between areas. GraphRAG creates the complete metropolitan map, showing how economic indicators in one country relate to policy decisions, how regional trends connect across borders, and how institutional analyses provide context for quantitative patterns.

GraphRAG Advantages

GraphRAG addresses these limitations through several key innovations:

  1. Creating Explicit Relationships: Connecting countries, indicators, time periods, policies, and events in a structured graph.
  2. Enabling Complex Queries: Supporting questions that require traversing multiple relationships and data sources.
  3. Providing Provenance: Offering clear paths from questions to source data and documents.

Now let’s get to the fun part: building a GraphRAG system that combines real-world data with economic and policy analysis to answer complex questions.

Tutorial: Building an Economic Analysis GraphRAG System

Let’s build a practical GraphRAG system that combines World Bank WDI data with unstructured economic reports and analyses. If you’d like to follow along with the demo, you can find the GitLab repo here: GraphRAG Tutorial Repo.

Prerequisites and Setup

Before we begin, let’s understand what tools we’ll be using and why:

  • Neo4j: Graph database for storing entities and relationships.
  • Qdrant: Vector database for semantic search over documents (alternatives: Milvus, Weaviate, Elasticsearch).
  • spaCy: Natural language processing for entity extraction.
  • LangChain: Framework for local LLM integration and text processing; we will also show the Google Cloud integration to use Gemini.
  • World Bank API: Source for structured economic data.

bash
# Install required packages
pip install neo4j pandas requests langchain langchain-ollama python-dotenv spacy transformers sentence-transformers qdrant-client google-genai pdfplumber

# download the spaCy english model pipeline
python -m spacy download en_core_web_sm


Best Practices for File Organization and Environment Configuration

While this tutorial consolidates all code into a single file for simplicity, production applications should follow modular design principles. Splitting your project into separate files improves debugging, testing, and maintainability. These are essential practices for scalable data applications.

For a real-world example of proper file organization, check out the Data Sense GitLab repository, which demonstrates how to structure your project files effectively.

Managing API Keys and Configuration with Environment Variables

This project uses environment variables to securely store sensitive information like API keys and database URLs. We’ve implemented the dotenv library to load configuration data from an environment file, which should contain:

  • Neo4j database credentials and connection URLs
  • Qdrant vector database API keys and endpoints
  • Google API keys and service URLs

Pro tip: Whether you’re using cloud services or local development environments, you can easily switch between configurations by updating the values in your environment file—no code changes required.

Throughout this tutorial, you’ll see these environment variables accessed using the following code pattern:

Python
os.getenv("VARIABLE")
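For reference, a minimal environment file for this project might look like the following. The variable names match those used in the code throughout this tutorial; the values (and the Ollama model names) are placeholders you would replace with your own credentials and preferred models:

bash
# .env - placeholder values, replace with your own credentials
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_neo4j_password
QDRANT_ENDPOINT=https://your-qdrant-instance.example.com
QDRANT_API_KEY=your_qdrant_api_key
GOOGLE_API_KEY=your_google_api_key
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1
OLLAMA_EMBEDDINGS_MODEL=nomic-embed-text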

Step 1: Environment Setup

Import Libraries

Python
import pandas as pd
import numpy as np
from neo4j import GraphDatabase
import spacy
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain.text_splitter import RecursiveCharacterTextSplitter
from google import genai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import requests
import os
from typing import List, Dict, Tuple
import json
from datetime import datetime
import re

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

Setting Up Gemini API for Your Project

To use Gemini, install the Google Cloud CLI and connect it to your Google Cloud project. Follow the official Google Cloud CLI installation guide for detailed setup instructions.

This tutorial requires the Gemini API for text processing and embeddings. While Vertex AI offers additional embedding models, we’ll use Gemini’s standard embedding model for simplicity and compatibility.

Quick Setup Steps:

  • Install Google Cloud CLI
  • Configure gcloud with your project credentials
  • Enable Gemini API access
  • Obtain your API key for authentication

Initialize our Core Components

Python
# spaCy for Natural Language Processing (NLP) for entity extraction from documents
nlp = spacy.load("en_core_web_sm")

# Google client for Google's embedding model and Gemini LLM
google_client = genai.Client(api_key=os.getenv('GOOGLE_API_KEY'))


# Local embeddings and LLM with Ollama
ollama_embeddings = OllamaEmbeddings(model=os.getenv('OLLAMA_EMBEDDINGS_MODEL'), base_url=os.getenv('OLLAMA_BASE_URL'))

ollama_llm = OllamaLLM(model=os.getenv('OLLAMA_MODEL'), base_url=os.getenv('OLLAMA_BASE_URL'), temperature=0.1)  # Set the temperature low for more factual, consistent responses

# Database Connections
# Neo4j for graph storage; stores entities and relationships 
neo4j_driver = GraphDatabase.driver(
os.getenv('NEO4J_URL'), auth=(os.getenv("NEO4J_USER"), os.getenv("NEO4J_PASSWORD"))
)

# Qdrant for vector storage; enables semantic search over documents
qdrant_client = QdrantClient(url=os.getenv('QDRANT_ENDPOINT'), port=6333, api_key=os.getenv('QDRANT_API_KEY'))


Why this setup?

  • Dual storage approach: Graph database for structured relationships, vector database for semantic similarity.
  • Low-temperature LLM: Reduces hallucinations for factual economic analysis.
  • Compact embedding model: Balances quality with performance for production use.

Step 2: Collecting Structured Data from World Bank

Building a World Bank Data Collector for Economic Analytics

Now let’s create a data collector for the World Bank’s World Development Indicators (WDI), which will serve as our primary source of structured economic data. The WDI database contains over 1,400 time series indicators covering global development metrics including GDP, population, education, health, and environmental data across 200+ countries.

Understanding the World Bank API

The World Bank API provides access to thousands of economic indicators. Each API call follows this pattern:

  • Base URL: https://api.worldbank.org/v2
  • Structure: /country/{country_code}/indicator/{indicator_code}
  • Format: JSON responses with metadata and data arrays
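For example, fetching GDP growth for Brazil from 2010 through 2024 follows this pattern. Here is a quick way to try it, using the same requests library and query parameters as the collector below:

Python
import requests

# Example: GDP growth (annual %) for Brazil, 2010-2024, returned as JSON
url = "https://api.worldbank.org/v2/country/BR/indicator/NY.GDP.MKTP.KD.ZG"
response = requests.get(url, params={"date": "2010:2024", "format": "json", "per_page": 1000})

# The API returns a two-element list: [metadata, data]
metadata, records = response.json()
print(records[0])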

Building the Data Collector

Python
class WorldBankDataCollector:
    """
    Collects data from World Bank WDI API.

    This class handles API calls, error handling, and data normalization to create clean pandas dataframes for graph
    construction.
    """

    def __init__(self):
        self.base_url = "https://api.worldbank.org/v2"

    def get_indicators(self, countries: List[str], indicators: List[str],
                       start_year: int = 2010, end_year: int = 2024) -> pd.DataFrame:
        """
        Get the WDI data for specified countries and indicators.

        :param countries: List of countries to collect data for.
        :param indicators: List of indicators to collect data for.
        :param start_year: Beginning year to collect data for.
        :param end_year: Ending year to collect data for.
        :return: Pandas DataFrame with columns: country_code, country_name, indicator_code, indicator_name, year, value
        """
        all_data = []

        # Iterate through countries and indicators to construct API urls
        for country in countries:
            for indicator in indicators:
                url = f"{self.base_url}/country/{country}/indicator/{indicator}"

                params = {
                    'date': f"{start_year}:{end_year}",  # The World Bank API uses the 'date' parameter for year ranges
                    'format': 'json',
                    'per_page': 1000,
                }

                try:
                    response = requests.get(url, params=params)
                    data = response.json()

                    # WB WDI API returns [metadata, data]
                    if len(data) > 1 and data[1]:
                        for row in data[1]:
                            # Extract relevant data
                            all_data.append({
                                'country_code': row['country']['id'],
                                'country_name': row['country']['value'],
                                'indicator_code': row['indicator']['id'],
                                'indicator_name': row['indicator']['value'],
                                'year': row['date'],
                                'value': row['value']
                            })
                except requests.exceptions.RequestException as e:
                    print(f"Error fetching {indicator} for {country}: {e}")

        return pd.DataFrame(all_data)

Selecting Key Economic Indicators for Country Analysis

Next, we’ll define a curated list of economic indicators that provide comprehensive insights into a country’s economic performance. The following key indicators offer a well rounded view of economic health and development trends:

Python
# Initialize our data collector
wb_collector = WorldBankDataCollector()

# Key economic indicators - these codes represent specific WDI metrics
indicators = [
    'NY.GDP.MKTP.KD.ZG',  # GDP growth (annual %) - economic growth
    'FP.CPI.TOTL.ZG',     # Inflation, consumer prices (annual %) - price stability
    'SL.UEM.TOTL.ZS',     # Unemployment, total (% of total labor force) - labor market
    'NE.TRD.GNFS.ZS',     # Trade (% of GDP) - economic openness
    'GC.DOD.TOTL.GD.ZS'   # Central government debt, total (% of GDP) - fiscal health
]

# Focus on Latin American countries for this example
# Using ISO 2-letter country codes
countries = [
    'BR',  # Brazil
    'AR',  # Argentina  
    'CL',  # Chile
    'CO',  # Colombia
    'MX',  # Mexico
    'PE'   # Peru
]

We can test the WDI data collector with the following code to make sure it’s working and get a preview of the data.

Python
# Fetch the data from the World Bank API
wdi_data = wb_collector.get_indicators(countries, indicators)

# Clean the data
wdi_data = wdi_data.dropna(subset=['value'])

# Print a preview
print("\nSample data:")
print(wdi_data.head())
print(wdi_data.describe())

Why these indicators?

  • GDP Growth: Shows economic expansion/contraction over time.
  • Inflation: Indicates monetary policy effectiveness and price stability.
  • Unemployment: Reflects labor market health and social conditions.
  • Trade: Shows economic integration and competitiveness.
  • Government Debt: Indicates fiscal sustainability and policy space.

Data Quality Considerations

When working with World Bank economic indicators, several data quality factors require attention to ensure accurate analysis.

Handling Missing Data – World Bank datasets often contain gaps for specific countries or years. Rather than removing incomplete records, consider these data imputation techniques:

  • Forward fill (ffill) – carries last known value forward
  • Backward fill (bfill) – uses next available value to fill gaps
  • Rolling averages (rolling) – smooths data using neighboring time periods

While this tutorial removes NaN values for simplicity, production analyses should evaluate which imputation method best fits your dataset’s characteristics.
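As a rough sketch, the pandas equivalents of these techniques look like the following, applied per country-indicator series; which method is appropriate depends on the indicator and the nature of the gaps:

Python
# Sort so fills operate along the time dimension within each country-indicator series
wdi_data = wdi_data.sort_values(['country_code', 'indicator_code', 'year'])
grouped = wdi_data.groupby(['country_code', 'indicator_code'])['value']

# Forward fill: carry the last known value forward
wdi_data['value_ffill'] = grouped.ffill()

# Backward fill: use the next available value to fill gaps
wdi_data['value_bfill'] = grouped.bfill()

# Rolling average: smooth using neighboring periods (3-year window here)
wdi_data['value_rolling'] = grouped.transform(lambda s: s.rolling(window=3, min_periods=1).mean())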

Data Revision and Versioning – World Bank indicators undergo regular revisions as new information becomes available. Timestamping your data collections ensures reproducibility and tracks when specific values were captured.

Cross-Country Methodology Differences – Economic indicators may use varying calculation methodologies across countries, potentially limiting direct comparability. For this tutorial, we assume consistent methodology across our selected countries, though real-world analyses should account for these methodological differences when drawing conclusions.

Step 3: Processing Unstructured Documents

Integrating Unstructured Economic Data Sources

Now we’ll incorporate unstructured data sources including IMF reports, OECD analyses, and economic research papers. The primary challenge involves breaking these documents into meaningful chunks while preserving contextual relationships. These processed chunks become vector embeddings stored in our database, enabling efficient semantic search and retrieval.

Understanding Document Chunking for RAG Systems

Document chunking plays a critical role in GraphRAG implementation for several key reasons:

Token Limit Management – Large Language Models have strict context windows, requiring documents to be divided into digestible segments that fit within token constraints.

Enhanced Semantic Search – Chunking complete thoughts or concepts rather than arbitrary text blocks improves retrieval accuracy by maintaining semantic coherence within each segment.

Precise Information Retrieval – Smaller, focused chunks enable more targeted searches, allowing the system to surface exactly relevant information rather than entire documents.

Key Chunking Decisions

  • Chunk Size (1000 chars): Large enough for context, small enough for precision
  • Overlap (200 chars): Prevents important information from being split across chunks
  • Separator Priority: Preserves document structure by preferring paragraph breaks
  • Metadata Preservation: Each chunk retains source information for provenance

Defining the Document Collection and Chunking Class

Below we define our class to collect and chunk documents from various economic data sources.

Python
class DocumentCollector:
    """
    Processes unstructured documents into chunks suitable for graph construction
    and vector embedding.
    
    The chunking strategy balances between maintaining semantic coherence
    and keeping chunks small enough for effective retrieval.
    """
    
    def __init__(self):
        # RecursiveCharacterTextSplitter tries different separators in order
        # This preserves document structure better than simple character splitting
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,        # Target chunk size in characters
            chunk_overlap=200,      # Overlap prevents context loss at boundaries
            separators=[            # Try these separators in order:
                "\n\n",             # Paragraph breaks (preferred)
                "\n",               # Line breaks
                ". ",               # Sentence endings
                "! ",               # Exclamations
                "? "                # Questions
            ]
        )
    
    def process_document(self, text: str, doc_metadata: Dict) -> List[Dict]:
        """
        Split document into chunks with preserved metadata.
        
        Args:
            text: Raw document text content
            doc_metadata: Document information (source, title, date, etc.)
            
        Returns:
            List of chunk dictionaries with content and metadata
        """
        # Split text into chunks using our strategy
        chunks = self.text_splitter.split_text(text)
        
        processed_chunks = []
        for i, chunk in enumerate(chunks):
            # Create chunk with unique ID and inherited metadata
            chunk_data = {
                'content': chunk,
                'doc_id': doc_metadata.get('doc_id'),
                'chunk_id': f"{doc_metadata.get('doc_id')}_chunk_{i}",
                'source': doc_metadata.get('source'), # IMF, OECD, etc.
                'title': doc_metadata.get('title'),
                'date': doc_metadata.get('date', ''),
                'doc_type': doc_metadata.get('doc_type', 'report')
            }
            processed_chunks.append(chunk_data)
        
        return processed_chunks

Sample Economic Documents for Analysis

For this tutorial, we’ll use sample economic documents that represent typical data sources encountered in financial analysis. These documents mirror real-world formats from institutions like the IMF, OECD, and central banks.

The GitLab repository and YouTube tutorial include an OCR function for PDF processing, enabling extraction of text from scanned economic reports and research papers.

Python
# Example documents - in practice, you'd load these from PDFs or web scraping
sample_documents = [
    {
        'doc_id': 'imf_brazil_2022',
        'title': 'IMF Article IV Consultation: Brazil 2022',
        'source': 'IMF',
        'date': '2022-07-15',
        'doc_type': 'country_report',
        'content': """
            Brazil's economy showed resilience in 2021-2022 despite global challenges. GDP growth reached 4.6% in 2021, supported by fiscal stimulus and commodity prices. However, inflation pressures emerged, reaching 10.1% by end-2021, prompting aggressive monetary tightening by the Central Bank. The fiscal situation remains challenging with government debt at 88% of GDP. Key structural reforms in labor markets and pension systems have supported medium-term growth prospects. External vulnerabilities remain contained with adequate international reserves and flexible exchange rate regime.
        """
    },
    {
        'doc_id': 'oecd_latam_2023',
        'title': 'OECD Economic Outlook: Latin America 2023',
        'source': 'OECD',
        'date': '2023-03-20',
        'doc_type': 'regional_analysis',
        'content': """
            Latin American economies face headwinds from global financial tightening and China's slowdown. Regional growth is projected to slow to 1.3% in 2023. Argentina continues to grapple with high inflation exceeding 100% and currency pressures. Chile's economy contracted due to social unrest impacts and mining sector challenges. Mexico benefits from nearshoring trends and strong US demand. Structural challenges include low productivity growth, income inequality, and climate adaptation needs across the region.
        """
    }
]

Processing Documents into Chunks

Using our sample documents, we’ll process them into chunks with the DocumentCollector class and display a preview to understand the chunk structure and segmentation results.

Python
# Process all documents into chunks
doc_collector = DocumentCollector()
all_chunks = []

print("Processing documents into chunks...")
for doc in sample_documents:
    # Split each document and add to our collection
    chunks = doc_collector.process_document(doc['content'], doc)
    all_chunks.extend(chunks)
    print(f"Document '{doc['title']}' split into {len(chunks)} chunks")

print(f"\nTotal chunks created: {len(all_chunks)}")

# Preview a chunk to understand the structure
print("\nSample chunk:")
sample_chunk = all_chunks[0]
for key, value in sample_chunk.items():
    if key == 'content':
        print(f"{key}: {value[:100]}...")  # Truncate content for display
    else:
        print(f"{key}: {value}")

Example Chunk from IMF Report Processing

Below is an example chunk extracted from IMF reports using the pdfplumber library for PDF text extraction, demonstrating the chunk structure from real economic documents.

Console Output
Sample chunk:
content: opportunity to test innovations. In line with industry trends to incorporate technology and
business model innovations, it is recommended that regulators consider adjusting the testing
and risk monito...
doc_id: 1THAEA2019001-1.pdf
chunk_id: 1THAEA2019001-1.pdf_chunk_164
source: IMF
title: Thailand: Financial System Stability Assessment; IMF Country Report No. 19/308; September 10, 2019 
date: D:20191002123429-04'00'
author: 
doc_type: Report

Production-Ready Document Processing Enhancements

In a production system, leverage pdfplumber for text extraction from PDF documents (scanned reports additionally require an OCR step). Alternatively, vision LLMs can handle complex document layouts, and we’ve included this functionality in the repository.
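As a minimal sketch, and assuming text-based rather than scanned PDFs, extracting raw text with pdfplumber before chunking could look like this:

Python
import pdfplumber

def extract_pdf_text(pdf_path: str) -> str:
    """Extract raw text from a text-based PDF, page by page."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for pages with no extractable text
            pages.append(page.extract_text() or "")
    return "\n\n".join(pages)

# The resulting text can then be passed to DocumentCollector.process_document() along with metadata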

For automated document collection, implement a web scraping class to gather reports directly from institutional websites like the IMF, World Bank, or OECD.

Document Types to Consider

  • Country Reports: IMF Article IV consultations and World Bank country studies.
  • Regional Analyses: OECD regional outlooks and regional development bank reports.
  • Policy Papers: Central bank communications and ministry publications.
  • Research Papers: Academic studies and think tank analyses.
  • News Articles: Financial Times, The Economist, Reuters, and Bloomberg articles.

Step 4: Designing the Knowledge Graph Schema

Designing the Graph Schema for Economic Analysis

The graph schema forms the foundation of our GraphRAG system, defining entity types and their relationships. A well-designed schema enables complex economic queries while maintaining intuitive navigation. We’ll use Neo4j to build our knowledge graph structure.

Core Entity Types for Economic Analysis

We’ll define our primary entity types based on economic analysis requirements, creating nodes and relationships that capture meaningful connections between economic concepts:

Python
"""
Economic Knowledge Graph Schema

ENTITIES (Nodes):
- Country: Geographic entities (Brazil, Argentina, etc.)
- Indicator: Economic metrics (GDP Growth, Inflation, etc.)  
- Year: Time periods for data points
- DataPoint: Specific country-indicator-year-value combinations
- Document: Report chunks from IMF, OECD, etc.
- EconomicConcept: Economic themes (fiscal policy, monetary policy, etc.)
- Event: Economic events (financial crisis, policy reform, etc.)

RELATIONSHIPS (Edges):
- (Country)-[:HAS_DATA_POINT]->(DataPoint)
- (DataPoint)-[:MEASURES]->(Indicator)
- (DataPoint)-[:FOR_YEAR]->(Year)
- (Document)-[:MENTIONS]->(Country)
- (Document)-[:DISCUSSES]->(EconomicConcept)
- (Document)-[:DESCRIBES]->(Event)
"""

Building the Graph Constructor

First, we establish database constraints for our Neo4j graph to ensure data integrity and optimize query performance:

Python

class EconGraphBuilder:
    """
    Constructs and populates the economic knowledge graph.

    Class handles:
    1) setting up graph constraints for data integrity.
    2) Creating nodes from structured WDI data.
    3) Extracting entities from unstructured documents.
    4) Building relationships between all entities.
    """

    def __init__(self, neo4j_driver = neo4j_driver):
        self.driver = neo4j_driver
        self.setup_constraints()

    def setup_constraints(self):
        """
        Create unique constraints to prevent duplicate entities.

        Constraints ensure data integrity and improve query performance by creating indexes on frequently accessed properties.
        """
        constraints = [
            # Each country has a unique code (BR, AR, etc.)
            "CREATE CONSTRAINT country_code IF NOT EXISTS FOR (c:Country) REQUIRE c.code IS UNIQUE",

            # Each indicator has a unique code (NY.GDP.MKTP.KD.ZG, etc.)
            "CREATE CONSTRAINT indicator_code IF NOT EXISTS FOR (i:Indicator) REQUIRE i.code IS UNIQUE",

            # Each year is unique
            "CREATE CONSTRAINT year_value IF NOT EXISTS FOR (y:Year) REQUIRE y.value IS UNIQUE",

            # Each document chunk has unique ID
            "CREATE CONSTRAINT document_id IF NOT EXISTS FOR (d:Document) REQUIRE d.chunk_id IS UNIQUE"
        ]

        with self.driver.session() as session:
            for constraint in constraints:
                try:
                    session.run(constraint)
                    print(f"✓ Created constraint: {constraint.split('(')[1].split(')')[0]}")
                except Exception as e:
                    print(f"⚠ Constraint might already exist: {e}")

Creating Nodes from Structured Data

Next, we’ll implement a helper function to categorize our economic indicators, enabling better organization and navigation within our knowledge graph:

Python
    def _categorize_indicator(self, indicator_code: str) -> str:
        """Categorize indicators for better organization"""
        if 'GDP' in indicator_code or 'MKTP' in indicator_code:
            return 'Growth'
        elif 'CPI' in indicator_code or 'INF' in indicator_code:
            return 'Inflation'
        elif 'UEM' in indicator_code or 'EMP' in indicator_code:
            return 'Employment'
        elif 'TRD' in indicator_code or 'EXP' in indicator_code or 'IMP' in indicator_code:
            return 'Trade'
        elif 'DOD' in indicator_code or 'DEBT' in indicator_code:
            return 'Fiscal'
        else:
            return 'Other'

Populating Knowledge Graph Nodes

We’ll now define the function to populate our knowledge graph nodes using the World Bank economic data collected earlier:

Python
    def create_structured_nodes(self, wdi_data: pd.DataFrame):
        """
        Create nodes and relationships from World Bank WDI data.

        This method transforms tabular data into a graph structure:
        Country -> DataPoint -> Indicator
                     |
                     v
                   Year

        """
        print("Creating structured data nodes...")

        with self.driver.session() as session:

            # 1) Create country nodes
            countries = wdi_data[['country_code', 'country_name']].drop_duplicates()
            print(f"Creating {len(countries)} country nodes...")
            for _, row in countries.iterrows():
                session.run(
                    """
                        MERGE (c:Country {code: $code})
                        SET c.name = $name
                    """,
                    code=row['country_code'],
                    name=row['country_name']
                )

            # 2) Create indicator nodes
            indicators = wdi_data[['indicator_code', 'indicator_name']].drop_duplicates()
            print(f"Creating {len(indicators)} indicator nodes...")
            for _, row in indicators.iterrows():
                session.run(
                    """
                        MERGE (i:Indicator {code: $code})
                        SET i.name = $name, i.category = $category
                    """,
                    code=row['indicator_code'],
                    name=row['indicator_name'],
                    category=self._categorize_indicator(row['indicator_code'])
                )

            # 3) Create Year Nodes
            years = wdi_data['year'].dropna().unique()
            print(f"Creating {len(years)} year nodes...")
            for year in years:
                session.run(
                    """
                        MERGE (y:Year {value: $year})
                    """,
                    year=int(year)
                )

            # 4) Create DataPoint nodes and relationships
            # -- filter out null values for clean data
            valid_data = wdi_data.dropna(subset=['value'])
            print(f"Creating {len(valid_data)} data points with relationships...")

            for counter, (_, row) in enumerate(valid_data.iterrows()):
                try:
                    # Create DataPoint with explicit property setting
                    session.run(
                        """
                        MATCH (c:Country {code: $country_code})
                        MATCH (i:Indicator {code: $indicator_code})
                        MATCH (y:Year {value: $year})

                        MERGE (dp:DataPoint {
                            country_code: $country_code,
                            indicator_code: $indicator_code,
                            year: $year
                        })
                        SET dp.value = $value,
                            dp.last_updated = datetime()

                        MERGE (c)-[:HAS_DATA_POINT]->(dp)
                        MERGE (dp)-[:MEASURES]->(i)
                        MERGE (dp)-[:FOR_YEAR]->(y)
                        """,
                        country_code=row['country_code'],
                        indicator_code=row['indicator_code'],
                        year=int(row['year']),
                        value=float(row['value'])
                    )
                except Exception as e:
                    print(f"Error processing row {counter}: {e}")
                    print(f"Row data: {dict(row)}")
                    break

                # Progress indicator for large datasets
                if counter % 100 == 0:
                    print(f"   Processed {counter}/{len(valid_data)} data points...")

            # Final progress update
            print(f"   Processed {len(valid_data)}/{len(valid_data)} data points...")

        print("✓ Structured data nodes created successfully")

Why this structure?

Key Design Decisions

  • DataPoint as Central Entity: Represents the many-to-many relationship between countries, indicators, and years.
  • Categorical Organization: Indicators are categorized for easier navigation and filtering.
  • Timestamp Tracking: last_updated field helps with data freshness tracking.
  • Flexible Value Storage: Values stored as floats accommodate various economic metrics.

Creating Document Nodes for Unstructured Data

We still need to incorporate document nodes from our unstructured data sources into the knowledge graph. First, we’ll define a helper function to extract entities from text using spaCy’s natural language processing capabilities to identify relevant economic entities within documents.

Python
    def extract_entities_from_text(self, text: str) -> Tuple[List[str], List[str], List[str]]:
        """Extract countries, economic terms, and events from text"""
        doc = nlp(text)  # Reuse the spaCy pipeline loaded in Step 1 rather than reloading it on every call

        countries = []
        economic_terms = []
        events = []

        # Known countries to aid detection: our analysis countries plus other major economies (lowercased for matching)
        known_countries = {c.lower() for c in [
            'Brazil', 'Argentina', 'Chile', 'Colombia', 'Mexico', 'Peru', 'United States', 'United Kingdom',
            'China', 'Germany', 'Japan', 'US', 'UK', 'France', 'Australia', 'Russia'
        ]}

        # Predefined economic terms to look for extraction
        economic_keywords = {
            'gdp', 'inflation', 'unemployment', 'fiscal', 'monetary', 'debt', 'growth', 'recession', 'stimulus',
            'reform', 'trade', 'exports', 'imports', 'deficit', 'surplus', 'policy', 'central bank', 'interest rates'
        }

        for ent in doc.ents:
            if ent.label_ == "GPE":  # Geopolitical entities
                if ent.text.lower() in known_countries:
                    countries.append(ent.text)
            elif ent.label_ in ["ORG", "EVENT"]:
                events.append(ent.text)

        # Extract economic terms
        for token in doc:
            if token.text.lower() in economic_keywords:
                economic_terms.append(token.text.lower())

        return list(set(countries)), list(set(economic_terms)), list(set(events))

Building Document Nodes and Relationships

Now we’ll create the function that generates document nodes for our knowledge graph. This process creates nodes for each text chunk and establishes relationships based on entity mentions including countries, economic concepts, organizations, and events.

The system uses spaCy’s named entity recognition to match labels within each chunk. For economic concepts, we define a curated list of relevant terms that help identify chunks containing specific economic discussions, enabling precise relationship mapping between documents and structured data.

Python
    def create_document_nodes(self, chunks: List[Dict]):
        """Create document nodes and extract entities"""
        print("Processing document chunks into nodes...")
        with self.driver.session() as session:
            for i, chunk in enumerate(chunks):
                # Create document chunk node
                session.run(
                    """
                        MERGE (d:Document {doc_id: $doc_id})
                        SET d.title = $title, 
                            d.content = $content,
                            d.chunk_id = $chunk_id
                    """,
                    **chunk
                )

                # Extract and link entities
                countries, economic_terms, events = self.extract_entities_from_text(chunk["content"])

                # Link countries
                for country in countries:
                    session.run(
                        """
                            MATCH (d:Document {chunk_id: $chunk_id})
                            MATCH (c:Country {name: $country})
                            MERGE (d)-[:MENTIONS]->(c)
                        """,
                        chunk_id=chunk["chunk_id"],
                        country=country
                    )

                # Create economic concept nodes
                for term in economic_terms:
                    session.run(
                        """
                            MATCH (d:Document {chunk_id: $chunk_id})
                            MERGE (ec:EconomicConcept {name: $term})
                            MERGE (d)-[:DISCUSSES]->(ec)
                        """,
                        chunk_id=chunk["chunk_id"],
                        term=term
                    )

                # Create event nodes
                for event in events:
                    session.run(
                        """
                            MATCH (d:Document {chunk_id: $chunk_id})
                            MERGE (e:Event {name: $event})
                            MERGE (d)-[:DESCRIBES]->(e)
                        """,
                        chunk_id=chunk["chunk_id"],
                        event=event
                    )

                # Progress tracking
                if i % 100 == 0 or i == len(chunks) - 1:  # Periodic progress update, plus one for the final chunk
                    entities_found = len(countries) + len(economic_terms) + len(events)
                    print(f"Processed chunk {i + 1}/{len(chunks)} - Found {entities_found} entities")

        print("Document nodes and entity relationships created successfully")

Step 5: Setting up the Vector Store

Congratulations if you’ve made it this far; we’re almost done! If you’re enjoying the tutorial, please like and subscribe to our posts on LinkedIn and YouTube. We appreciate your support!

Setting Up the Vector Store

For vector storage, we’ll use Qdrant vector database to enable semantic search across all document chunks. The system employs cosine similarity to measure text similarity between chunks, allowing for precise retrieval of contextually relevant economic information.

Python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from langchain_ollama import OllamaEmbeddings
from google import genai
from google.genai.errors import ServerError
import os
import pickle
import time
from pathlib import Path
from typing import List, Dict
from dotenv import load_dotenv
from httpx import RemoteProtocolError

load_dotenv()

# Initialize the qdrant client
qdrant_client = QdrantClient(url=os.getenv('QDRANT_ENDPOINT'), port=6333, api_key=os.getenv('QDRANT_API_KEY'))
ollama_embeddings = OllamaEmbeddings(model=f"{os.getenv('OLLAMA_EMBEDDINGS_MODEL')}")
google_embeddings = genai.Client(api_key=os.getenv('GOOGLE_API_KEY'))

class VectorStoreManager:
    def __init__(self, qdrant_client = qdrant_client, embeddings = "google"):
        self.client = qdrant_client
        self.embeddings = embeddings
        self.collection_name = "economic_documents"
        self.setup_collection()

    def setup_collection(self):
        """Initialize Qdrant collection"""

        try:
            # We set the vector size to 3072 because that's what google embeddings returns, different embeddings models
            # may return different vector sizes and this would need to be adjusted to accommodate if using a local model.
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
            )
        except Exception as e:
            print(f"Collection might already exist: {e}")

    def embed_documents(self, chunks: List[Dict]):
        """Embed document chunks and store in vector database"""
        points = []

        for i, chunk in enumerate(chunks):
            # Create enriched content for embedding
            enriched_content = f"""
            Title: {chunk['title']}
            Source: {chunk['source']}
            Content: {chunk['content']}
            """

            processed_embeddings = []
            tries = True
            num_tries = 0
            embedding = ''
            # Generate embedding, if the google api fails we want to try a couple more times before moving on to the next chunk.
            if self.embeddings == "google":
                while tries:
                    try:
                        num_tries += 1
                        embedding = google_embeddings.models.embed_content(
                            model="gemini-embedding-001",
                            contents=enriched_content
                        )
                        tries = False
                    except (ServerError, RemoteProtocolError) as e:
                        if num_tries == 5:  # Move to next entry
                            tries = False
                            continue
                        print(f"Issue with Google Server: {e}")
                        time.sleep(30)  # Pause for 30 seconds before trying to create the embeddings again
                if embedding:
                    processed_embeddings = embedding.embeddings[0].values
            elif self.embeddings == "ollama":
                processed_embeddings = ollama_embeddings.embed_query(enriched_content)
            else:
                print(f"Embeddings model not available for embedding type: {self.embeddings}.")
                continue

            # Create point for Qdrant
            point = PointStruct(
                id=i,
                vector=processed_embeddings,
                payload={
                    'chunk_id': chunk['chunk_id'],
                    'content': chunk['content'],
                    'title': chunk['title'],
                    'source': chunk['source'],
                    'date': chunk['date'],
                    'doc_type': chunk['doc_type']
                }
            )
            points.append(point)

            # Progress tracking
            if i % 100 == 0 or i == len(chunks) - 1:  # Periodic progress update, plus one for the final chunk
                print(f"   Processed chunk {i + 1}/{len(chunks)}")

        # Cache the embedded points locally (assumes a 'db' directory exists in the working directory)
        with open(Path.cwd() / 'db' / 'points.p', 'wb') as f:
            pickle.dump(points, f)

        # Chunk the points to upsert into the Qdrant db to avoid issues with payload size
        for i in range(0, len(points), 10):
            points_to_upsert = points[i:i + 10]

            # Upload to Qdrant
            self.client.upsert(
                collection_name=self.collection_name,
                points=points_to_upsert
            )

        print(f"Embedded and stored {len(points)} document chunks")
        return None
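To populate the vector store with the chunks processed in Step 3, instantiation and embedding might look like this; pass embeddings="ollama" instead if you prefer the local model:

Python
# Embed all document chunks and store them in Qdrant
vector_store = VectorStoreManager(embeddings="google")
vector_store.embed_documents(all_chunks)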

Step 6: GraphRAG Query System

Building the GraphRAG Query Interface

We’re almost finished! The final component is our GraphRAG query class, which retrieves relevant information and data to provide context for LLM responses.

Extracting Entities from User Questions

Before querying the knowledge graph, we must extract appropriate entities from user questions. This process combines NLP techniques with preset keyword matching to identify indicator codes, indicator categories, and economic concepts within queries.

Python
class EconGraphRag:
    """
    Combines the knowledge graph with the vector search to conduct the graphRAG
    """
    def __init__(self, neo4j_driver = neo4j_driver, qdrant_client = qdrant_client,
                 provider = "google", llm_type = "google", collection_name = "economic_documents"):
        self.driver = neo4j_driver
        self.qdrant_client = qdrant_client
        self.embeddings = google_client if provider == "google" else ollama_embeddings
        self.provider = provider
        self.llm = google_client if llm_type == "google" else ollama_llm
        self.llm_type = llm_type
        self.collection_name = collection_name

    def extract_query_entities(self, question: str) -> Dict[str, List[str]]:
        """Extract entities and concepts from user question"""
        query = nlp(question)  # Reuse the spaCy pipeline loaded in Step 1

        entities = {
            'countries': [],
            'indicator_codes': [],
            'indicator_categories': [],
            'years': [],
            'concepts': []
        }

        # Extract entities
        for ent in query.ents:
            if ent.label_ == "GPE":
                entities['countries'].append(ent.text.lower())
            elif ent.label_ == "DATE":
                # Simple year extraction
                year_match = re.search(r'\b(19|20)\d{2}\b', ent.text)
                if year_match:
                    entities['years'].append(year_match.group())


        # World Bank WDI Indicator Code Mappings
        wdi_indicators = {
            # GDP Growth indicators
            'gdp growth': 'NY.GDP.MKTP.KD.ZG',  # GDP growth (annual %)
            'economic growth': 'NY.GDP.MKTP.KD.ZG',

            # Inflation indicators
            'inflation': 'FP.CPI.TOTL.ZG',  # Inflation, consumer prices (annual %)
            'inflation rate': 'FP.CPI.TOTL.ZG',
            'consumer price index': 'FP.CPI.TOTL.ZG',  # Consumer price index (2010 = 100)
            'cpi': 'FP.CPI.TOTL.ZG',
            'price level': 'FP.CPI.TOTL.ZG',

            # Employment indicators
            'unemployment': 'SL.UEM.TOTL.ZS',  # Unemployment, total (% of total labor force)
            'unemployment rate': 'SL.UEM.TOTL.ZS',
            'not working': 'SL.UEM.TOTL.ZS',

            # Trade indicators
            'trade': 'NE.TRD.GNFS.ZS',  # Trade (% of GDP)
            'trade balance': 'NE.TRD.GNFS.ZS',
            'import and exports': 'NE.TRD.GNFS.ZS',

            # Fiscal indicators
            'government debt': 'GC.DOD.TOTL.GD.ZS',  # Central government debt, total (% of GDP)
            'debt': 'GC.DOD.TOTL.GD.ZS',
            'public debt': 'GC.DOD.TOTL.GD.ZS',
        }

        # Category mappings for broader searches
        indicator_categories = {
            'growth': 'Growth',
            'economic growth': 'Growth',
            'inflation': 'Inflation',
            'prices': 'Inflation',
            'employment': 'Employment',
            'unemployment': 'Employment',
            'jobs': 'Employment',
            'working': 'Employment',
            'trade': 'Trade',
            'exports': 'Trade',
            'imports': 'Trade',
            'trade balance': 'Trade',
            'fiscal': 'Fiscal',
            'debt': 'Fiscal',
            'government': 'Fiscal',
            'budget': 'Fiscal'
        }

        question_lower = question.lower()

        # Check for specific indicator code matches
        for indicator, code in wdi_indicators.items():
            if indicator in question_lower:
                entities['indicator_codes'].append(code)
                entities['concepts'].append(indicator)

        # Check for category matches
        for indicator, category in indicator_categories.items():
            if indicator in question_lower:
                entities['indicator_categories'].append(category)
                if indicator not in entities['concepts']:
                    entities['concepts'].append(indicator)

        # Extract country names and codes that might not be caught by NER
        common_countries = {
            'us': 'united states', 'usa': 'united states', 'america': 'united states',
            'uk': 'united kingdom', 'britain': 'united kingdom',
            'china': 'china', 'prc': 'china',
            'india': 'india',
            'germany': 'germany',
            'france': 'france',
            'japan': 'japan',
            'brazil': 'brazil',
            'russia': 'russia',
            'canada': 'canada',
            'australia': 'australia',
            'south korea': 'south korea', 'korea': 'south korea',
            'mexico': 'mexico',
            'italy': 'italy',
            'spain': 'spain',
            'netherlands': 'netherlands',
            'switzerland': 'switzerland',
            'sweden': 'sweden',
            'norway': 'norway',
            'denmark': 'denmark'
        }

        for country_variant, country_name in common_countries.items():
            if country_variant in question_lower and country_name not in entities['countries']:
                entities['countries'].append(country_name)

        # Remove duplicates while preserving order
        for key in entities:
            entities[key] = list(dict.fromkeys(entities[key]))
        return entities
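As a quick illustration, once the class is instantiated (as in the testing section below), a question such as "What was Brazil's GDP growth in 2021?" would produce an entity dictionary roughly like the following; the exact contents depend on the spaCy model and the keyword mappings above:

Python
entities = graph_rag.extract_query_entities("What was Brazil's GDP growth in 2021?")
# Approximately:
# {
#     'countries': ['brazil'],
#     'indicator_codes': ['NY.GDP.MKTP.KD.ZG'],
#     'indicator_categories': ['Growth'],
#     'years': ['2021'],
#     'concepts': ['gdp growth', 'growth']
# }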

Querying the Knowledge Graph for Economic Context

Next, we’ll define the function to query our knowledge graph for comprehensive economic information. The system searches for data related to specific countries mentioned in questions, plus relevant indicator codes or categories.

This approach handles both specific and broad economic queries. For example, “How has GDP growth changed over the past 5 years?” will capture relevant indicators and historical data across multiple countries.

The function also searches document nodes to find reports related to mentioned countries or economic concepts. This cross-country knowledge building ensures that events in one country (like the US) that may impact another (like Brazil) are captured and surfaced in responses.

Python
    def query_graph(self, entities: Dict[str, List[str]]) -> List[dict]:
        """Query the knowledge graph based on extracted entities, order by the year desc"""
        results = []

        with self.driver.session() as session:
            # Query for country data points
            if entities['countries']:
                for country in entities['countries']:
                    query = """
                        MATCH (c:Country)-[:HAS_DATA_POINT]->(dp:DataPoint)-[:MEASURES]->(i:Indicator)
                        MATCH (dp)-[:FOR_YEAR]->(y:Year)
                        WHERE toLower(c.name) CONTAINS $country or toLower(c.code) = $country
                    """
                    params = {'country': country}

                    # Filter by indicator codes if specified
                    if entities['indicator_codes']:
                        query += " AND i.code IN $indicator_codes"
                        params['indicator_codes'] = entities['indicator_codes']
                    elif entities['indicator_categories']:
                        query += " AND i.category IN $indicator_categories"
                        params['indicator_categories'] = entities['indicator_categories']

                    if entities['years']:
                        query += " AND y.value in $years"
                        params['years'] = [int(y) for y in entities['years']]

                    query += """
                        RETURN c.name as country, i.name as indicator, i.code as indicator_code,
                           i.category as category, y.value as year, dp.value as value, dp.last_updated as last_updated
                        ORDER BY dp.last_updated, y.value DESC
                        LIMIT 20
                    """

                    result = session.run(query, params)
                    results.extend([dict(record) for record in result])

            # Query for related documents
            if entities['countries'] or entities['concepts']:
                doc_query = """
                    MATCH (d:Document)
                    WHERE 
                    """
                conditions = []
                params = {}

                if entities['countries']:
                    conditions.append("""
                            EXISTS {
                                MATCH (d)-[:MENTIONS]->(c:Country)
                                WHERE ANY(country IN $countries WHERE toLower(c.name) CONTAINS country)
                            }
                        """)
                    params['countries'] = entities['countries']

                if entities['concepts']:
                    conditions.append("""
                            EXISTS {
                                MATCH (d)-[:DISCUSSES]->(ec:EconomicConcept)
                                WHERE ec.name IN $concepts
                            }
                        """)
                    params['concepts'] = entities['concepts']

                doc_query += " OR ".join(conditions)
                doc_query += """
                    RETURN d.chunk_id as chunk_id, d.title as title, 
                           d.source as source, d.content as content
                    LIMIT 10
                    """

                if conditions:
                    doc_result = session.run(doc_query, params)
                    doc_results = [dict(record) for record in doc_result]
                    results.extend(doc_results)

            # Query for indicators without specific countries
            if entities['indicator_codes'] or entities['indicator_categories']:
                query = """
                        MATCH (c:Country)-[:HAS_DATA_POINT]->(dp:DataPoint)-[:MEASURES]->(i:Indicator)
                        MATCH (dp)-[:FOR_YEAR]->(y:Year)
                        WHERE 1=1
                        """
                params = {}

                # Filter by specific indicator codes
                if entities['indicator_codes']:
                    query += " AND i.code IN $indicator_codes"
                    params['indicator_codes'] = entities['indicator_codes']
                # Filter by categories if no specific codes
                elif entities['indicator_categories']:
                    query += " AND i.category IN $indicator_categories"
                    params['indicator_categories'] = entities['indicator_categories']

                if entities['years']:
                    query += " AND y.value IN $years"
                    params['years'] = [int(y) for y in entities['years']]

                query += """
                                RETURN c.name as country, i.name as indicator, i.code as indicator_code,
                                       i.category as category, y.value as year, dp.value as value, dp.unit as unit
                                ORDER BY y.value DESC, c.name ASC
                                LIMIT 50
                            """

                result = session.run(query, params)
                results.extend([dict(record) for record in result])

        return results

Adding Semantic Search for Comprehensive Retrieval

In addition to searching through our knowledge graph, we’ll conduct a semantic search to ensure we capture all the information and avoid missing any relevant economic context or relationships.

Python
    def semantic_search(self, question: str, limit: int = 10) -> List[Dict]:
        """Perform semantic search on document vectors"""
        if self.provider == "google":
            query_embedding = self.embeddings.models.embed_content(
                    model="gemini-embedding-001",
                    contents=question
                )
            if query_embedding:
                processed_embeddings = query_embedding.embeddings[0].values
            else:
                print("Embeddings not found.")
                return None
        else:
            processed_embeddings = self.embeddings.embed_query(question)

        search_results = self.qdrant_client.query_points(
            collection_name=self.collection_name,
            query=processed_embeddings,
            limit=limit
        )

        return [
            {
                'content': result.payload['content'],
                'title': result.payload['title'],
                'source': result.payload['source'],
                'score': result.score
            }
            for result in search_results.points
        ]

Generating LLM Responses with Complete Context

Finally, we’ll implement functions that integrate all components and enable the LLM to generate comprehensive answers using context from both our knowledge graph and vector database.

Python
    def generate_answer(self, question: str, graph_results: List[Dict],
                        semantic_results: List[Dict]) -> str:
        """Generate final answer using LLM"""

        # Prepare context
        context_parts = []

        # Add graph-derived structured data
        if graph_results:
            context_parts.append("STRUCTURED DATA FROM KNOWLEDGE GRAPH:")
            for result in graph_results[:10]:  # Limit context size
                if 'value' in result:  # Numeric data
                    context_parts.append(
                        f"- {result.get('country', 'N/A')}: {result.get('indicator', 'N/A')} "
                        f"in {result.get('year', 'N/A')} was {result.get('value', 'N/A')}"
                    )
                else:  # Document data
                    context_parts.append(f"- {result.get('title', 'N/A')}: {result.get('content', 'N/A')[:200]}...")

        # Add semantic search results
        if semantic_results:
            context_parts.append("\nRELEVANT DOCUMENT EXCERPTS:")
            for result in semantic_results:
                context_parts.append(
                    f"- From '{result['title']}' ({result['source']}): {result['content'][:500]}..."
                )

        context = "\n".join(context_parts)

        prompt = f"""
            Based on the following structured economic data and document excerpts, 
            please provide a comprehensive answer to the question: "{question}"
    
            Available Context:
            {context}
    
            Please provide a detailed answer that:
            1. Uses specific data points when available
            2. Explains relationships between different economic indicators
            3. References sources appropriately
            4. Acknowledges any limitations in the available data
    
            Answer:
        """

        if self.llm_type == "google":
            response = self.llm.models.generate_content(
                model="gemini-2.5-flash",
                contents=prompt,
            )
            return response.text
        return self.llm.invoke(prompt)

    def answer_question(self, question: str) -> str:
        """Main method to answer economic questions"""
        print(f"Processing question: {question}")

        # Extract entities from question
        entities = self.extract_query_entities(question)
        print(f"Extracted entities: {entities}")

        # Query knowledge graph
        graph_results = self.query_graph(entities)
        print(f"Found {len(graph_results)} graph results")

        # Perform semantic search
        semantic_results = self.semantic_search(question)
        print(f"Found {len(semantic_results)} semantic results")

        # Generate answer
        answer = self.generate_answer(question, graph_results, semantic_results)
        return answer

Congratulations: Your GraphRAG System is Complete!

You’ve successfully built a GraphRAG system for economic data analysis that combines structured World Bank indicators with unstructured economic documents. This foundation provides a solid base for expansion and customization.

Next Steps and Resources:

  • For PDF processing capabilities, watch our detailed YouTube video tutorial
  • Access the complete GraphRAG implementation on our GitLab repository with enhanced features
  • Explore additional data sources and expand your economic entity types
  • Implement automated document collection for real-time updates

Your system now enables sophisticated economic analysis by connecting quantitative data with qualitative insights from reports and research papers.

Testing the Economic GraphRAG

The hard work is complete, and it’s time to see your GraphRAG system in action. This step is straightforward: define several test questions covering different economic scenarios and send them to the answer_question method of our EconomicGraphRAG class.

Sample Test Questions:

  • Country-specific economic performance queries
  • Cross-country comparative analyses
  • Historical trend questions spanning multiple years
  • Policy impact assessments combining structured and unstructured data

Watch as your system retrieves relevant data from both the knowledge graph and the vector database to generate comprehensive, contextually rich responses.

Sending Test Questions to Answer

Python
# Initialize GraphRAG system
graph_rag = EconomicGraphRAG()

# Test questions
test_questions = [
    "What was Brazil's GDP growth in 2021 and what factors contributed to it?",
    "How does Argentina's inflation compare to other Latin American countries?",
    "What economic challenges are facing Latin America according to recent reports?",
    "Which countries had the highest government debt levels and what were the underlying causes?"
]

for question in test_questions:
    print(f"\n{'=' * 60}")
    print(f"QUESTION: {question}")
    print(f"{'=' * 60}")

    answer = graph_rag.answer_question(question)
    print(f"\nANSWER:\n{answer}")
    print(f"{'=' * 60}")

Advanced Features and Extensions

For our more advanced readers, we’ve included additional functions to extend the GraphRAG system’s capabilities. These advanced features are covered in detail in our YouTube video tutorial.

Temporal Analysis for Economic Trends

Analyzing trends over time is fundamental in economic analysis. This function enables trend analysis for specific country-indicator combinations, integrating seamlessly into our EconomicGraphRAG class to provide historical insights and pattern recognition across economic datasets.

Python
# Add this method to the EconomicGraphRAG class; it assumes `import numpy as np`
# and `from typing import Dict, List` at module level.
def analyze_trends(self, country: str, indicator: str, years: List[int]) -> Dict:
    """Analyze trends over time for specific country-indicator combinations"""
    with self.driver.session() as session:
        query = """
        MATCH (c:Country {name: $country})-[:HAS_DATA_POINT]->(dp:DataPoint)-[:MEASURES]->(i:Indicator {name: $indicator})
        MATCH (dp)-[:FOR_YEAR]->(y:Year)
        WHERE y.value IN $years
        RETURN y.value as year, dp.value as value
        ORDER BY y.value
        """
        
        result = session.run(query, {
            'country': country, 
            'indicator': indicator, 
            'years': years
        })
        
        data = [dict(record) for record in result]
        
        # Calculate trend metrics
        if len(data) > 1:
            values = [d['value'] for d in data]
            trend = 'increasing' if values[-1] > values[0] else 'decreasing'
            avg_change = (values[-1] - values[0]) / (len(values) - 1)
            
            return {
                'data': data,
                'trend': trend,
                'average_change': avg_change,
                'volatility': np.std(values) if len(values) > 2 else 0
            }
        
        return {'data': data}
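As a quick illustration, assuming you have added this method to the EconomicGraphRAG class and instantiated it as graph_rag (as in the test script above), a call like the one below returns the raw data points plus simple trend metrics. The indicator name is only an example and must match an Indicator node in your graph.

Python
# Hypothetical usage: the indicator string must match a node name in your graph
trend = graph_rag.analyze_trends(
    country="Brazil",
    indicator="GDP growth (annual %)",
    years=[2018, 2019, 2020, 2021]
)
print(trend['data'])
print(trend.get('trend'), trend.get('average_change'))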

Multi-Country Comparison Analysis

Another essential concept in economic analysis is comparative assessment between countries. Multi-country comparisons provide valuable insights when countries have similar economic structures but show diverging indicator performance, helping identify policy effectiveness and structural differences.

Python
def compare_countries(self, countries: List[str], indicator: str, year: int) -> List[Dict]:
    """Compare multiple countries for a specific indicator and year"""
    with self.driver.session() as session:
        query = """
        MATCH (c:Country)-[:HAS_DATA_POINT]->(dp:DataPoint)-[:MEASURES]->(i:Indicator {name: $indicator})
        MATCH (dp)-[:FOR_YEAR]->(y:Year {value: $year})
        WHERE c.name IN $countries
        RETURN c.name as country, dp.value as value
        ORDER BY dp.value DESC
        """
        
        result = session.run(query, {
            'countries': countries,
            'indicator': indicator,
            'year': year
        })
        
        return [dict(record) for record in result]
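A similarly hedged usage sketch follows; the indicator name is illustrative and should be replaced with whatever indicator names exist in your graph.

Python
# Hypothetical usage: rank countries by an indicator for a single year
comparison = graph_rag.compare_countries(
    countries=["Brazil", "Argentina", "Chile"],
    indicator="Central government debt, total (% of GDP)",
    year=2021
)
for row in comparison:
    print(f"{row['country']}: {row['value']}")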

Best Practices and Considerations

1. Data Quality and Validation

  • Implement data validation: Check for missing values, outliers, and inconsistencies in WDI data (a minimal sketch follows this list).
  • Document provenance: Track data sources and update timestamps.
  • Handle data revisions: World Bank data is frequently revised; implement versioning.
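To make the first bullet concrete, here is a minimal sketch (not part of the tutorial’s codebase) of a pre-load validation step over a pandas DataFrame of WDI observations. The column names country, indicator, year, and value are assumptions based on the structure used earlier; adjust them to match your own loader.

Python
import pandas as pd

def validate_wdi_frame(df: pd.DataFrame, z_threshold: float = 4.0) -> pd.DataFrame:
    """Return rows with missing values or crude per-indicator outliers.

    Assumes columns: country, indicator, year, value.
    """
    report = df.copy()
    # Flag rows where the numeric observation is missing
    report['missing_value'] = report['value'].isna()

    # Flag outliers within each indicator using a simple z-score rule
    def _zscore(series: pd.Series) -> pd.Series:
        std = series.std()
        if not std or pd.isna(std):
            return pd.Series(0.0, index=series.index)
        return (series - series.mean()).abs() / std

    report['outlier'] = (
        report.groupby('indicator')['value'].transform(_zscore) > z_threshold
    )
    return report[report['missing_value'] | report['outlier']]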

2. Graph Schema Evolution

  • Start simple: Begin with core entities (Country, Indicator, Year, Document).
  • Iterate incrementally: Add new entity types and relationships based on user needs (see the sketch after this list).
  • Version control: Maintain schema versions to handle updates gracefully.
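As an example of incremental iteration, a hypothetical Policy entity could be added with a few lines; the Policy label, its property names, and the ENACTED relationship type below are illustrative and not part of the schema built earlier in this tutorial.

Python
# Hypothetical schema extension: link a named policy to an existing Country node
def add_policy(self, country: str, policy_name: str, year: int):
    """Create a Policy node and connect it to a Country node."""
    with self.driver.session() as session:
        session.run(
            """
            MATCH (c:Country {name: $country})
            MERGE (p:Policy {name: $policy_name, year: $year})
            MERGE (c)-[:ENACTED]->(p)
            """,
            {'country': country, 'policy_name': policy_name, 'year': year}
        )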

3. Performance Optimization

  • Index frequently queried properties: Country codes, indicator codes, years (see the index-creation sketch below).
  • Limit context size: Prevent overwhelming the LLM with too much information.
  • Cache common queries: Store results for frequently asked questions.
  • Batch processing: Process documents and data updates in batches.
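Here is a minimal sketch of the indexing idea, assuming the node labels and properties used in the queries above (Country.name, Indicator.name, Year.value) and Neo4j’s CREATE INDEX ... IF NOT EXISTS syntax:

Python
def create_indexes(self):
    """Create indexes on the properties that the graph queries filter on most often."""
    index_statements = [
        "CREATE INDEX country_name IF NOT EXISTS FOR (c:Country) ON (c.name)",
        "CREATE INDEX indicator_name IF NOT EXISTS FOR (i:Indicator) ON (i.name)",
        "CREATE INDEX year_value IF NOT EXISTS FOR (y:Year) ON (y.value)",
    ]
    with self.driver.session() as session:
        for statement in index_statements:
            session.run(statement)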

4. Evaluation and Monitoring

In production, it is important to implement a simple check on the coherence and credibility of the LLM’s output. This quality-control step not only improves response accuracy but also provides feedback for refining knowledge graph construction, guiding how nodes and edges are created for better retrieval and analysis.
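One simple way to approximate such a check (a sketch under assumptions, not the tutorial’s implementation) is to ask the same LLM whether the generated answer is actually supported by the retrieved context. The prompt wording and the three-label scoring scheme below are assumptions you would tune for your own setup.

Python
def check_answer_grounding(self, question: str, answer: str, context: str) -> str:
    """Ask the LLM to judge whether an answer is supported by the retrieved context."""
    review_prompt = f"""
        You are reviewing an answer for factual grounding.
        Question: {question}
        Retrieved context: {context[:3000]}
        Answer: {answer}

        Reply with one word: SUPPORTED, PARTIALLY_SUPPORTED, or UNSUPPORTED,
        followed by a one-sentence justification.
    """
    if self.provider == "google":
        response = self.llm.models.generate_content(
            model="gemini-2.5-flash",
            contents=review_prompt,
        )
        return response.text
    return self.llm(review_prompt)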

Conclusion

GraphRAG represents a significant advancement in economic data analysis by seamlessly combining structured quantitative data with unstructured textual information. This tutorial demonstrates how to build a practical system that can answer complex economic questions by leveraging the relationships between data points, countries, indicators, and policy documents.

The key advantages of this approach include:

  • Connecting quantitative indicators with the qualitative context found in reports and policy papers.
  • Answering complex, multi-hop questions by traversing explicit relationships between countries, indicators, years, and documents.
  • Providing a clear trail from each answer back to the underlying data points and source excerpts.

As you implement this system, remember that the quality of your graph schema and entity extraction directly impacts the system’s effectiveness. Start with a focused domain (like Latin American economics) and gradually expand to other regions and economic topics.

The future of economic analysis lies in systems that can seamlessly traverse between “what happened” (quantitative data) and “why it happened” (qualitative explanations), making GraphRAG an essential tool for modern economic research and policy analysis.
