API Reference

LogosKG is a production-grade library for efficient multi-hop knowledge graph retrieval, optimized specifically for LLM-KG applications at scale.

Knowledge Graph

LogosKG operates on graph data structured as a list of (head, relation, tail) triplets. Before initializing the engine, ensure your knowledge graph is parsed into this standard format.
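For example, a minimal graph in this format (toy clinical triplets for illustration, not drawn from UMLS) looks like:

```python
# A toy knowledge graph as a list of (head, relation, tail) triplets.
triplets = [
    ("fever", "symptom_of", "influenza"),
    ("cough", "symptom_of", "influenza"),
    ("influenza", "treated_by", "oseltamivir"),
]

heads = {h for h, _, _ in triplets}
tails = {t for _, _, t in triplets}
entities = heads | tails  # 4 unique entities in this toy graph
```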

Pre-built UMLS SNOMED CUI graph object (with physician-selected relations pertinent to diagnosis): download

This file is about 700 MB.

Reference: The customized clinical relations and graph subsets are derived from the DR.KNOWs repository.

Installation & Setup

```bash
# 1. Clone the repository
git clone https://github.com/Serendipity618/LogosKG-Efficient-and-Scalable-Graph-Retrieval.git

# 2. Enter the repository directory
cd LogosKG-Efficient-and-Scalable-Graph-Retrieval

# 3. Install dependencies from requirements.txt
pip install -r requirements.txt
```

Core Architecture

💡 Vectorized Topology: The graph is decomposed into three CSR matrices: a Subject matrix (Sub), an Object matrix (Obj), and a Relation matrix (Rel). This transforms pointer-chasing traversal into highly optimized sparse matrix multiplications.
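A minimal sketch of this idea on toy data (the library's actual internals may differ): each matrix has one row per edge, so one traversal step becomes two sparse products — seeds select their incident edges via Sub, and Obj maps those edges to their tail entities.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy graph: index entities and relations, then build edge-incidence matrices.
triplets = [("fever", "symptom_of", "flu"),
            ("cough", "symptom_of", "flu"),
            ("flu", "treated_by", "oseltamivir")]
entities = sorted({x for h, _, t in triplets for x in (h, t)})
relations = sorted({r for _, r, _ in triplets})
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: i for i, r in enumerate(relations)}

n_edges = len(triplets)
ones = np.ones(n_edges)
rows = np.arange(n_edges)
# Sub[e, i] = 1 iff edge e starts at entity i; Obj and Rel are analogous.
Sub = csr_matrix((ones, (rows, [e_idx[h] for h, _, _ in triplets])),
                 shape=(n_edges, len(entities)))
Obj = csr_matrix((ones, (rows, [e_idx[t] for _, _, t in triplets])),
                 shape=(n_edges, len(entities)))
Rel = csr_matrix((ones, (rows, [r_idx[r] for _, r, _ in triplets])),
                 shape=(n_edges, len(relations)))

# One hop = two sparse products: seed vector -> incident edges -> tail entities.
seed = np.zeros(len(entities))
seed[e_idx["fever"]] = 1.0
frontier = Obj.T @ (Sub @ seed)
reached = [entities[i] for i in np.flatnonzero(frontier)]
print(reached)  # ['flu']
```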

LogosKG (Small / In-Memory Engine)

The standard high-performance engine designed for knowledge graphs that fit entirely within system RAM or GPU VRAM.

class LogosKG(triplets: List[Tuple[str, str, str]], backend: str = 'numba', device: str = 'cpu')

Initializes the engine, maps string entities to internal indices, and automatically constructs the CSR topology matrices.

Parameters:
- triplets (List[Tuple[str, str, str]]): List of (head, relation, tail) tuples representing the graph.
- backend (str, default "numba"): Computation backend. Supported options: "scipy", "numba", or "torch".
- device (str, default "cpu"): Target hardware device. Use "cuda" with backend="torch" for GPU acceleration.

Returns:
- LogosKG: An initialized engine instance ready for multi-hop queries.

1. retrieve_at_k_hop

def retrieve_at_k_hop(entity_ids: List[str], hops: int, shortest_path: bool = True) -> List[str]

Retrieves the entities located exactly hops steps away from the seed entities.

Parameters:
- entity_ids (List[str]): Seed anchor entities (e.g., extracted symptoms).
- hops (int): Exact traversal depth; must be non-negative.
- shortest_path (bool, default True): If True, prevents revisiting nodes discovered in earlier hops.

Returns:
- List[str]: Unique entity identifiers located exactly at the specified depth.

2. retrieve_within_k_hop

def retrieve_within_k_hop(entity_ids: List[str], hops: int, shortest_path: bool = True) -> List[str]

Retrieves an accumulated list of all entities discovered from hop 0 up to hops.

Returns:
- List[str]: All unique entity identifiers encountered within the given depth.
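The difference between the two retrieval modes can be illustrated with a plain BFS over a toy adjacency list (illustrative only — the engine uses the sparse-matrix formulation instead):

```python
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}

def hop_frontiers(graph, seeds, hops):
    """Frontier sets for hop 0..hops, never revisiting nodes
    (the shortest_path=True behavior)."""
    visited = set(seeds)
    frontier = set(seeds)
    frontiers = [set(seeds)]
    for _ in range(hops):
        nxt = {t for s in frontier for t in graph.get(s, [])} - visited
        visited |= nxt
        frontier = nxt
        frontiers.append(nxt)
    return frontiers

fr = hop_frontiers(graph, ["A"], 2)
at_k = sorted(fr[-1])                # retrieve_at_k_hop: exactly 2 hops
within_k = sorted(set().union(*fr))  # retrieve_within_k_hop: hops 0 through 2
print(at_k)      # ['C']
print(within_k)  # ['A', 'B', 'C']
```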

3. retrieve_with_paths_at_k_hop

def retrieve_with_paths_at_k_hop(entity_ids: List[str], hops: int = 2, shortest_path: bool = True, max_paths_per_entity: Optional[int] = None) -> Dict[str, Any]

Retrieves entities at exactly K hops, returning both the entities and their reconstructed topological paths.

Parameters:
- max_paths_per_entity (Optional[int], default None): Limits the number of returned paths per target node to prevent memory blow-up in dense subgraphs.

Returns:
- Dict[str, Any]: A dictionary with "entities" (List[str]) and "paths" (a mapping from each endpoint to its list of paths).
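The documented return shape can be sketched with a toy path tracker (illustrative only; the engine reconstructs paths from its matrix traversal): each path of length exactly K is bucketed under its endpoint, optionally capped per entity.

```python
def paths_at_k_hop(graph, seeds, hops, max_paths_per_entity=None):
    """Toy sketch returning the documented {"entities": ..., "paths": ...}
    shape: every path of length exactly `hops`, optionally capped per endpoint."""
    paths = [[s] for s in seeds]
    for _ in range(hops):
        paths = [p + [t] for p in paths for t in graph.get(p[-1], [])]
    result = {}
    for p in paths:
        bucket = result.setdefault(p[-1], [])
        if max_paths_per_entity is None or len(bucket) < max_paths_per_entity:
            bucket.append(p)
    return {"entities": sorted(result), "paths": result}

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
out = paths_at_k_hop(graph, ["A"], 2)
print(out["entities"])    # ['D']
print(out["paths"]["D"])  # [['A', 'B', 'D'], ['A', 'C', 'D']]
```

With max_paths_per_entity=1, only the first path to "D" would be kept — this is what bounds memory in dense subgraphs.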

4. retrieve_with_paths_within_k_hop

def retrieve_with_paths_within_k_hop(entity_ids: List[str], hops: int = 2, shortest_path: bool = True, max_paths_per_entity: Optional[int] = None) -> Dict[str, Any]

Performs full path reconstruction for all entities discovered up to K hops. Crucial for providing interpretable context to LLMs.

Returns:
- Dict[str, Any]: Complete paths mapping the seed anchors to every discovered entity.

GPU Batch Optimization

While LogosKG (Small) exposes single-query signatures, it contains a powerful internal automatic batching engine. If backend='torch' and multiple entity_ids are provided simultaneously, the engine dynamically switches to _retrieve_at_k_hop_torch_batched(), exploiting PyTorch sparse matrix multiplications across concurrent seed dimensions for massive throughput.
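The batched formulation simply stacks the one-hot seed vectors of concurrent queries into a matrix, so a single sparse product advances every query's frontier at once. A CPU sketch of the math with scipy (the engine does the equivalent with PyTorch sparse tensors on GPU):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency of a toy 4-entity graph: 0 -> 1, 1 -> 2, 2 -> 3.
A = csr_matrix((np.ones(3), ([0, 1, 2], [1, 2, 3])), shape=(4, 4))

# Two concurrent queries, one seed row each:
# query 0 starts at node 0, query 1 starts at node 2.
seeds = np.zeros((2, 4))
seeds[0, 0] = 1.0
seeds[1, 2] = 1.0

# One sparse product advances both frontiers simultaneously.
frontiers = (A.T @ seeds.T).T
hits = [list(np.flatnonzero(row)) for row in frontiers]
print(hits)  # [[1], [3]]
```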

LogosKGLarge (Partitioned Engine)

For massive graphs (e.g., UMLS combined with PrimeKG) that exceed memory limits, LogosKGLarge implements disk-backed partitioning with an LRU-cache memory manager, avoiding Out-Of-Memory (OOM) errors while maintaining graph consistency.

Initialization

class LogosKGLarge(partition_dir: str, backend: str = 'numba', device: str = 'cpu', cache_size: int = 10, triplets: Optional[List] = None, num_partitions: int = 16)
Parameters:
- partition_dir (str): Directory containing the partitioned data (metadata.pkl).
- backend (str, default "numba") and device (str, default "cpu"): Same semantics as in LogosKG.
- cache_size (int, default 10): Number of subgraph partitions kept active in memory (LRU).
- triplets (Optional[List], default None): If partitions do not exist yet, provide raw triplets here to trigger KnowledgeGraphPartitioner automatically.
- num_partitions (int, default 16): Target number of subgraphs generated during automatic partitioning.

Returns:
- LogosKGLarge: A disk-backed, memory-efficient knowledge graph engine.
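The LRU policy behind cache_size can be sketched generically (this is not the library's internal code, just the standard mechanism it describes): at most cache_size partitions stay resident, a hit refreshes recency, and a miss evicts the least recently used partition.

```python
from collections import OrderedDict

class PartitionCache:
    """Toy LRU cache for graph partitions."""
    def __init__(self, loader, cache_size=2):
        self.loader = loader        # function: partition_id -> partition data
        self.cache_size = cache_size
        self.cache = OrderedDict()
        self.loads = 0              # counts simulated disk reads

    def get(self, pid):
        if pid in self.cache:
            self.cache.move_to_end(pid)    # hit: refresh recency
            return self.cache[pid]
        self.loads += 1                    # miss: "load from disk"
        self.cache[pid] = self.loader(pid)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return self.cache[pid]

cache = PartitionCache(loader=lambda pid: f"partition-{pid}", cache_size=2)
for pid in [0, 1, 0, 2, 0]:
    cache.get(pid)
print(cache.loads)  # 3 disk reads instead of 5: the repeated 0s hit the cache
```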

1. retrieve_at_k_hop

def retrieve_at_k_hop(entity_ids: List[str], hops: int, shortest_path: bool = True) -> List[str]

Performs a depth-hops traversal across partitions, automatically loading and unloading partition chunks via the LRU cache.

Returns:
- List[str]: Entities exactly at depth K, seamlessly bridging multiple partitions.

2. retrieve_within_k_hop

def retrieve_within_k_hop(entity_ids: List[str], hops: int, shortest_path: bool = True) -> List[str]

Accumulates entities from hop 0 to K across all necessary partitions.

Returns:
- List[str]: All unique entities within the depth boundary.

3. retrieve_with_paths_at_k_hop

def retrieve_with_paths_at_k_hop(entity_ids: List[str], hops: int = 2, shortest_path: bool = True, max_paths_per_entity: Optional[int] = None) -> Dict[str, Any]

Tracks topological path indices across multiple graph partitions simultaneously.

Returns:
- Dict[str, Any]: Endpoints at exactly hop K mapped to their cross-partition topological paths.

4. retrieve_with_paths_within_k_hop

def retrieve_with_paths_within_k_hop(entity_ids: List[str], hops: int = 2, shortest_path: bool = True, max_paths_per_entity: Optional[int] = None) -> Dict[str, Any]

The most comprehensive method. Reconstructs every step taken across all partitions up to depth K.

Returns:
- Dict[str, Any]: All discovered endpoints mapped to their complete pathways.

Batch Caching Optimization

Unlike per-query batching, LogosKGLarge provides specialized batch_retrieve_* methods. These analyze the subgraph partitions required by an entire array of queries, sorting and clustering the queries internally to maximize LRU cache hits and drastically reduce disk I/O bottlenecks.

def batch_retrieve_within_k_hop(batch_entity_ids: List[List[str]], hops: int, shortest_path: bool = True) -> List[List[str]]

Processes an entire batch of independent patient narratives / seed groupings simultaneously.

Returns:
- List[List[str]]: Results in the same order as the original input queries.
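The cache-aware reordering can be sketched as follows (illustrative; the library's actual clustering logic is not shown here): queries are sorted by the set of partitions their seeds touch, so queries needing the same partitions run back-to-back against a warm cache, and results are then scattered back into input order.

```python
def plan_batch(batch_entity_ids, partition_of):
    """Return an execution order that clusters queries by the
    partition set they require (partition_of: entity -> partition id)."""
    keyed = []
    for qi, seeds in enumerate(batch_entity_ids):
        key = tuple(sorted({partition_of[s] for s in seeds}))
        keyed.append((key, qi))
    keyed.sort()  # queries sharing a partition set become adjacent
    return [qi for _, qi in keyed]

partition_of = {"fever": 0, "cough": 0, "rash": 1, "edema": 1}
batch = [["fever"], ["rash"], ["cough"], ["edema"]]
order = plan_batch(batch, partition_of)
print(order)  # [0, 2, 1, 3]: both partition-0 queries run back-to-back
```

After executing queries in this order, writing each result into a list indexed by its original query id restores the documented input-order mapping.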

def batch_retrieve_with_paths_within_k_hop(batch_entity_ids: List[List[str]], hops: int = 2, ...) -> List[Dict[str, Any]]

Batch version of the full path reconstruction algorithm with LRU cache sorting logic applied.

Returns:
- List[Dict[str, Any]]: One dictionary per input query, containing the reconstructed paths for that query.