Implementing Embeddings via ONNX with Semantic Kernel for Local RAG Solutions in .NET


Introduction

In our previous article «Building a RAG API with .NET, Semantic Kernel, Phi-3 and Qdrant», we focused on setting up a local Retrieval-Augmented Generation (RAG) API with minimal dependencies on external services. One key piece of that architecture, which we only briefly touched upon, is the embedding service.

In this follow-up, we’ll dive deeper into how to implement an embedding generator using ONNX models directly in .NET, powered by Semantic Kernel. This is the missing link that brings together raw text and vector databases like Qdrant in your local RAG stack. We’ll also update some of the service registrations and implementation details based on our experience evolving the solution.


Why Local Embeddings?

Relying on external APIs like OpenAI for embeddings comes with drawbacks:

  • Cost per request
  • Latency
  • Privacy and data residency concerns

By using local ONNX models, you gain:

  • Full control over the embedding process
  • Faster response times
  • Zero API cost
  • Better integration with edge or disconnected environments

Recommended Model: bge-micro-v2

A recommended model is bge-micro-v2, a small, efficient embedding model:

  • Lightweight (~90MB ONNX file)
  • Fast inference without GPU
  • Semantically rich embeddings
  • Compatible with Semantic Kernel’s ONNX connector

Required Files:

  • model.onnx
  • vocab.txt

To download the files from Hugging Face CLI:

huggingface-cli download TaylorAI/bge-micro-v2 --local-dir ./bge-micro-v2

Code Adjustments from Our Previous RAG Setup

As this article builds directly on our previous RAG implementation, we’ve made small but important refinements to the service registrations and pipeline components. These changes let local embedding generation integrate seamlessly with Semantic Kernel’s ONNX connector and respect real-world constraints such as vector dimensionality.

Updated Kernel Registration in ASP.NET Core

In this new version of our RAG setup, we register the Semantic Kernel with both Phi-3 for generation and the ONNX embedding generator:

builder.Services.AddSingleton<Kernel>(_ =>
{
    // Local path to the Phi-3 ONNX model used for chat completion (generation).
    const string modelPath =
        @"C:\ai-models\phi-3\models\Phi-3-mini-4k-instruct-onnx\cpu_and_mobile\cpu-int4-awq-block-128";

    // Register Phi-3 for generation and bge-micro-v2 (BERT ONNX) for embeddings in the same kernel.
    var kernel = Kernel.CreateBuilder()
        .AddOnnxRuntimeGenAIChatCompletion("phi-3", modelPath)
        .AddBertOnnxEmbeddingGenerator(
            onnxModelPath: @"C:\ai-models\bge-micro-v2\onnx\model.onnx",
            vocabPath: @"C:\ai-models\bge-micro-v2\vocab.txt")
        .Build();

    return kernel;
});

builder.Services.AddSingleton<IEmbeddingService, EmbeddingService>();

// Expose the kernel's embedding generator directly so other services can inject it.
builder.Services.AddSingleton<IEmbeddingGenerator<string, Embedding<float>>>(sp =>
{
    var kernel = sp.GetRequiredService<Kernel>();
    return kernel.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();
});

Updated QdrantIndexer with Correct Embedding Dimensions

To match the 384-dimensional output of bge-micro-v2, we updated the QdrantIndexer and its collection configuration:

public class QdrantIndexer(
    QdrantClient client,
    ProductCatalogService productService,
    IEmbeddingService embeddingService)
{
    public async Task IndexProductsAsync()
    {
        try
        {
            var collections = await client.ListCollectionsAsync();
            if (!collections.Contains("products"))
            {
                // The vector size must match the 384-dimensional output of bge-micro-v2.
                await client.CreateCollectionAsync(
                    "products",
                    new VectorParams { Size = 384, Distance = Distance.Cosine });
            }

            var embeddings = await Task.WhenAll(
                productService.GetAllProducts().Select(async p =>
                    new PointStruct
                    {
                        Id = new PointId { Num = (ulong)p.Id },
                        Vectors = new Vectors
                        {
                            Vector = new Vector
                            {
                                Data =
                                {
                                    (await embeddingService.GenerateEmbedding(p.Description)).ToArray()
                                }
                            }
                        },
                        Payload =
                        {
                            { "id", new Value { IntegerValue = p.Id } },
                            { "name", new Value { StringValue = p.Name } },
                            { "description", new Value { StringValue = p.Description } }
                        }
                    }));

            // Use a batch approach to avoid large payload issues
            const int batchSize = 100;
            for (var i = 0; i < embeddings.Length; i += batchSize)
            {
                var batch = embeddings.Skip(i).Take(batchSize).ToList();
                await client.UpsertAsync("products", batch);
            }
        }
        catch (Exception ex)
        {
            throw new Exception($"Error indexing products: {ex.Message}", ex);
        }
    }
}

💡 Note: In the earlier setup we used a dummy vector of 1536 dimensions. With bge-micro-v2 the output vector has 384 dimensions, so the QdrantIndexer has been updated accordingly to avoid errors like: Vector dimension error: expected dim: 1536, got 384.
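
At query time, the same embedding service produces the vector we search with. Here’s a minimal sketch of what retrieval could look like, assuming the Qdrant client’s SearchAsync API and an illustrative userQuery string:

// Sketch: embed the user's query with the same 384-dimensional model used for indexing,
// then retrieve the closest products from the "products" collection.
var queryVector = await embeddingService.GenerateEmbedding(userQuery);
var results = await client.SearchAsync("products", queryVector, limit: 3);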


Embedding Service Implementation

We’ve moved from a dummy placeholder to a production-ready embedding service that integrates directly with Semantic Kernel:

public class EmbeddingService(IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator)
    : IEmbeddingService
{
    // First simple demo using a dummy vector.
    //public float[] GenerateEmbedding(string text)
    //{
    //    // Dummy vector, replace with real embeddings
    //    return new float[1536];
    //}

    public async Task<ReadOnlyMemory<float>> GenerateEmbedding(string text)
    {
        return await embeddingGenerator.GenerateVectorAsync(text);
    }
}
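
For reference, the IEmbeddingService abstraction used above boils down to a single method; a minimal sketch consistent with the calls in EmbeddingService and QdrantIndexer:

public interface IEmbeddingService
{
    Task<ReadOnlyMemory<float>> GenerateEmbedding(string text);
}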

Why Embedding Models Matter

While we’ve chosen bge-micro-v2 as the default model for this RAG implementation, the truth is — there’s no one-size-fits-all solution when it comes to embeddings.

Different applications have different constraints: performance, language support, model size, or domain-specific needs. That’s why I’ve included a few representative case studies below.

These examples will help you:

  • Choose the right embedding model for your use case (not just follow the default).
  • Understand how model characteristics like vector dimension, multilingual support, or inference speed affect your application.
  • See how easily you can switch or integrate alternatives within the same Semantic Kernel pipeline.

Let’s take a look at some other models.

🧠 Ultra-Light for Prototyping (all-MiniLM-L6-v2)

If your main concern is speed and size, and you can compromise a bit on semantic richness:

  • Model: all-MiniLM-L6-v2
  • Size: ~50MB
  • Strength: blazing fast inference
  • Use case: small devices, fast lookups, prototyping

Download via Hugging Face CLI:

huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 --local-dir ./all-MiniLM-L6-v2

You may need to convert the model to ONNX using tools like optimum.
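
For example, with Hugging Face’s Optimum tooling installed, an export along these lines should produce the ONNX files (treat the exact command and flags as a sketch; they can vary between versions):

optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 ./all-MiniLM-L6-v2-onnx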


🌍 Multilingual & Hybrid Search (bge-m3)

For production-grade systems with multilingual content and hybrid search:

  • Model: bge-m3
  • Size: ~300MB+
  • Strength: multilingual + dense/sparse embeddings
  • Use case: global e-commerce, hybrid RAG with ElasticSearch and Qdrant

Download via Hugging Face CLI:

huggingface-cli download BAAI/bge-m3 --local-dir ./bge-m3

Remember to update your Qdrant collection’s vector dimension based on the model’s output.
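
For example, bge-m3 produces 1024-dimensional dense vectors, so the collection creation in our QdrantIndexer would change along these lines (a sketch, not a drop-in diff):

// The collection size must match bge-m3's dense output instead of bge-micro-v2's 384 dimensions.
await client.CreateCollectionAsync(
    "products",
    new VectorParams { Size = 1024, Distance = Distance.Cosine });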


🧪 Custom Embedding Service (Simplified)

You can implement your own embedding service when experimenting or prototyping. Here’s a mock-up of a basic custom embedding generator, a step up from our initial dummy:

public class DummyEmbeddingService : ITextEmbeddingGenerationService
{
    public IReadOnlyDictionary<string, object?> Attributes => new Dictionary<string, object?>();

    public Task<IList<ReadOnlyMemory<float>>> GenerateEmbeddingsAsync(
        IList<string> texts,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        var output = texts.Select(text =>
        {
            // Derive a deterministic pseudo-embedding (128 dimensions) from the text's hash code.
            var hash = text.GetHashCode();
            var vector = Enumerable.Range(0, 128).Select(i => (float)((hash + i) % 100) / 100).ToArray();
            return new ReadOnlyMemory<float>(vector);
        }).ToList();

        return Task.FromResult<IList<ReadOnlyMemory<float>>>(output);
    }
}

This implementation is not semantically meaningful, but it’s handy for testing end-to-end RAG flows before you integrate a real embedding model.
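
If you want to wire the dummy generator into the kernel for such a test, a minimal sketch could look like this (recent Semantic Kernel versions mark ITextEmbeddingGenerationService as experimental, so you may need to suppress the SKEXP0001 warning):

var kernelBuilder = Kernel.CreateBuilder();
// Register the dummy generator in place of the real ONNX embedding model.
kernelBuilder.Services.AddSingleton<ITextEmbeddingGenerationService, DummyEmbeddingService>();
var kernel = kernelBuilder.Build();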


Why Not Use Phi-3 or Phi-4 for Embeddings?

Models like Phi-3 and Phi-4 are powerful language models optimized for text generation, not for producing sentence-level embeddings.

While technically you could extract hidden states or apply pooling manually to simulate embeddings, these models:

  • Don’t expose an embedding layer directly (like BERT’s [CLS] token output)
  • Are not optimized for semantic vector similarity
  • Don’t integrate with Semantic Kernel’s embedding interface

For now, it’s best to use Phi-3 for generation and models like bge-micro-v2 or bge-m3 for retrieval embeddings.


Conclusion

Here’s a quick comparison of the models covered in this article:

Model          | Multilingual | Speed     | Quality   | Dimension | Best For
bge-micro-v2   | No           | Fast      | Good      | 384       | Local RAG, E-commerce
bge-m3         | Yes          | Medium    | Very Good | 1024+     | Global & Hybrid Search
all-MiniLM-L6  | No           | Very Fast | Medium    | 384       | Prototyping, edge devices
Custom (dummy) | N/A          | Instant   | None      | Variable  | Local test & debug

With local ONNX models and Semantic Kernel, embedding generation in .NET becomes a powerful, cost-effective, and private solution. This enables fully offline RAG pipelines when combined with models like Phi-3 for generation and Qdrant for retrieval.

Whether you’re optimizing for speed, multilingual support, or custom control, there’s a local embedding strategy that fits.

This article builds directly on the setup introduced in our previous post and replaces the dummy embedding placeholder with real semantic encoding.


And remember, all the code is available in the GitHub repo.


Hope this helps. Happy AI coding!

