Introduction
At Alinia, our mission is making interactions between humans and machines safer in the most critical scenarios and at scale. We pursue this goal in two complementary ways: through our no-code platform, which enables domain experts to configure controls around AI systems, and through proprietary compliance and security models designed to meet real regulatory and business needs. One example is our investment guard, built last year to address a specific financial compliance requirement.
Previously, we offered a combined safety-and-security guardrail. For this iteration, we intentionally narrowed our scope. By focusing exclusively on security threats—such as prompt injection and jailbreaks—we aimed to build a more precise, higher-quality model. In parallel, our work over the past year on toxicity detection also has informed improvements to our multilingual safety models, helping set a stronger foundation overall.
The Challenge
Securing AI systems across languages is not just a complex technical problem, which involves detecting and preventing threats like prompt injection, data exfiltration, and policy violations in multi-turn AI interactions, it is a language and context problem. Attacks that look obvious in English can appear subtle, indirect, or culturally embedded in other languages. A strong security guard must recognize harmful intent even when it is expressed differently, obliquely, or through local linguistic conventions.
Data Preparation
Building effective training and test sets for multilingual security guardrails presented a unique set of challenges. We started by aggregating data from open source spanning jailbreak attempts, prompt injections, and adversarial attacks. The real challenge wasn’t volume, it was quality and relevance. After filtering for quality and relevance to our security focus, our usable examples reduced by almost 50%. We needed balanced representation across English, Spanish, and Catalan, but most existing datasets were English-only.
This meant aggressive translation efforts. Proprietary translation services delivered high-quality Spanish translations quickly, but at significant cost. Open-source alternatives provided a broader coverage of languages (Catalan and Spanish), but quality issues emerged. This tension between cost and quality became a recurring challenge, requiring us to carefully balance budget constraints against the need for clean, reliable multilingual training data.
Finding the right language balance turned out to be an iterative process. We experimented with several different language distributions across multiple rounds, gradually converging on a mix that boosted Catalan performance without compromising accuracy in English or Spanish.
Taxonomy Generation
We wanted to granularly categorize the types of attacks, so we created our own taxonomy. We reviewed existing frameworks, attempting to synthesize them into a coherent classification system for our needs. What we quickly discovered was that security attack categories have significant overlap— for example, distinguishing between «contextual manipulation» and «role-play scenarios» in practice proved far more difficult than in theory. After multiple iterations, we created a hierarchical taxonomy that we could utilize for synthetic data generation and human data curation efforts.
Collecting Red Teaming Data
We designed a red teaming challenge for university students to collect red teaming data that could break through our Security Guard (SecGuard), and potentially cover any gaps in our first data collection run. This was part of guest lectures for Data Analytics Masters students at Universitat Pompeu Fabra Business School of Management (UPF-BSM). We started by educating the students on what red teaming is, what guardrail models are, and why testing and red teaming is so important for development. After that, students utilized a UI specifically designed to make red teaming fun to interact with SecGuard. We collected the data, including their inputs (prompt) as well as when they managed to break the security guard. They also had the option to indicate what kind of attack it was based on our taxonomy.
Better Data, Smaller Models, Stronger Guardrails
These lessons directly shaped our second candidate model. Cleaning mislabeled data, incorporating human red teaming examples, and using high-quality synthetic attacks allowed us to train a smaller, faster, and more accurate model—with better multilingual performance and lower latency.
Results
We evaluated five candidate models, as well as the current production model (SecGuard v1) against three different in-distribution test sets to understand performance across different data distributions. We quickly narrowed it down to 2 top candidate models.
Candidate-2, which was trained after incorporating cleaner data, red teaming examples, and synthetic adversarial samples, significantly outperformed all other models in 2 data sets, and candidate-1 in the 3rd data set, where SecGuard v1 was the winner.
Given a new and improved test set, candidate-2 achieved the strongest overall performance at 98% accuracy, showing impressive gains across all three languages. Candidate-1’s performance remained below 90% on this test set, and SecGuard v1 showed weaker performance at 61% overall, highlighting the value of our focused security-specific training approach.
Given the clear superiority across all 3 languages, we chose candidate-2 as the model for the next version of our security guardrails, SecGuard v2.
However, it’s important to note that these results reflect performance on data similar to our training distribution in all three test sets. In real-world deployments, organizations may encounter domain-specific language patterns, industry jargon, or novel attack vectors that differ from our training distributions, which could require additional fine-tuning or adaptation to maintain this level of performance.
Latency
Beyond accuracy improvements, SecGuard v2.1 delivers substantial latency gains over v2.0. We benchmarked performance across two configurations: pure GPU execution time using the Transformer backend, and end-to-end latency on our production infrastructure using TensorRT.
On the Transformer backend, which measures isolated GPU inference time, v2.1 achieved a 2.43x speedup with mean latency of 83.15ms compared to v2.0’s 201.96ms. The production deployment tells a similar story. Running on TensorRT, end-to-end latency improved 1.29x from 169.4ms to 131.2ms. Breaking this down further, GPU execution time dropped from 44-70ms in v2.0 to just 13-20ms in v2.1—roughly a 3x improvement at the model level. Network overhead remains consistent at around 120ms for both versions, which partially masks the raw performance gains in end-to-end metrics.
Limitations
Despite strong performance improvements in SecGuard v2, several limitations remain. The model is a binary classifier. While multi-label security classification remains an important long-term goal, overlapping attack patterns make it difficult to achieve reliably today.
Translation quality —especially for Catalan— also limited usable data, and some linguistic edge cases may remain underrepresented.
Finally, the model prioritizes recall to minimize missed attacks. This can lead to occasional false positives, which organizations can mitigate through threshold tuning based on their risk tolerance.
Try Alinia’s Security Guard today
There are several ways to try our Security Guard today, whether you are non-technical or a developer.
Non-technical users
You can immediately try Alinia’s security guard by going to Alinia’s Security Guard public demo:
https://huggingface.co/spaces/alinia/sec_guard_demo
This demo is designed for a general audience to be able to try it and provide feedback; no technical skills are required to interact with it.
Alinia’s Security Guard demo shows both the unguarded LLM response and the Alinia guarded response side-by-side as shown in the screenshot below.
Users can adjust the “Detection threshold” for the security guard to find the best signal vs noise ratio. Currently, the model is calibrated so that if the probability is above 0.5 it will be flagged as a potential violation and thus the user query will be blocked.
As shown in the results section, Alinia Security Guard v2 detects adversarial attacks in Catalan and Spanish with higher accuracy than our previous model (v1). This can be seen by comparing the new version of Alinia Security Guard (v2) and the previous version of Security Guard (v1), which are shown on the demo as follows:
Adversarial attach likelihood: <% by v2> (v1: <% by v1>)
Example: Adversarial attach likelihood: 99.96% (v1: 3.73%)
When the guard model makes a mistake by flagging benign, safe content (False Positive) or by not flagging adversarial content (False Negatives), users can provide this feedback by clicking on the “Send Feedback” button at the bottom of the screen. This will help our team continue to improve our Security models. See screens below.
Developers
For developers, any-guardrail provides an easy way to try and integrate Alinia’s Security guards (v1 and v2) as well as all the other Alinia guardrails, including compliance and safety guards. Any-guardrail provides a unified framework to switch out and experiment with any guardrail, which can help developers see how Alinia’s Security guards compare to competitors.
To view complete and up to date documentation on Alinia, please request an API key.
For more information on how to integrate Alinia guardrails via any-guardrail, see this blog post. Make sure to try this cookbook for a more in-depth introduction to our API through any-guardrail.
Conclusion
At Alinia, we work tirelessly to bridge the gap between multilingual AI safety and security and the practical needs of global enterprises.
In this blog post we share what it takes to build and improve a multilingual security guardrail.
We are excited for customers to try and benefit from our latest multilingual security guardrails improvements to continue to make AI deployments more secure across languages in critical business scenarios.
Acknowledgments
Daniel Nissani for his leadership and valuable collaboration throughout this project.
Ana Freire for making it so easy to run Red Teaming Competitions during her masters classes at UPF-BSM on very short notice.
Barcelona Supercomputing Center (BSC) and the Aina Project for their support on this project, and for their continuous investment in accelerating the use and development of artificial intelligence tools in Catalan, a low-resource language when compared to languages like English and Spanish. They enable companies like ours to connect with many other companies and institutions to share our multilingual research.