Building Specialized AI Capabilities for Secure Software Development
We have been expanding our security-focused AI capabilities to support a more complete secure development lifecycle. We are thrilled to announce that we have trained and built our first in-house AI model, which we are releasing as Cortex-LLM 1.0. The goal is straightforward: help engineering teams identify security issues earlier, understand them more clearly, and move toward remediation with less manual overhead.
This work brings together two complementary capabilities. One is focused on security analysis: reviewing selected code context and producing structured security findings that can be consumed by downstream systems. The other is focused on security remediation: helping convert a validated issue into a practical, targeted code change.
Together, these capabilities support a closed-loop workflow: analyze code, validate findings, recommend action, apply fixes, and re-check results.
Why Specialization Matters
Security work benefits from specialization. A general-purpose model can be useful for broad reasoning, but secure software workflows often require more precise behavior: consistent output formats, low false-positive rates, actionable recommendations, and a clear separation between assessment and remediation.
Rather than relying on one broad model to do everything, this work uses specialized model behavior for different stages of the workflow. Analysis-oriented behavior is optimized for structured findings and triage. Remediation-oriented behavior is optimized for producing focused security fixes after enough context has been gathered.
This separation improves reliability. It also makes evaluation more meaningful, because each capability can be measured against the task it is actually expected to perform.
Structured Outputs for Real Workflows
One of the most important themes has been structure. Security review output needs to be useful not only to humans, but also to tools and automation. A paragraph of explanation may be helpful in a chat interface, but production workflows often need machine-readable findings that can be routed, ranked, displayed, stored, or validated.
The analysis capability was therefore shaped around structured findings. Outputs are expected to include consistent elements such as severity, affected file, evidence, impact, and recommendation. This makes the result easier to integrate into developer tools, CI systems, review workflows, and human-in-the-loop security triage.
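As a rough illustration, a finding with these elements could be modeled and validated as shown here. This is a minimal sketch, not the actual Cortex-LLM output schema: the field names mirror the elements listed above, while the `Finding` class, the `parse_finding` helper, and the allowed severity values are assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict

# Assumption: the exact severity vocabulary is not given in this post.
ALLOWED_SEVERITIES = ("low", "medium", "high", "critical")


@dataclass(frozen=True)
class Finding:
    """Hypothetical structured finding with the elements named in the text."""
    severity: str
    affected_file: str
    evidence: str
    impact: str
    recommendation: str

    def __post_init__(self):
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        for name, value in asdict(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"field {name!r} must be a non-empty string")


def parse_finding(raw: str) -> Finding:
    """Parse one JSON object into a Finding, raising ValueError on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("a finding must be a JSON object")
    try:
        return Finding(**data)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"unexpected shape: {exc}") from exc
```

Rejecting malformed findings at the parsing boundary means downstream tools such as CI systems or triage dashboards only ever see objects with a known shape.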
The remediation capability follows the same principle of focus. It is intended to produce targeted security-oriented code changes, not broad unrelated rewrites. In secure development, a good fix is not only correct; it is also minimal, reviewable, and aligned with the surrounding code.
Training for Behavior, Not Just Knowledge
The machine learning work behind this has focused less on adding generic knowledge and more on shaping behavior. The objective is not simply to make a model “know about security.” The objective is to make it behave consistently inside a software engineering workflow.
That means training and evaluation emphasize:
- Valid structured output
- Clear distinction between vulnerable and safe code
- Actionable evidence and recommendations
- Low false-positive behavior
- Stable formatting under varied inputs
- Useful responses for both analysis and remediation tasks
In practical terms, this required careful dataset construction, balancing vulnerable and safe examples, and reinforcing the expected output contract. The emphasis on false positives is deliberate: a system that reports too many of them quickly loses developer trust.
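To make the false-positive concern concrete, the sketch below computes the standard rates from paired labels and predictions. It assumes a simplified setting where each code sample is labeled either vulnerable (`True`) or safe (`False`). The function names are illustrative and are not taken from our evaluation code.

```python
def confusion_counts(labels, predictions):
    """Count outcomes for paired booleans, where True means 'vulnerable'."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    tp = fp = tn = fn = 0
    for label, pred in zip(labels, predictions):
        if label and pred:
            tp += 1
        elif pred:          # flagged, but the sample was safe
            fp += 1
        elif label:         # vulnerable, but not flagged
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(labels, predictions):
    """Return false-positive rate, false-negative rate, and precision."""
    tp, fp, tn, fn = confusion_counts(labels, predictions)

    def ratio(numerator, denominator):
        # None signals "undefined" when a class is absent from the data.
        return numerator / denominator if denominator else None

    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "precision": ratio(tp, tp + fp),
    }
```

Tracking the false-positive rate on safe-code cases separately from the miss rate on vulnerable cases shows which side of the balance a dataset change actually moved.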
Evaluation Across Multiple Levels
A single benchmark is not enough to understand whether a security model is useful. The evaluation approach now uses multiple layers.
A small smoke test checks for obvious regressions: invalid output, broken schemas, or missed common security scenarios. A curated evaluation set measures behavior across known vulnerability categories and safe-code cases. A broader held-out evaluation gives a better read on generalization across more varied real-world patterns.
This layered approach helps separate different kinds of failure. Sometimes a model understands the issue but returns the wrong structure. Sometimes the output is truncated. Sometimes a benchmark label is noisy. Sometimes the model is simply missing the vulnerability. Treating all of those as the same kind of failure would lead to the wrong next step.
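One way to keep these failure modes apart is to classify each evaluation result before aggregating it. The sketch below is illustrative only. It assumes the model returns a JSON list of findings, with an empty list meaning "no issue found", and it uses a simple heuristic for truncation. The `Outcome` names and `classify` function are hypothetical. Noisy benchmark labels cannot be detected this way and still need manual review.

```python
import json
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    TRUNCATED = "truncated"
    WRONG_STRUCTURE = "wrong_structure"
    MISSED_VULNERABILITY = "missed_vulnerability"
    FALSE_POSITIVE = "false_positive"


# Assumed key names, based on the finding elements described earlier.
REQUIRED_KEYS = {"severity", "affected_file", "evidence", "impact", "recommendation"}


def classify(raw_output: str, expected_vulnerable: bool) -> Outcome:
    """Assign one evaluation result to a failure category."""
    text = raw_output.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Heuristic: output that opens a JSON value but never closes it
        # was probably cut off by a token limit.
        if text.startswith(("{", "[")) and not text.endswith(("}", "]")):
            return Outcome.TRUNCATED
        return Outcome.WRONG_STRUCTURE

    well_formed = isinstance(data, list) and all(
        isinstance(item, dict) and REQUIRED_KEYS.issubset(item) for item in data
    )
    if not well_formed:
        return Outcome.WRONG_STRUCTURE

    reported = len(data) > 0
    if expected_vulnerable and not reported:
        return Outcome.MISSED_VULNERABILITY
    if reported and not expected_vulnerable:
        return Outcome.FALSE_POSITIVE
    return Outcome.CORRECT
```

Each category points to a different lever: truncation suggests inference settings, wrong structure suggests post-processing or output-contract training, and missed vulnerabilities suggest data.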
Metrics from Evaluation Benchmarks
We are also releasing initial performance metrics for Cortex-LLM 1.0 using two widely adopted evaluation benchmarks.
First, we use CyberSecEval from the PurpleLlama project to measure secure instruction-following behavior. This helps evaluate whether a model can respond appropriately to security-sensitive prompts, follow safe development guidance, and avoid behavior that could introduce or amplify risk. This aligns directly with our broader goal of building AI capabilities that support secure-by-default software development.
Second, we use HumanEval to measure code completion accuracy across models. HumanEval provides a standardized way to evaluate whether a model can generate functionally correct code from programming prompts. While secure development requires more than code correctness alone, this benchmark helps establish a baseline for general coding capability, which is important for both security analysis and remediation workflows.
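For readers unfamiliar with the benchmark, HumanEval results are commonly reported as pass@k: the probability that at least one of k sampled completions passes the unit tests for a problem. The sketch below shows the standard unbiased estimator for a single problem, given n samples of which c passed. Which k and sampling settings apply to the figures discussed here is not stated in this post, so treat the example as background rather than a description of our setup.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed, k: attempt budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb returns 0 when n - c < k, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems.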
Cortex-LLM 1.0 shows a strong balance of capability and efficiency across both benchmark views. When secure instruction-following performance on CyberSecEval is plotted against runtime performance, Cortex-LLM 1.0 lands in the desirable quadrant, combining strong security-aligned behavior with practical execution speed.
The same pattern holds for code completion accuracy versus runtime performance on HumanEval, where Cortex-LLM 1.0 again leads the comparison by delivering high coding accuracy without sacrificing responsiveness. Together, these results indicate that Cortex-LLM 1.0 is not only competitive on model quality, but also well-positioned for real-world developer workflows where both correctness and latency matter.
Production Reliability Requires More Than Model Accuracy
A useful security AI system is more than a model checkpoint. It also needs robust runtime behavior.
For structured findings, the surrounding system should validate outputs, normalize common response shapes, and fall back when the result is unusable. This kind of post-processing is common in production ML systems. It does not replace strong model behavior, but it makes the overall system more resilient.
For example, if a model returns an equivalent finding in a slightly different JSON shape, the system can often normalize it safely. If the output is invalid, empty, or incomplete, the workflow can fall back to a broader review path. This combination of model specialization, validation, normalization, and fallback gives the system a much stronger production posture than model output alone.
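A minimal sketch of this normalize-then-fall-back pattern follows. It is not our production code: the accepted response shapes, the `KEY_ALIASES` mapping, and the `normalize` and `review` helpers are assumptions chosen to illustrate the idea.

```python
import json

# Assumed canonical keys, based on the finding elements described earlier.
REQUIRED_KEYS = ("severity", "affected_file", "evidence", "impact", "recommendation")

# Hypothetical alternative key names a model might emit.
KEY_ALIASES = {"file": "affected_file", "path": "affected_file", "fix": "recommendation"}


def normalize(raw):
    """Return a list of canonical finding dicts, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None

    # Accept a bare list, a single object, or an object wrapping a "findings" list.
    if isinstance(data, dict) and isinstance(data.get("findings"), list):
        data = data["findings"]
    elif isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return None

    findings = []
    for item in data:
        if not isinstance(item, dict):
            return None
        renamed = {KEY_ALIASES.get(key, key): value for key, value in item.items()}
        if any(key not in renamed for key in REQUIRED_KEYS):
            return None  # incomplete finding: do not guess the missing fields
        findings.append({key: renamed[key] for key in REQUIRED_KEYS})
    return findings


def review(raw, fallback):
    """Use normalized findings when possible, otherwise call the fallback path."""
    findings = normalize(raw)
    return fallback() if findings is None else findings
```

Returning `None` for unusable output, as opposed to an empty list, keeps "the model found nothing" distinct from "the model output could not be trusted", so only the latter triggers the broader review path.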
A Practical Secure Development Loop
The broader direction is a secure development workflow where AI assists at multiple stages without overreaching.
During review, specialized analysis behavior can help identify and structure likely issues. During triage, findings can be validated, ranked, or escalated. During remediation, focused security-fix behavior can help produce targeted patches. Afterward, the same review loop can help confirm that the issue has been addressed without introducing new risk.
This creates a more useful developer experience than isolated one-off suggestions. It also aligns better with how engineering and security teams actually work: context first, analysis second, remediation third, validation after that.
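The loop described above can be sketched as a small driver that takes each stage as a function. Everything here is illustrative: `secure_loop`, its parameters, and the round limit are assumptions, and in a real workflow the validation step would typically involve a human reviewer or additional tooling.

```python
from typing import Callable, List, Tuple


def secure_loop(
    code: str,
    analyze: Callable[[str], List[dict]],
    validate: Callable[[dict], bool],
    remediate: Callable[[str, dict], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Repeat analyze -> validate -> remediate until no validated findings remain.

    Returns the (possibly patched) code and whether the final check was clean.
    """
    for _ in range(max_rounds):
        confirmed = [finding for finding in analyze(code) if validate(finding)]
        if not confirmed:
            return code, True
        for finding in confirmed:
            code = remediate(code, finding)

    # Re-check after the last round of fixes so the result is never assumed clean.
    remaining = [finding for finding in analyze(code) if validate(finding)]
    return code, not remaining
```

For example, with a toy analyzer that flags disabled TLS verification:

```python
def toy_analyze(code):
    return [{"issue": "tls-verification-disabled"}] if "verify=False" in code else []


patched, clean = secure_loop(
    "requests.get(url, verify=False)",
    analyze=toy_analyze,
    validate=lambda finding: True,
    remediate=lambda code, finding: code.replace("verify=False", "verify=True"),
)
```

Bounding the number of rounds keeps a fix that never resolves its finding from looping forever, and the final flag makes an unresolved result explicit.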
Current Direction
The latest work shows promising progress. Structured security analysis is becoming more reliable, especially when paired with validation and runtime normalization. Remediation behavior is being kept focused on explicit security fixes rather than broad code generation. Evaluation has become more realistic, with separate checks for curated behavior, held-out generalization, false positives, false negatives, and output validity.
The most important lesson is that improvement does not always mean training longer. In applied ML systems, the best next step may be better data, better evaluation, better inference settings, or better post-processing. Knowing which lever to pull is part of making these systems production-ready.
The result is a more practical foundation for AI-assisted secure development: specialized where it needs to be, structured enough for automation, and resilient enough to fit into real engineering workflows.
As the USA heads into its 250th year of independence, we at Pervaziv AI envision achieving our own version of independence: AI model independence. We’re excited for what is on the horizon. Stay tuned for more!


