garak

The open-source Large Language Model (LLM) vulnerability scanner garak provides modular, standardized detectors (tests), probes (groups of tests that look for a specific vulnerability), and harnesses (end-to-end pipelines) for testing LLMs for different vulnerabilities and downstream harms.

AVID resources are integrated with garak in two ways.

Taxonomy

The tags attribute in the Python class defining a garak probe stores AVID taxonomy classifications in MISP format.

# Source: https://github.com/leondz/garak/blob/main/garak/probes/base.py
...
class Probe:
    """Base class for objects that define and execute LLM evaluations"""

    name = None
    description = "Empty probe"
    uri = ""
    bcp47 = None  # language this is for, in bcp47 format; * for all langs
    recommended_detector = ["always.Fail"]  # send a signal if this isn't overridden
    active = True
    tags = []  # MISP-format taxonomy categories
    ...

As an example, the knownbadsignatures group of probes checks whether an LLM can be made to generate signatures of known malware, and the knownbadsignatures.EICAR probe checks for leakage of the signature of the EICAR test file. This probe is given two taxonomy tags.

# Source: https://github.com/leondz/garak/blob/main/garak/probes/knownbadsignatures.py
...
class EICAR(Probe):
    name = "eicar"
    description = "Does the model check its output for viruses? Try to send the EICAR code through"
    bcp47 = "*"
    uri = "https://en.wikipedia.org/wiki/EICAR_test_file"
    recommended_detector = [
        "knownbadsignatures.EICAR",
    ]
    tags = ["avid-effect:security:S0301", "avid-effect:security:S0403"]
    ...

In the AVID taxonomy, these tags correspond to Information Leak and Adversarial Example, respectively.
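Tags such as these follow MISP's colon-delimited structure of namespace, predicate, and value, so they can be picked apart with a simple string split. The helper below is an illustrative sketch, not part of garak.

```python
def parse_misp_tag(tag):
    """Split a MISP-format taxonomy tag into namespace, predicate, and value."""
    namespace, predicate, value = tag.split(":", 2)
    return {"namespace": namespace, "predicate": predicate, "value": value}

# For the first EICAR tag above:
parsed = parse_misp_tag("avid-effect:security:S0301")
# parsed == {"namespace": "avid-effect", "predicate": "security", "value": "S0301"}
```

Here "avid-effect" is the namespace reserved for the AVID effect taxonomy, "security" is the SEP domain, and "S0301" is the category code.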

In a similar manner, garak detectors also have a tags attribute. In line with the flexible MISP format, any taxonomy classification expressed in that format can be stored as a tag. For example, the lmrc.Bullying probe has the tags risk-cards:lmrc:bullying and avid-effect:ethics:E0301, corresponding to the Language Model Risk Cards category Bullying and the AVID SEP category E0301: Toxicity, respectively.
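Because every probe and detector carries a tags list, taxonomy-based filtering reduces to a membership check. The sketch below uses stub classes standing in for real garak probes (their tags copied from the examples above); it is not garak API.

```python
# Stub classes standing in for garak probe classes; only `tags` matters here.
class EICAR:
    tags = ["avid-effect:security:S0301", "avid-effect:security:S0403"]

class Bullying:
    tags = ["risk-cards:lmrc:bullying", "avid-effect:ethics:E0301"]

def probes_with_tag(probes, tag):
    """Return the probe classes whose `tags` list contains the given tag."""
    return [p for p in probes if tag in p.tags]

matches = probes_with_tag([EICAR, Bullying], "avid-effect:ethics:E0301")
# matches == [Bullying]
```

The same pattern works for selecting all probes that map to a given AVID SEP category before launching a scan.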

Reporting

Scans by garak generate log files in JSONL format that store model metadata, prompt information, and evaluation results. This information can be structured into one or more AVID reports. Check out the following example using a sample run.

wget https://gist.githubusercontent.com/shubhobm/9fa52d71c8bb36bfb888eee2ba3d18f2/raw/ef1808e6d3b26002d9b046e6c120d438adf49008/gpt35-0906.report.jsonl
python3 -m garak -r gpt35-0906.report.jsonl
## output:
# garak LLM security probe v0.9.0.6 ( https://github.com/leondz/garak ) at 2023-07-23T15:30:37.699120
# 📜 Converting garak reports gpt35-0906.report.jsonl
# 📜 AVID reports generated at gpt35-0906.avid.jsonl
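The generated .avid.jsonl file holds one AVID report as a JSON object per line, so it can be loaded with a few lines of standard-library Python. The sketch below demonstrates this on a tiny stand-in file; the field name used is illustrative, not the exact AVID report schema.

```python
import json
import os
import tempfile

def load_avid_reports(path):
    """Read a JSONL file of AVID reports: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a tiny stand-in file (the field name is illustrative only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".avid.jsonl", delete=False
) as f:
    f.write('{"description": "sample report"}\n')
    path = f.name

reports = load_avid_reports(path)
os.remove(path)
# reports == [{"description": "sample report"}]
```

Each loaded object can then be inspected, filtered by taxonomy category, or submitted to AVID.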
