AI45 Lab


29 repositories

X-Boundary
Public
Code repository for the paper "X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability"
Python • 0 forks • 2 stars • 0 issues • 0 pull requests • Updated Feb 17, 2025
CELLO
Public
Python • Apache License 2.0 • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Feb 13, 2025
MORE
Public
JavaScript • Apache License 2.0 • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Feb 13, 2025
SelfConsciousness
Public
Python • Apache License 2.0 • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Feb 13, 2025
CaLM
Public
Python • Apache License 2.0 • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Feb 13, 2025
ADCE
Public
Official code for the paper "Beyond Surface Structure: A Causal Assessment of LLMs' Comprehension Ability".
Python • 1 fork • 0 stars • 0 issues • 0 pull requests • Updated Feb 13, 2025
SEER
Public
Self-Explainability Enhancement of LLMs’ Representations
Python • 0 forks • 5 stars • 0 issues • 0 pull requests • Updated Feb 11, 2025
ActorAttack
Public
Python • 3 forks • 74 stars • 0 issues • 0 pull requests • Updated Feb 3, 2025
ReflectionBench
Public
ReflectionBench
Python • 2 forks • 8 stars • 1 issue • 0 pull requests • Updated Jan 20, 2025
VLSBench
Public
Data and code for the paper "VLSBench: Unveiling Visual Leakage in Multimodal Safety"
Python • 1 fork • 30 stars • 1 issue • 0 pull requests • Updated Jan 17, 2025
MM-SafetyBench
Public
Python • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Jan 17, 2025
ESC-Eval
Public
Python • 0 forks • 1 star • 0 issues • 0 pull requests • Updated Jan 17, 2025
emulated-disalignment
Public
Python • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Jan 17, 2025
weak-to-strong-search
Public
Python • 0 forks • 2 stars • 0 issues • 0 pull requests • Updated Jan 17, 2025
modpo
Public
Python • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Jan 17, 2025
.github
Public
0 forks • 0 stars • 0 issues • 0 pull requests • Updated Jan 16, 2025
SALAD-BENCH
Public
【ACL 2024】 SALAD benchmark & MD-Judge
Python • Apache License 2.0 • 0 forks • 1 star • 0 issues • 0 pull requests • Updated Jan 16, 2025
REEF
Public
Repository for the paper "REEF: Representation Encoding Fingerprints for Large Language Models", which aims to protect the intellectual property (IP) of open-source LLMs.
Python • 3 forks • 40 stars • 0 issues • 0 pull requests • Updated Jan 16, 2025
CredID
Public
Python • 1 fork • 0 stars • 0 issues • 0 pull requests • Updated Dec 26, 2024
AIGC_detection
Public
Makefile • The Unlicense • 0 forks • 0 stars • 0 issues • 5 pull requests • Updated Dec 23, 2024
T2ISafety
Public
0 forks • 1 star • 0 issues • 0 pull requests • Updated Nov 25, 2024
DEAN
Public
Python • Apache License 2.0 • 1 fork • 10 stars • 0 issues • 0 pull requests • Updated Oct 25, 2024
CodeAttack
Public
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
Python • MIT License • 3 forks • 36 stars • 1 issue • 1 pull request • Updated Oct 25, 2024
MLLMGuard
Public
Python • 2 forks • 21 stars • 2 issues • 0 pull requests • Updated Oct 22, 2024
ivg
Public
Official repository of the paper "Inference-Time Language Model Alignment via Integrated Value Guidance"
Python • 0 forks • 0 stars • 0 issues • 0 pull requests • Updated Sep 27, 2024
Persafety
Public
Repository for the paper "The Better Angels of Machine Personality: How Personality Relates to LLM Safety".
0 forks • 3 stars • 0 issues • 0 pull requests • Updated Jul 2, 2024
TracingLLM
Public
Python • Apache License 2.0 • 4 forks • 0 stars • 0 issues • 0 pull requests • Updated May 22, 2024
Flames
Public
Flames is a highly adversarial Chinese benchmark for evaluating the harmlessness of LLMs, developed by Shanghai AI Lab and the Fudan NLP Group.
Apache License 2.0 • 0 forks • 41 stars • 1 issue • 0 pull requests • Updated May 21, 2024
Fake-Alignment
Public
Python • Apache License 2.0 • 0 forks • 5 stars • 0 issues • 0 pull requests • Updated Mar 22, 2024