
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

This repository is the official implementation of INT-FlashAttention.

About this repository

  • flash_atten_*.py contains the main Triton implementations of the different FlashAttention algorithms.
  • benchmark.py contains the performance benchmark for the algorithms above.
  • configs.py contains the configurations you may need to adjust for Triton autotuning (a minimal sketch follows this list).
  • csrc contains the CUDA implementation of our algorithm. Refer to the official FlashAttention repository for instructions on compiling it.
  • More details can be found in the folders.
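For reference, a Triton autotune configuration of the kind configs.py holds usually looks like the sketch below. The block sizes, stage counts, and the variable name autotune_configs are illustrative assumptions, not the repository's actual values.

```python
# Minimal sketch of a Triton autotune configuration list
# (assumed values, not the ones shipped in configs.py).
import triton

# Candidate launch configurations that @triton.autotune sweeps over;
# each entry fixes the tile sizes plus pipeline stages and warp count.
autotune_configs = [
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64},  num_stages=3, num_warps=4),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 64},  num_stages=3, num_warps=8),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128}, num_stages=4, num_warps=8),
]
```

A list like this is typically passed to the `@triton.autotune(configs=..., key=[...])` decorator on the attention kernel, so widening or narrowing it trades tuning time against the size of the search space.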
