
[BUG] Video File Attachment Error in Big-AGI 2.0 RC1 #761

Open
powyncify opened this issue Feb 20, 2025 · 4 comments
Labels
type: bug Something isn't working

Comments


powyncify commented Feb 20, 2025

Environment

Big-AGI 2.0-rc1 deployed on Vercel

Description

Attempting to attach a video file (.mp4) results in an error message, and the file is ignored during processing. This issue occurs across several models but is particularly critical with the Gemini 2.0 models, which are designed to excel at video analysis. Notably, the same .mp4 file processes successfully when uploaded to Gemini AI Studio using the same models, either Gemini 2.0 Flash or Gemini 2.0 Pro Experimental.

Device and browser

Edge on Windows 11. Big-AGI 2.0.0-rc1 deployed on Vercel.

Screenshots and more

[screenshot attached]

Willingness to Contribute

  • 🙋‍♂️ Yes, I would like to contribute a fix.
powyncify added the type: bug label on Feb 20, 2025
enricoros (Owner) commented:

Thanks @powyncify. I confirm that videos are not supported as an input type, for any model.

Great request; I don't recall anyone else asking for this yet. There are various limitations (e.g. Vercel's max request size of 4.5 MB) and infrastructure constraints (Gemini has to upload to some temp storage), and given it's only supported by a single vendor (Gemini), this is probably not going to come anytime soon in Big-AGI.
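For reference, the Vercel request-size constraint could be guarded on the client before any upload is attempted. A minimal sketch in Python (the function and overhead value are illustrative, not Big-AGI's actual code):

```python
# Hypothetical pre-upload guard for Vercel's 4.5 MB request-body limit.
VERCEL_MAX_REQUEST_BYTES = int(4.5 * 1024 * 1024)  # 4.5 MB

def fits_vercel_limit(file_size_bytes: int, overhead_bytes: int = 16 * 1024) -> bool:
    """Return True if the attachment (plus assumed request overhead) fits the limit."""
    return file_size_bytes + overhead_bytes <= VERCEL_MAX_REQUEST_BYTES
```

A check like this would let the UI reject an oversized video with a clear message instead of a failed request.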

I believe Gemini takes videos and converts them to a sequence of images, 1 second apart. Doing that would make videos work with any Vision model. Would that be an option, and what's your full use case?
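That 1-frame-per-second conversion could be done with ffmpeg's `fps` filter. A sketch that builds the command (assumes ffmpeg is installed; the function name and output pattern are illustrative):

```python
# Sketch: sample frames from a video at a fixed rate with ffmpeg.
import subprocess

def frame_extraction_cmd(video_path: str, out_pattern: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that extracts `fps` frames per second as images."""
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", out_pattern]

# e.g. subprocess.run(frame_extraction_cmd("clip.mp4", "frame_%04d.png"))
```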

powyncify (Author) commented:

Bonjour @enricoros

Thank you so much for your incredibly quick attention to this bug report! We truly appreciate it. And on a broader note, I want to commend you on your vision in creating Big-AGI. It's shaping up to be what I believe is the best interface for LLMs available (and we've tested many, many of them).

You're absolutely right in recognizing our use of video. We're essentially using it as a convenient way to force image recognition. To clarify our use case: we typically record long strings of text – things like chat messages or social media posts – using video. Then, we perform OCR on the video frames. We don't need the system to process video per se; image processing is sufficient.

Currently, we achieve this by recording the text using video at a low frame rate (around 5 frames per second), which results in relatively small and manageable file sizes. If Big-AGI could handle this workflow by processing the video as a sequence of images, as you suggested, that would be absolutely wonderful and perfectly address our needs.

Thanks again for your responsiveness and for building such an amazing tool!

enricoros (Owner) commented:

Thanks for describing the use case. I could implement "backward compatibility" of videos to text (as many LLMs don't even support images yet), with:

  • extraction of all video frames to images (5 per second in your case), possibly with sub-sampling (e.g. every 1, 5, 24, 30, or 60 frames)

  • OCR of images to text.

This would generate a frame-by-frame OCR transcript of the input video, and any LLM (e.g. DeepSeek R1) would be able to process it effectively.

Although I can't work on this right now, I like this solution (similar to the 1 frame/s conversion to images that Gemini performs on videos), as it enables a lossy fallback to images, or to text frames.
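The sub-sampling step above boils down to picking every Nth frame index. A tiny sketch (function name is illustrative):

```python
# Sketch of the frame sub-sampling step: keep every `stride`-th frame.
def subsample_frames(total_frames: int, stride: int) -> list[int]:
    """Return the indices of frames to keep when sampling every `stride` frames."""
    return list(range(0, total_frames, stride))

# A 10-second clip recorded at 5 fps has 50 frames; a stride of 5
# leaves 10 frames (one per second) to pass to OCR.
```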

Thanks for the idea, hopefully one day we can get to developing this, it would be fun.

(Also note that we have the option to record directly from the screen, but only a single frame right now.)

powyncify (Author) commented:

@enricoros thank you for considering this feature request! The proposed "backward compatibility" approach for video-to-text conversion is an excellent idea for a future enhancement. We understand that implementing it may not be immediately feasible.

In the meantime, a simpler way to address the initial bug report (#761) would be to allow uploading video files to models that natively support them, such as Gemini. This would provide a direct workaround for our use case and let us leverage the existing video processing capabilities of those models without requiring any additional conversion logic within Big-AGI.

This approach would resolve the error message and enable us to utilize video input with compatible models, while the more comprehensive video-to-text feature can be considered for future development.

Thanks for your consideration!
