Latest in Open Source Multimodal AI (open.substack.com)
from yogthos@lemmy.ml to technology@lemmy.ml on 23 Dec 2025 16:10
https://lemmy.ml/post/40709616

PE-AV - Audiovisual Perception with Code

Meta’s perception encoder for audio-visual understanding with open code release.
Processes both visual and audio information to isolate sound sources.
Paper | Code

T5Gemma 2 - Open Encoder-Decoder

Next-generation encoder-decoder model with fully open weights.
Combines bidirectional understanding with flexible text generation.
Blog | Model

Qwen-Image-Layered - Open Image Decomposition

Decomposes images into editable RGBA layers with full model release.
Each layer can be independently manipulated for precise editing.
Hugging Face | Paper | Demo
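
To make "independently manipulated" concrete, here is a minimal sketch of how a stack of RGBA layers can be edited and recomposited with the standard "over" operator. It does not use or reproduce the Qwen-Image-Layered model; the layers are tiny hand-made arrays, and the helper names `solid` and `composite_over` are hypothetical.

```python
import numpy as np

def solid(h, w, rgba):
    """Build an HxWx4 float layer filled with one RGBA value (illustrative helper)."""
    return np.tile(np.array(rgba, dtype=float), (h, w, 1))

def composite_over(layers):
    """Alpha-composite RGBA layers (floats in [0, 1]), bottom layer first."""
    h, w, _ = layers[0].shape
    out_rgb = np.zeros((h, w, 3))  # premultiplied color accumulator
    out_a = np.zeros((h, w, 1))    # accumulated alpha
    for layer in layers:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    # Convert back to straight (non-premultiplied) color, avoiding division by zero.
    safe_a = np.where(out_a > 0, out_a, 1.0)
    return np.concatenate([out_rgb / safe_a, out_a], axis=-1)

# Opaque red background and a transparent foreground with one half-opaque blue pixel.
background = solid(2, 2, (1, 0, 0, 1))
foreground = solid(2, 2, (0, 0, 1, 0))
foreground[0, 0, 3] = 0.5

flat = composite_over([background, foreground])

# Edit only the foreground (shift it one pixel right), then recomposite.
moved = np.roll(foreground, 1, axis=1)
edited = composite_over([background, moved])
```

Because the background array is never modified, the edit touches only the pixels the moved layer covers, which is the practical benefit of working with layers instead of a flattened image.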

N3D-VLM - Open 3D Vision-Language Model

Native 3D spatial reasoning with open weights and code.
Understands depth and spatial relationships without 2D distortions.
GitHub | Model

Generative Refocusing - Open Depth Control

Controls depth of field in images with full code release.
Simulates camera focus changes through 3D scene inference.
Website | Demo | Paper | GitHub

StereoPilot - Open 2D to 3D Conversion

Converts 2D videos to stereo 3D with open model and code.
Full source release for VR content creation.
Website | Model | GitHub | Paper

Chatterbox Turbo - MIT Licensed TTS

State-of-the-art text-to-speech under the permissive MIT license.
No commercial restrictions or cloud dependencies.
Hugging Face

FunctionGemma - Open Function Calling

Lightweight 270M-parameter model for function calling with full weights.
A base for building specialized function-calling models without commercial restrictions.
Model
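
For readers new to the term, function calling means the model emits a structured request that the host program validates and executes. The sketch below shows only the host side, assuming the model output is a JSON object with `name` and `arguments` keys. That shape, the `get_weather` stub, and the `dispatch` helper are all hypothetical and are not FunctionGemma's actual output format or API.

```python
import json

def get_weather(city: str) -> str:
    """Stub tool; a real implementation would query a weather service."""
    return f"(stub) weather for {city}"

# Registry of functions the host is willing to run on the model's behalf.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str):
    """Parse an assumed JSON call and run the matching registered function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    return TOOLS[name](**call.get("arguments", {}))

# Hand-written stand-in for a model response.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

Keeping an explicit registry means the model can only trigger functions the host has chosen to expose, whatever text it generates.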

FoundationMotion - Open Motion Analysis

Labels spatial movement in videos with full code and dataset release.
Automatic motion pattern identification without manual annotation.
Paper | GitHub | Demo | Dataset

DeContext - Open Image Protection

Protects images from unwanted AI edits with open-source implementation.
Adds imperceptible perturbations that block manipulation while preserving quality.
Website | Paper | GitHub

EgoX - Open Perspective Transformation

Transforms third-person videos to first-person with full code release.
Maintains spatial coherence during viewpoint conversion.
Website | Paper | GitHub

Step-GUI - Open GUI Automation

State-of-the-art GUI automation with a self-evolving pipeline and open weights.
Full code and model release for interface control.
Paper | GitHub | Model

IC-Effect - Open Video Effects

Applies video effects through in-context learning with code release.
Learns effect patterns from examples without fine-tuning.
Website | GitHub | Paper
