Remote Jobs Ace

JetBrains

Research Engineer (LLM Training and Performance)

Amsterdam, Netherlands; Belgrade, Serbia; Berlin, Germany; Limassol, Cyprus; London, United Kingdom; Madrid, Spain; Munich, Germany; Paphos, Cyprus; Prague, Czech Republic; Warsaw, Poland; Yerevan, Armenia

Role brief

What this role is asking for.

At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.

We’re looking for a Research Engineer who will own the training stack and model architecture for our Mellum LLM family. Your job is easier said than done: make training faster, cheaper, and more stable at a large scale. You’ll profile, design, and implement changes to the training pipeline – from architecture to custom GPU kernels, as needed.

As part of our team, you will:
- Be responsible for improving end-to-end performance for multi-node LLM pre-training and post-training pipelines.
- Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc.
- Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
- Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
- Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8/TE, tensor/pipeline/sequence/expert parallelism, NCCL tuning.
- Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing,

Company role signals

JetBrains role signals.

Repeated tags across 95 active roles show the current hiring pattern.
Support · 77
ML / AI · 55
Observability · 30
Java · 25
Python · 20
APIs · 16
Sales · 16
Security · 15
AWS · 9
Data Engineering · 9