Join Synadia

At Synadia, we are pioneering a new way for digital systems to connect and communicate between cloud, on-premises, and edge, securely, in real time, and in any environment. We love open source software (OSS)! We maintain and lead the development of NATS - a next-generation distributed communications platform.

Distributed Systems | Performance and Reliability

Employment Type: Full-time

Level: Junior to Intermediate

Location: Remote

Job Summary

This job is not routine and requires creativity, critical thinking, expert troubleshooting, strong collaboration skills, and a desire to try something new. You will work primarily with a senior systems engineer on a long-term mission to improve the performance and reliability of the NATS ecosystem. NATS is natively flexible and composable, which gives it a large surface area. To deal with this complexity, we apply a holistic approach to identifying performance and consistency issues before users run into them.

Day to day, the job includes but is not limited to the following activities:

  • Design experiments to evaluate the system's runtime behavior in a variety of scenarios
  • Design benchmarks targeting specific sub-systems
  • Develop tools for testing and analysis (e.g., load generators, telemetry aggregators, results visualization)
  • Set up automation to catch errors and performance regressions proactively, and reproduce complex scenarios at will
  • Perform one-off deep-dive investigations to isolate the root cause of performance and reliability issues
  • Develop fault-injection techniques and tools to verify the system behaves correctly even when things go wrong (e.g., Jepsen, Chaos Monkey)
  • Leverage formal methods to test system correctness (e.g., execution trace analysis using Elle)
  • Design experiments that purposely abuse systems, simulating DDoS attacks and data exfiltration attempts

Job Requirements

  • Bachelor’s degree in Computer Science or equivalent
  • Passion for distributed systems and cloud infrastructure, specifically aspects of scalability, dependability, and fault tolerance
  • Strong Unix/Linux systems-level programming and troubleshooting skills
  • Understanding of protocols such as TCP/IP, UDP, TLS, and HTTP, and of the OSI model
  • Critical thinker, effective troubleshooter
  • Great communication and documentation skills
  • Proficiency in at least one of the following: Go, Python, Java, Rust, C/C++, Ruby

Preferred Qualifications

  • Familiarity with distributed systems literature (consensus protocols, consistency models, replication, etc.)
  • Familiarity with cloud infrastructure security
  • Experience with messaging technologies such as NATS, JMS, MQTT, AMQP, and Kafka
  • Experience with distributed tracing and monitoring solutions
  • Experience working with cloud providers such as AWS, Azure, or GCP