DebuggingLast Updated: 2021-02-02
Outside-In debugging for Distributed Systems
Grouparoo is a distributed system. A distributed system is an application that has dependencies on other systems like databases and load balancers, or operates in parallel with other instances of itself. Grouparoo does both!
The first step in debugging a distributed system is to determine where in the distributed system the problem is occurring. We’ll be using the term
Tier to describe each logical section of the distributed system. Grouparoo's Tiers can be visualized on the Topology Page. We need to understand the tiers because an error being reported in one tier may be caused by a problem in another tier or the connections between them.
A Grouparoo deployment has 4 main tiers and a few sub-tiers, depending on your deployment strategy:
- The Application Tier
- Web Servers
- Background Worker Servers
- The Data Tier
- Grouparoo Postgres Storage (Database)
- Grouparoo Redis Storage (Cache)
- Grouparoo File Storage (S3)
- Sources and Destinations Tier
- Remote Databases (like MySQL, Postgres, etc)
- Remote Services / APIs (like Mailchimp, Hubspot, etc)
- Networking Tier
- Load Balancers
- Firewalls and Permissions (IAM)
Grouparoo tiers may be deployed in both a vertically scaled and horizontally scaled way. Keeping in mind how each of your tiers has been scaled will be important for debugging. For example, are there multiple instances of each tier that may be in conflict and have different data? Or is there only one instance that you need to be very careful about accessing?
Vertical Scaling - Using a single server for a tier, but growing the capacity of that server is called Vertical Scaling. For example, there may be only a single database server that Grouparoo is using, but it may be scaled by increasing the RAM and Hard Drive space of that one server. This approach is a good choice for ephemeral storage (i.e. data loss might be inconvenient, but isn't necessarily devastating) like Redis. Vertical Scaling is also a good choice for the Grouparoo Postgres server if backups of it are made regularly.
Horizontal Scaling - Horizontal Scaling is when you copy a server and have more than one of the same type of instance in a tier running at the same time. This approach is how we recommend that the Grouparoo application tier be scaled- you don’t need particularly large servers but as your data volume grows, you’ll want to run more and more Web and Background Worker Servers.
A whole class of errors exist when there is an imbalance between the scale of your tiers. For example, if you have too many Application Tier servers for your Data Tier to handle, you’ll have very slow performance or even timeouts and connection errors.
It’s important to consider the connectivity between the tiers. Each tier may functioning OK on its own, but perhaps the issue is between tiers. Connection issues often manifest as networking or permission issues.
For example, can the Grouparoo Application tier reach your MySQL source? Have you configured a MySQL user for MySQL? What about opening the relevant firewall ports (or Security Group, IAM role, etc) so the Grouparoo Application Tier servers can reach the Database? What about reaching your Sources and Destinations? Can your Application Tier reach the public internet? Are your vendors having an outage at the moment?
Outside-In debugging starts at the highest level of the system and works in towards the center. Debug in this order:
Page or API route.
Are the servers for each tier up and running? Can you see that the application (Grouparoo app, database, etc) is up and running on each server? Test these directly if possible by SSH-ing into the machine or starting a console. Check application logs or your APM tool.
Can the Grouparoo Application reach the Database? Can users get through the Load Balancer to the Grouparoo Application behind it? Test these connections directly if possible by SSH-ing into one machine or starting a console and try to reach the other via the
Can you reach your sources and destinations via another method or are they having an outage? Have you changed API keys?
There are two types of "usage" to check - system and application.
System usage include RAM or Disk space, the resources that each server provides to the applications running on it. Test these resources directly on each system via logs,
du, and similar commands.
Application usage refers to the size of the data you are processing with Grouparoo. Are there more Profiles or Events than normal? Is your database filling up? Check these by connecting to each database in the Data Tier directly and reporting its usage metrics, like number of rows, slow queries, etc.
Has something you depend on changed recently that may have broken things? For Grouparoo, these kinds of changes usually look like changes to
package-lock.json - which is how Node.js applications store dependency versions.
However, there are other types of dependencies: operating system versions and packages, API versions of your sources and destinations, etc. Check to see if there are any new changes.
Can you break things in the same way multiple times? Reproducible bugs are possible to fix, and it’s important to explain why they happen. Does visiting a certain page crash things? Does importing or exporting a certain profile cause an error?
Sharing this information to reproduce the problem is important when reporting a bug.
The worst kind of problem is a non-reproducible error. The error is happening fairly regularly, but you don’t know why. More troubling, the error doesn’t happen all the time with the same behavior. Exhausting all the other possibilities on this list tells you that you need more information. Computers are deterministic... but you don’t know enough yet to determine what’s going on.
Logs and other forms of instrumentation now need to be added, and the application(s) will be re-deployed. Once more information is gathered, you can start the process over again.