Before developing Bitrix24, we defined the following requirements for its architecture:
- There are Free plan accounts, so we need to keep the cost of serving these accounts as low as possible.
- Bitrix24 is a business application, so the load on servers is uneven: higher at noon and lower at night. We therefore need a scalable architecture that uses exactly as many resources as are needed at any given time.
- At the same time, reliability is extremely important for any business app. Data must be secure and available at any moment.
- We started out in three different markets: the USA, Germany, and Russia.
These requirements defined two main goals: building a scalable, fault-tolerant cloud development platform and selecting a technology platform for the project's infrastructure.
Bitrix24 is built as a cluster of interchangeable web servers. If the load on servers increases, more servers can be added to a cluster in no time. If any server fails, clients won't feel it as everything keeps on working using other servers of the cluster.
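The idea of interchangeable servers can be illustrated with a minimal sketch. It assumes a simple round-robin policy; the class name `WebCluster` and the server names are hypothetical and are not taken from the Bitrix24 code base.

```python
class WebCluster:
    """Toy round-robin balancer over interchangeable web servers."""

    def __init__(self, servers):
        self._order = list(servers)
        self.healthy = {name: True for name in self._order}
        self._next = 0

    def add_server(self, name):
        # New servers join the rotation immediately.
        if name not in self.healthy:
            self._order.append(name)
        self.healthy[name] = True

    def mark_failed(self, name):
        self.healthy[name] = False

    def route(self):
        """Return the next healthy server, skipping failed ones."""
        for _ in range(len(self._order)):
            name = self._order[self._next % len(self._order)]
            self._next += 1
            if self.healthy[name]:
                return name
        raise RuntimeError("no healthy servers left")


cluster = WebCluster(["web-1", "web-2", "web-3"])
cluster.mark_failed("web-2")
print([cluster.route() for _ in range(4)])
```

Requests keep flowing to `web-1` and `web-3` while `web-2` is down, which is why a single failure is not visible to clients.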
Cloud storage support solves the problem of static content synchronization. Master-master replication in MySQL allows building geographically distributed web clusters.
We use Amazon Web Services (AWS), but other platforms may be used as well.
The application (web) is scaled horizontally (adding new machines), not vertically (increasing server capacity).
To do that, we use Elastic Load Balancing + CloudWatch + Auto Scaling. ELB (Elastic Load Balancing) automatically distributes incoming application traffic (HTTP and HTTPS) across the servers. Increases and decreases in load are tracked through CloudWatch.
When the load increases, new servers are turned on. When the load decreases, the additional servers are switched off automatically. This reduces our costs, because redundant servers don't sit idle.
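The scaling decision itself can be sketched as a small function. This is only an illustration of a threshold-based rule: the thresholds, the one-server step, and the minimum and maximum fleet sizes are assumed values, not the actual Bitrix24 Auto Scaling configuration.

```python
def desired_servers(current, avg_cpu, *, scale_out_at=60.0, scale_in_at=30.0,
                    min_servers=2, max_servers=20):
    """Return how many web servers should run for the observed average CPU load.

    All thresholds and limits are hypothetical defaults chosen for illustration.
    """
    if avg_cpu > scale_out_at:
        target = current + 1      # load is high: turn on one more server
    elif avg_cpu < scale_in_at:
        target = current - 1      # load is low: switch one server off
    else:
        target = current          # load is in the comfortable band
    return max(min_servers, min(max_servers, target))


# Simulated day: quiet night, busy noon, quiet evening.
servers = 2
for cpu in [15, 40, 75, 85, 70, 45, 20, 10]:
    servers = desired_servers(servers, cpu)
    print(f"avg CPU {cpu:>2}% -> {servers} servers")
```

The gap between the two thresholds keeps the fleet from flapping when the load hovers around a single value.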
When a client creates a new Bitrix24 Account, a personal Amazon S3 account is created to store that account's data. Thus, the data of each Bitrix24 Account are isolated from the data of every other account. The S3 storage itself is also designed to be secure.
Data in Amazon S3 are replicated to several locations, and these locations are geographically distributed (different data centers). Each storage device is monitored and quickly replaced if any malfunction is detected.
When you upload new files to the storage, you will get a notification about a successful upload only when the file is successfully saved at several different points. Typically, data are replicated to three or more devices to ensure fault tolerance, even if two of these devices have failed.
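The "acknowledge only after several copies exist" behavior can be modeled with a toy in-memory store. This is a sketch of the general idea, not of how S3 is implemented internally; the class `ReplicatedStore` and its parameters are hypothetical.

```python
class ReplicatedStore:
    """Toy store that confirms an upload only when enough copies can be saved."""

    def __init__(self, device_count=3, required_copies=3):
        self.devices = [dict() for _ in range(device_count)]
        self.offline = set()              # indexes of failed devices
        self.required_copies = required_copies

    def put(self, key, data):
        online = [i for i in range(len(self.devices)) if i not in self.offline]
        if len(online) < self.required_copies:
            return False                  # not enough devices: no success notification
        for i in online:
            self.devices[i][key] = data
        return True                       # every required copy has been saved

    def get(self, key):
        for i, device in enumerate(self.devices):
            if i not in self.offline and key in device:
                return device[key]
        raise KeyError(key)


store = ReplicatedStore()
print(store.put("report.pdf", b"..."))    # upload confirmed after three copies
store.offline.update({0, 1})              # two of the three devices fail
print(store.get("report.pdf"))            # the file is still readable
```

With three copies, the data survive the loss of any two devices, which matches the fault-tolerance claim above.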
The S3 architecture is designed so that Amazon can offer availability with two nines after the decimal point (99.99%), and the probability of data loss is one-billionth of a percent.
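To make these figures concrete, the arithmetic can be worked through in a few lines. The sketch reads "two nines after the decimal point" as 99.99% and assumes the loss probability applies per stored object per year; the source does not state the time period, so that part is an assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def yearly_downtime_minutes(availability_percent):
    """Maximum downtime per year that still meets the availability level."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def expected_objects_lost(object_count, loss_percent):
    """Expected number of lost objects for a given loss probability (in percent)."""
    return object_count * (loss_percent / 100)


print(round(yearly_downtime_minutes(99.99), 2))   # minutes per year
print(expected_objects_lost(10_000_000, 1e-9))    # objects per year (assumed period)
```

Under these assumptions, 99.99% availability allows roughly 53 minutes of downtime per year, and a store holding ten million objects would expect to lose a single object about once in ten thousand years.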
17 data centers and master-master replication
The entire project is hosted in 17 data centers located all over the world. Client data are stored in the countries where the law requires them to be stored. Thus, we solve two problems at once: we distribute the server load (for example, German users work in one DC and American users in another), and we provide a backup for all services: if one of the DCs fails, we simply switch the traffic to another.
The database in each DC is a master for a slave database in another DC and, at the same time, a slave of the master database in another DC.
Each Bitrix24 Account (all employees registered in it) at any given moment works with only one DC and one database. Switching to another DC is carried out only in case of any failure.
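The routing rule described here (one home DC per account, a switch only on failure) can be sketched as follows. The account names, DC names, and the `pick_data_center` helper are hypothetical and serve only to illustrate the rule.

```python
def pick_data_center(account, home_dc, backup_dc, healthy):
    """Return the DC that should serve the account right now.

    home_dc:   account -> its normal data center
    backup_dc: data center -> the data center that takes over if it fails
    healthy:   set of data centers that are currently up
    """
    home = home_dc[account]
    if home in healthy:
        return home                       # normal case: always the same DC
    backup = backup_dc[home]
    if backup in healthy:
        return backup                     # failover case
    raise RuntimeError(f"no data center available for {account}")


home_dc = {"acme": "dc-a", "globex": "dc-b"}
backup_dc = {"dc-a": "dc-b", "dc-b": "dc-a"}

print(pick_data_center("acme", home_dc, backup_dc, healthy={"dc-a", "dc-b"}))
print(pick_data_center("acme", home_dc, backup_dc, healthy={"dc-b"}))
```

Because all employees of an account are routed by the same rule, they always work against the same database.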
Databases in different DCs are kept in sync but operate independently of each other. The connectivity between data centers can be lost for several hours. In such cases, data are synchronized after connectivity is restored.
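The catch-up after a connectivity loss can be modeled with two toy databases that queue their local changes and exchange them once the link is back. This is a simplified sketch, not MySQL's replication mechanism; the `Database` class and `sync` function are hypothetical. It relies on the rule that each account writes to only one DC, so the two sides never change the same key.

```python
class Database:
    """Toy database that remembers changes not yet delivered to its peer."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.outbox = []                  # local changes waiting for replication

    def write(self, key, value):
        self.rows[key] = value
        self.outbox.append((key, value))


def sync(a, b, link_up):
    """Exchange queued changes in both directions; return how many were delivered."""
    if not link_up:
        return 0                          # link is down: changes stay queued
    delivered = 0
    for source, target in ((a, b), (b, a)):
        for key, value in source.outbox:
            target.rows[key] = value      # applied directly, so it is not re-queued
            delivered += 1
        source.outbox.clear()
    return delivered


dc_a, dc_b = Database("dc-a"), Database("dc-b")
dc_a.write("acme:task:1", "Prepare report")
dc_b.write("globex:task:1", "Call client")
print(sync(dc_a, dc_b, link_up=False), dc_a.rows == dc_b.rows)
print(sync(dc_a, dc_b, link_up=True), dc_a.rows == dc_b.rows)
```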
We also use master-master replication. If the database server crashes or reboots, clients are immediately switched to another server.
Reliability and fault tolerance
One of the top priorities at Bitrix24 is the constant availability of the service and its fault tolerance.
If one or several web nodes fail, Load Balancing detects the failed machines, and the required number of instances is automatically restored based on the specified parameters (the minimum required number of running machines).
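Restoring the minimum number of running machines can be sketched as a small reconciliation step. The function name, the health-check callback, and the `launch` callback are hypothetical; in the setup described above this work is done by the AWS services, not by custom code.

```python
def restore_capacity(instances, is_healthy, min_running, launch):
    """Drop failed instances and launch replacements up to the required minimum."""
    alive = [name for name in instances if is_healthy(name)]
    missing = max(0, min_running - len(alive))
    replacements = [launch() for _ in range(missing)]
    return alive + replacements


launched = []


def launch():
    # Stand-in for starting a new web server from a prepared image.
    name = f"web-new-{len(launched) + 1}"
    launched.append(name)
    return name


fleet = ["web-1", "web-2", "web-3"]
fleet = restore_capacity(fleet, lambda name: name != "web-2", min_running=3, launch=launch)
print(fleet)
```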
If the connectivity between data centers is lost, each data center continues to serve its segment of customers. After the connectivity is recovered, the data in the databases are automatically synchronized.
If a data center fails completely, all traffic is automatically switched to another data center.
If this causes an increased load on the machines, CloudWatch detects the increased CPU utilization and adds the needed number of machines in the data center that took over the traffic, according to the Auto Scaling rules.
At the same time, master-master replication is paused. After the necessary work is carried out, the database is brought back online and replication is restored.
Once everything is back to normal, the traffic is again distributed between the data centers. When the average load falls below the threshold value, the extra machines that were turned on to handle the increased load are automatically stopped.
Updating functionality and system software is a big problem for cloud services. Sometimes they are forced to turn off the service temporarily, warn users, and carry out the work at night. Our architecture allows us to make updates so that users don't even notice.
WebRTC technology: calls, video calls, telephony
Video calls in Bitrix24 are private. Reliable video conversations within the company are based on WebRTC technology. The connection is encrypted, calls are made peer-to-peer between users, and the process is almost transparent because it takes place inside the browser.
Signaling performs three simple tasks:
- Exchange the configurations of the two browsers (audio/video streams, codecs, addresses, and ports, in SDP format).
- Exchange passwords to establish an encrypted connection between browsers.
- Initiate actions: call somebody (connect the stream of client A to the stream of client B via JavaScript callbacks), end a call, and so on.
Browsers connect to each other using Signaling, which allows you to make video calls.
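The role of signaling can be shown with a toy message relay. In a real deployment the browsers talk to a signaling server from JavaScript; this Python sketch only models the message flow. The class `SignalingHub`, the message types, and the placeholder SDP strings are all hypothetical.

```python
class SignalingHub:
    """Toy relay that passes call setup messages between registered users."""

    ALLOWED = {"offer", "answer", "hangup"}

    def __init__(self):
        self.inboxes = {}

    def register(self, user):
        self.inboxes[user] = []

    def send(self, sender, recipient, kind, payload=None):
        if kind not in self.ALLOWED:
            raise ValueError(f"unknown message type: {kind}")
        self.inboxes[recipient].append(
            {"from": sender, "type": kind, "payload": payload}
        )

    def receive(self, user):
        """Return and clear all messages waiting for the user."""
        messages, self.inboxes[user] = self.inboxes[user], []
        return messages


hub = SignalingHub()
hub.register("alice")
hub.register("bob")

# Alice calls Bob: her browser's configuration travels as an SDP offer.
hub.send("alice", "bob", "offer", payload="<SDP offer placeholder>")
offer = hub.receive("bob")[0]

# Bob accepts: his browser replies with an SDP answer.
hub.send("bob", offer["from"], "answer", payload="<SDP answer placeholder>")
print(hub.receive("alice"))

# Either side can end the call.
hub.send("alice", "bob", "hangup")
```

Note that only the setup messages pass through the relay; the media streams themselves flow between the browsers.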
Calling is easy when both users are in the same local network. But when users are in different networks behind configured firewalls, browsers cannot establish a connection without help:
- To pass the company's firewalls, users access the central server using STUN/TURN protocols.
- If it's impossible to pass the firewall, the media streams go through a third-party server, not peer-to-peer between browsers (in "relay" mode).
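The preference for a direct connection over a relay can be illustrated with the candidate priority formula from the ICE specification (RFC 8445), which WebRTC uses together with STUN and TURN. The source text does not mention ICE, so treat this as background; the type preference values are the ones recommended by the RFC, and the function name is hypothetical.

```python
# Recommended type preferences from RFC 8445: a higher value is tried first.
TYPE_PREFERENCE = {
    "host": 126,    # address on the local network
    "prflx": 110,   # peer reflexive, discovered during connectivity checks
    "srflx": 100,   # server reflexive, public address learned via STUN
    "relay": 0,     # media relayed through a TURN server
}


def candidate_priority(kind, local_preference=65535, component_id=1):
    """Priority of an ICE candidate: type preference dominates the result."""
    return (
        (TYPE_PREFERENCE[kind] << 24)
        + (local_preference << 8)
        + (256 - component_id)
    )


for kind in sorted(TYPE_PREFERENCE, key=candidate_priority, reverse=True):
    print(f"{kind:<6} {candidate_priority(kind)}")
```

Relayed candidates get the lowest priority, so the "relay" mode is used only when no more direct path works.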
Group video calls
In a group video call, each browser holds a WebRTC video stream for every participant.
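If this is read as a full mesh, where every browser connects directly to every other browser (an assumption, since the text does not name the topology), the number of streams grows quickly with the number of participants. A short calculation shows the effect; the function name is hypothetical.

```python
def mesh_load(participants):
    """Return (remote streams per browser, total connections) for a full mesh."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    streams_per_browser = participants - 1
    total_connections = participants * (participants - 1) // 2
    return streams_per_browser, total_connections


for n in (2, 4, 8):
    per_browser, total = mesh_load(n)
    print(f"{n} participants: {per_browser} remote streams per browser, "
          f"{total} connections in total")
```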
WebRTC and telephony
Bitrix24 integrates with "gateways" to make calls to regular phone numbers from/to the company.
Bitrix24 uses a unique technology to enhance service performance: it combines high-speed loading of static data with background preparation of dynamic data.
Pages are divided into two sections: static and dynamic. The static part is cached and is displayed immediately. The dynamic part is loaded using background processing and cached in the browser.
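The split between an instantly shown static part and a background-loaded dynamic part can be modeled in a few lines. In practice this happens in the browser; the Python sketch below only mirrors the sequence of steps, and the class `CompositePage` and its methods are hypothetical.

```python
import threading


class CompositePage:
    """Toy page: static part shown at once, dynamic part loaded in the background."""

    def __init__(self, static_html, load_dynamic):
        self.static_html = static_html    # cached static part
        self.load_dynamic = load_dynamic  # slow function that builds the dynamic part
        self.dynamic_cache = None         # last dynamic part that was loaded
        self._worker = None

    def open(self):
        """Return what can be displayed immediately and start a background refresh."""
        shown = self.static_html + (self.dynamic_cache or "")
        self._worker = threading.Thread(target=self._refresh)
        self._worker.start()
        return shown

    def _refresh(self):
        self.dynamic_cache = self.load_dynamic()

    def wait(self):
        """Block until the background refresh has finished."""
        if self._worker is not None:
            self._worker.join()


page = CompositePage("<header/>", lambda: "<feed/>")
print(page.open())   # first visit: only the static part is available
page.wait()
print(page.open())   # next visit: the cached dynamic part is shown as well
page.wait()
```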
The Bitrix24 API is open for:
- Social network groups (workgroups, projects)
- Data storage (information blocks)
- Notifications and Activity Stream
- Users and Departments
- Chat Bots and Open Channels
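As an illustration of how such an API might be addressed, the sketch below builds a request URL without sending anything. The URL layout (an inbound webhook of the form `/rest/<user id>/<code>/<method>.json`) and the method name `user.get` are assumptions that should be checked against the official Bitrix24 REST documentation; the portal address and webhook code are placeholders.

```python
from urllib.parse import urlencode


def build_rest_url(portal, user_id, webhook_code, method, params=None):
    """Build a REST call URL under the assumed inbound-webhook layout."""
    url = f"https://{portal}/rest/{user_id}/{webhook_code}/{method}.json"
    if params:
        url += "?" + urlencode(params)
    return url


# Placeholder portal and webhook code; "user.get" is an assumed method name.
print(build_rest_url("example.bitrix24.com", 1, "WEBHOOK_CODE", "user.get", {"ID": 5}))
```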