1、 Background
(1) Characteristics of live streaming service
Internet live video streaming is a form of media in which content is produced and consumed in real time. Over the years it has grown into several business forms, including entertainment shows, gaming, e-commerce, and sports. Its main characteristics are: content is generated and consumed in real time, so timeliness requirements are high; streaming media consumes a large amount of bandwidth and places stringent demands on network quality; and one person produces while many consume, so the overall bandwidth scale is large. A live streaming CDN is currently the most effective technical approach to this large-scale distribution scenario. It serves users from nearby nodes to provide a good access network environment, and it aggregates traffic across multiple layers to reduce the distribution pressure on central resources, thereby meeting the scale and timeliness requirements of live streaming services.
(2) Difficulties faced by live CDN
Because live streaming services demand both quality and timeliness, a CDN must find and establish a complete, reliable transmission link within a short time, and the link must remain reasonably stable. The traditional strategy, which is based on link quality data and updated out of band (bypass updates), is simple, adds little time to the request path, and guarantees timeliness. However, as cost pressure grows and availability requirements rise, its disadvantages have gradually become apparent:
First, because the strategy is updated out of band, it lags by about 2-3 minutes and cannot adjust quickly to sporadic quality degradation or skewed resource usage;
Second, it relies on a single unified set of link quality data and cannot take resource energy efficiency into account. For content with low energy efficiency, the routing strategy cannot be adjusted on cost grounds, so floating costs are not controlled precisely enough.
(3) Solution
Using multi-dimensional data such as resource information, link status, and streaming media information, the system calculates the distribution efficiency of each stream, so the calculation granularity reaches the stream level. The key to solving the difficulties above is to dynamically compute the access and back-to-source strategy for each stream from a combined evaluation of energy efficiency and quality.
The scheduling system controls live streaming access and routing by collecting and combining resource information, link status, streaming media information, and so on, and it adjusts strategies quickly and accurately. It offers several strategies, such as quality first, cost first, and quality-cost balance, and so gives more precise, finer-grained control for improving the availability and the energy efficiency ratio of the distribution system under the quality indicator evaluation system.
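As a rough illustration of how the three strategies could differ, the sketch below scores candidate nodes for one stream with a weighted sum of quality and energy efficiency. The `Candidate` fields, the normalization to 0..1, the weights, and the node names are all assumptions made for the example; the article does not specify the actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node: str
    quality: float     # assumed normalized to 0..1, higher is better
    efficiency: float  # assumed normalized to 0..1, higher is cheaper per unit served


# Hypothetical (quality weight, efficiency weight) per strategy.
STRATEGY_WEIGHTS = {
    "quality_first": (1.0, 0.0),
    "cost_first": (0.0, 1.0),
    "balanced": (0.5, 0.5),
}


def score(candidate: Candidate, strategy: str) -> float:
    """Weighted sum of quality and efficiency for the chosen strategy."""
    w_quality, w_efficiency = STRATEGY_WEIGHTS[strategy]
    return w_quality * candidate.quality + w_efficiency * candidate.efficiency


def pick_node(candidates: list[Candidate], strategy: str) -> str:
    """Return the name of the highest-scoring node for one stream."""
    return max(candidates, key=lambda c: score(c, strategy)).node


candidates = [
    Candidate("edge-a", quality=0.95, efficiency=0.40),
    Candidate("edge-b", quality=0.70, efficiency=0.90),
    Candidate("edge-c", quality=0.85, efficiency=0.80),
]
```

With these made-up numbers, `quality_first` selects `edge-a`, `cost_first` selects `edge-b`, and `balanced` selects `edge-c`, which shows how the same inputs can yield a different plan per strategy.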
Content platform: The core goal of a content platform is to improve quality, because only higher quality can guarantee traffic. A fine-grained scheduling system enables accurate access, improves the accuracy of network coverage, handles short-term failures quickly, and reduces the risk of traffic loss.
CDN: The main goals of a CDN are to improve quality and reduce costs. The scheduling system can precisely control which access network is used to improve access quality, schedule traffic at a fine granularity to raise resource reuse and reduce floating costs, and guide construction planning to reduce fixed costs.
2、 Main issues and challenges
(1) Timeliness requirements
Access scheduling: the service blocking duration must not exceed 50ms.
Routing scheduling: the blocking duration over the entire path must not exceed 50ms.
Streaming media information: the synchronization delay must not exceed 100ms.
Device information and network quality: the synchronization delay must not exceed 10s.
① Scheduling delay control
The scheduling delay must not exceed 50ms. Given the transmission delay of the public network itself, there is essentially no time left for calls to other systems or for computation. The response strategy therefore has to be prepared in advance, and the scheduling access point should be as close to the caller as possible. Three functions are designed for this: policy push, policy caching, and asynchronous update.
Policy push: after the resource scheduling system generates a scheduling policy, it pushes the policy directly to the access layer. The access layer does not call other systems; it returns the pushed scheduling plan directly to the business side, so it adds no business processing delay.
Policy caching: when the access layer receives a pushed scheduling plan, it caches the plan in memory. The local cache is never written to disk, only a push or an asynchronous update refreshes it, and scheduling requests are answered directly from cached data.
Asynchronous update: the access layer also requests scheduling data from the resource scheduler actively and at regular intervals, which guards against push failures.
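The interplay of push, in-memory cache, and asynchronous update can be sketched as follows. This is a minimal illustration, not the system's implementation: the `PolicyCache` class, the `fetch_policy` callback, and the use of a monotonically increasing version number to discard stale updates are all assumptions.

```python
import threading


class PolicyCache:
    """In-memory scheduling plan cache for an access-layer instance (sketch)."""

    def __init__(self, fetch_policy, refresh_interval_s=30.0):
        # fetch_policy is a hypothetical callable returning (version, plans).
        self._fetch_policy = fetch_policy
        self._refresh_interval_s = refresh_interval_s
        self._lock = threading.Lock()
        self._plans = {}
        self._version = -1
        self._stop = threading.Event()

    def on_push(self, version, plans):
        """Called when the resource scheduler pushes a new plan."""
        self._update(version, plans)

    def refresh_once(self):
        """Asynchronous update path: pull the plan in case a push was lost."""
        version, plans = self._fetch_policy()
        self._update(version, plans)

    def _update(self, version, plans):
        with self._lock:
            if version > self._version:  # ignore stale or duplicate plans
                self._version = version
                self._plans = dict(plans)

    def lookup(self, stream_id):
        """Request path: answered from memory only, no downstream calls."""
        with self._lock:
            return self._plans.get(stream_id)

    def start_background_refresh(self):
        def loop():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._refresh_interval_s):
                try:
                    self.refresh_once()
                except Exception:
                    pass  # keep serving the cached plan if the pull fails
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

The request path (`lookup`) never blocks on another system, which is what keeps the scheduling delay within budget; pushes and periodic pulls only ever write to the cache.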
②Information synchronization delay control
Because the synchronization delay for streaming media information must stay within 100ms, and public network transmission already consumes part of that budget, periodic collection and reporting cannot meet the requirement. Data is therefore synchronized through event-triggered, real-time API reporting. Device information and node information have looser timeliness requirements and are retrieved through interface calls, with a task assignment mechanism that prevents the same data from being fetched repeatedly.
Event triggering: reporting fires on the edge of events such as start and stop, that is, at the moment the state changes, and the API is invoked synchronously to send the streaming media information, which keeps its synchronization timely.
Direct API connection: the API call triggers back-end business processing directly without passing through middleware, which saves the middleware's processing latency.
Task assignment: data query tasks are distributed to different service instances through MQ, and each instance retrieves the data for the tasks it claims.
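The task assignment idea can be illustrated with a small sketch in which an in-process `queue.Queue` stands in for the MQ and threads stand in for service instances. The function name `run_instances`, the `fetch` callback, and the task names are hypothetical; a real deployment would use the MQ's own consumer semantics.

```python
import queue
import threading


def run_instances(tasks, fetch, num_instances=3):
    """Distribute query tasks so that each one is fetched exactly once."""
    mq = queue.Queue()  # stand-in for the real message queue
    for task in tasks:
        mq.put(task)

    results = {}
    claimed_by = {}
    lock = threading.Lock()

    def instance(name):
        while True:
            try:
                task = mq.get_nowait()  # claiming removes the task from the queue
            except queue.Empty:
                return
            data = fetch(task)  # only the claiming instance queries the source
            with lock:
                results[task] = data
                claimed_by.setdefault(task, []).append(name)

    threads = [
        threading.Thread(target=instance, args=(f"instance-{i}",))
        for i in range(num_instances)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, claimed_by
```

Because a task leaves the queue when it is claimed, no two instances retrieve the same device or node data, which is the repeated-retrieval problem the mechanism is meant to avoid.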
(2) Availability requirements
A. Customers do not affect one another
B. Back-to-source scheduling and access scheduling do not affect each other
C. A degradation strategy provides a guarantee when anomalies occur
1. Interface availability
①Isolation
- User isolation
Customers are generally identified by ID, and some major customers have their own access domain names. Independent computing resources are deployed per ID and per domain name so that traffic from an individual customer cannot affect all customers.
To balance cost and availability, key customers are served not only by their independently deployed resources but also by the corresponding functions deployed in the regular cluster, which gives them active and standby protection.
- Business isolation
The business parties behind back-to-source scheduling and access scheduling differ in how quickly they need scheduling responses, in how they handle exceptions, and in the impact scope and benefits at stake when scheduling fails. The content platform and the CDN are therefore isolated by business party and served by different access instances, which makes it easy to scale a single service and limits the scope of service exceptions.
②Rate limiting
When the system load is too high, business traffic into the system must be limited to protect its services, reduce the load, and speed up recovery.
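The article does not say which limiting algorithm is used. One common choice is a token bucket, sketched below under that assumption; the class name, the parameters, and the injectable `clock` (used only to make the behavior easy to check) are illustrative.

```python
import time


class TokenBucket:
    """Token-bucket limiter: at most `burst` requests at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request for which `allow()` returns `False` would be rejected immediately or answered with the default plan, so excess traffic never reaches the loaded service.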
③Circuit breaking
- Concurrency circuit breaking
The system contains many interface call scenarios, for example unified access calling the resource scheduling interface to obtain scheduling plans, and resource scheduling calling information collection to obtain basic data. To keep back-end services stable and prevent them from being overwhelmed by sudden traffic spikes, the concurrency of calls to the back end is circuit-broken: once the rated concurrency is exceeded, no further calls to the back-end interface are allowed and the monitoring system raises an alarm. The front-end business tolerates a certain number of failed requests according to its fault handling mechanism; beyond that tolerance, it falls back to the default strategy.
- Failure-rate circuit breaking
Back-end services may become temporarily unavailable. Reducing the number of requests sent to a back-end service while it is unavailable helps it recover faster. When the failure rate of calls to a back-end interface rises above the threshold, the interface is not called for a set period. Once that period has elapsed, the availability of the back-end service is probed repeatedly until the service recovers.
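The two circuit-breaking (fusing) rules above can be combined in one small wrapper, sketched here. All names (`Breaker`, `BreakerOpen`), the sliding window of recent outcomes, and the probe behavior after the cooldown are assumptions for illustration; the article gives no thresholds or implementation details.

```python
import threading
import time
from collections import deque


class BreakerOpen(Exception):
    """Raised when a call is rejected without reaching the back end."""


class Breaker:
    def __init__(self, max_concurrency, failure_threshold, window, cooldown_s,
                 clock=time.monotonic):
        self.max_concurrency = max_concurrency
        self.failure_threshold = failure_threshold  # e.g. 0.5 means 50%
        self.outcomes = deque(maxlen=window)        # recent True/False results
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.in_flight = 0
        self.opened_at = None                       # None means closed
        self.lock = threading.Lock()

    def call(self, fn):
        with self.lock:
            if (self.opened_at is not None
                    and self.clock() - self.opened_at < self.cooldown_s):
                raise BreakerOpen("failure rate too high")
            # Past the cooldown, calls are let through as availability probes.
            if self.in_flight >= self.max_concurrency:
                raise BreakerOpen("concurrency limit reached")
            self.in_flight += 1
        try:
            result = fn()
        except Exception:
            self._record(False)
            raise
        else:
            self._record(True)
            return result
        finally:
            with self.lock:
                self.in_flight -= 1

    def _record(self, ok):
        with self.lock:
            if self.opened_at is not None:
                # A call finishing while open is treated as a probe result.
                if ok:
                    self.opened_at = None
                    self.outcomes.clear()
                else:
                    self.opened_at = self.clock()  # restart the cooldown
                return
            self.outcomes.append(ok)
            if len(self.outcomes) == self.outcomes.maxlen:
                rate = self.outcomes.count(False) / len(self.outcomes)
                if rate > self.failure_threshold:
                    self.opened_at = self.clock()
```

A caller that catches `BreakerOpen` would count it against its own fault tolerance and, once that tolerance is exhausted, switch to the default strategy.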
2. Degradation strategy
When the access or back-to-source scheduling service is unavailable, the system degrades to the default fallback policy: access degrades to DNS resolution, and back-to-source degrades to the CDN's fixed back-to-source policy, so neither relies on the scheduling system to choose a policy any longer.
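The degradation path amounts to "try the scheduler, otherwise use a default that needs no scheduler". A minimal sketch follows, with a hypothetical `schedule_with_fallback` helper and placeholder host names.

```python
def schedule_with_fallback(stream_id, scheduler, fallback):
    """Return (target, mode); use the fallback if the scheduler fails or has no plan."""
    try:
        target = scheduler(stream_id)
    except Exception:
        target = None
    if target is not None:
        return target, "scheduled"
    return fallback(stream_id), "degraded"


def broken_scheduler(stream_id):
    # Simulates an unavailable scheduling system.
    raise TimeoutError("scheduler unavailable")


# Access: degrade to whatever DNS resolution would return (placeholder value).
access = schedule_with_fallback(
    "stream-1", broken_scheduler, lambda _id: "edge.example.com"
)

# Back-to-source: degrade to a fixed upstream configured in the CDN (placeholder value).
upstream = schedule_with_fallback(
    "stream-1", broken_scheduler, lambda _id: "origin.example.com"
)
```

The same wrapper serves both business modes; only the fallback differs between access and back-to-source.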
3、Scheduling system architecture
(1) Business architecture
The scheduling system is divided into five parts: unified access, operation management platform, resource scheduling, information collection, and log system.
- Unified access: Provides centralized, standard access for content platforms. For performance and latency reasons, it also provides access through proxies deployed down at the edge instances and second-level source instances of the CDN.
- Operation management platform: As the human-facing interface, it mainly provides configuration capabilities and data dashboards.
- Resource scheduling: As the core unit of the scheduling system, it outputs scheduling plans from multiple input conditions for both business modes, access and back-to-source.
- Information collection: Serves as the data foundation and supplies resource scheduling with the inputs it needs, such as quality, capability, and location.
- Log system: Records scheduling information in chronological order, mainly for reviewing and evaluating scheduling policies.
(2)Information collection system
The information collection system is the data foundation of the scheduling system. Its main function is to collect equipment resource information from the operation and maintenance system and streaming media information from the business system, integrate the data, and provide it to resource scheduling.
It collects equipment operation data by actively calling the operation and maintenance system interface at regular intervals. This data includes CPU usage, memory usage, disk IO, and network IO, and is used to evaluate the service capability of the equipment. It also collects information such as node bandwidth usage to evaluate node carrying capacity.
It collects link quality data, including RTT and packet loss rate, by actively calling the monitoring system interface at regular intervals, in order to evaluate network quality.
It passively receives reports from CDN business instances on streaming media resource location, downlink concurrency, congestion rate, and similar information, which are used to evaluate service quality and service revenue.
Once collected, this information is classified and integrated into aggregated data on service capability, service quality, and service revenue along dimensions such as node, region, operator, and business form.
The aggregated data is finally provided to the resource scheduling system through a query interface.
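The classify-and-integrate step can be pictured as grouping raw samples by a chosen dimension and reducing each group to summary figures. The sketch below is illustrative only: the `Sample` fields, the chosen summaries, and the sample values are assumptions, not the system's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    node: str
    region: str
    operator: str
    rtt_ms: float
    bandwidth_used_mbps: float


def aggregate(samples, key):
    """Group samples by key(sample) and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    return {
        group_key: {
            "avg_rtt_ms": sum(s.rtt_ms for s in group) / len(group),
            "bandwidth_used_mbps": sum(s.bandwidth_used_mbps for s in group),
            "nodes": len({s.node for s in group}),
        }
        for group_key, group in groups.items()
    }


samples = [
    Sample("n1", "east", "isp-a", rtt_ms=20, bandwidth_used_mbps=100),
    Sample("n2", "east", "isp-a", rtt_ms=40, bandwidth_used_mbps=300),
    Sample("n3", "west", "isp-b", rtt_ms=30, bandwidth_used_mbps=200),
]

by_region = aggregate(samples, key=lambda s: s.region)
by_region_operator = aggregate(samples, key=lambda s: (s.region, s.operator))
```

Passing a different `key` function yields the same summaries per node, per operator, or per any combination of dimensions, which is the shape of data a query interface could then expose to resource scheduling.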
(3) Resource scheduling system
As the core business module of the scheduling system, resource scheduling obtains the necessary scheduling inputs from information collection, produces scheduling plans through a set of scheduling strategies, and provides them to the access and back-to-source services.
The operation management platform sends customized scheduling configuration to the resource scheduling system. Resource scheduling queries the information collection interface for the required service capability, service quality, service revenue, and other information, and generates static scheduling plans by matching them against the scheduling policies.
It also queries the information collection interface for the location and description of streaming media resources, and generates dynamic scheduling plans by matching scheduling policies.
The final scheduling plans are provided to business parties through an interface.
(4) Technical architecture
(5) Deployment Plan