Parsimon
Fast Flow-level Network Simulation
Task #1 Reading & Writing:
1.1. Read https://arxiv.org/pdf/2205.01234.pdf
You can see my reading traces in the annotations of the paper.
1.2. Conclude the key idea of the paper, explain the algorithm, and analyze its sources of speedup and sources of error.
1.2.1 Key Idea & The Algorithm
Assumptions: 1. In data centers, congestion is usually not simultaneous across the network, but scattered and bursty. 2. Switch queue states approximately follow queueing-theoretic results and depend only on the traffic passing through them, so each queue can be analyzed independently.
Therefore, we can model the congestion delay at each link along a path. To approximate the distribution of end-to-end flow performance for a given workload on a large network, we aggregate per-link congestion delay distributions into an overall performance distribution.
Concretely, the algorithm proceeds as follows (somewhat simplified, since the mathematical formulation below covers the details):
- Decomposition:
- Generate a per-link workload containing all workflows that traverse the link.
- Generate a link-level topology for each link.
- Clustering: based on each link's workload, group links with similar workloads into clusters.
- Simulation: sample one link from each cluster as its representative, run a link-level simulation of that link and its workload, and obtain a bucketed distribution of link delay keyed by flow size; every other link in the cluster reuses the representative's distribution.
- Aggregation: given a workflow with a size and a path, for each link on the path, query the link's bucketed distribution with the flow size to find the matching bucket, then Monte Carlo sample one delay from that bucket as the link's delay. Summing the delays of all links on the path yields the workflow's delay. Repeating this many times yields the flow-level distribution.
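The four stages above can be sketched end to end. This is a minimal toy sketch: `simulate_link`, the clustering signature, and all traffic values are stand-ins for illustration, not the paper's actual implementations.

```python
import random
from collections import defaultdict

# Toy flows: a size in packets and a path given as a list of directed links.
flows = [
    {"size": 10, "path": [("h1", "s1"), ("s1", "s2"), ("s2", "h2")]},
    {"size": 80, "path": [("h3", "s1"), ("s1", "s2"), ("s2", "h4")]},
]

# 1. Decomposition: per-link workloads (all flows crossing each link).
workloads = defaultdict(list)
for f in flows:
    for link in f["path"]:
        workloads[link].append(f)

# 2. Clustering: a toy signature (flow count) stands in for the paper's
#    workload-similarity clustering.
clusters = defaultdict(list)
for link, wl in workloads.items():
    clusters[len(wl)].append(link)

# 3. Simulation: one representative per cluster; a stand-in "simulator"
#    returns per-flow delay samples instead of running a real simulation.
def simulate_link(workload):  # placeholder for link-level simulation
    return [random.uniform(0.0, 0.1) for _ in workload]

delay_dist = {}
for sig, links in clusters.items():
    samples = simulate_link(workloads[links[0]])  # representative link
    for link in links:
        delay_dist[link] = samples  # other links reuse its distribution

# 4. Aggregation: Monte Carlo sample one delay per link and sum along the path.
def estimate_delay(flow):
    return sum(random.choice(delay_dist[link]) for link in flow["path"])

est = estimate_delay(flows[0])
```

In a real run, stage 3 is the expensive part and is embarrassingly parallel across clusters, which is where the speedup analyzed below comes from.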
1.2.2. Sources of Speedup
- It decomposes the flow-level simulation into independent link-level simulations that can run in parallel, so simulation scales with both network size and core count.
- When building link-level topologies, it uses a flattening technique that bounds every topology at 3 hops, shrinking the scale of each simulation.
- It does not run a link-level simulation for every link; instead, it clusters links and simulates only one representative per cluster.
- During aggregation, it does not convolve all the per-link distributions along a path; instead, it samples one point from each and sums them.
1.2.3. Sources of Error
- Bottleneck fan-in: when constructing a link-level topology, Parsimon connects potential source hosts directly to the target link, without modeling the actual upstream network. Congestion and queueing in the simulation are therefore heavier than in reality, since there are no real upstream links to smooth bursty flows. This also slightly overestimates the delay contributed downstream.
- Lack of traffic smoothing: in a real network, cross-traffic headed for the target link is smoothed upstream; since the link-level topology contains no cross-traffic, the simulation slightly overestimates queueing delay at the target link.
- Link-level independence: Parsimon assumes link-level simulations can be run independently, which ignores traffic correlations across the hops of a flow's path and introduces error.
- One bottleneck at a time: Parsimon is more accurate when congestion is intermittent and transient, appearing on different links at different times; it is less accurate when congestion spans multiple edge and core links of a given path, which leads it to overestimate long-flow end-to-end latency.
- Clustering: because simulation uses clustering, there is error between each link's actual delay distribution and its representative link's delay distribution.
- Sampling: during aggregation, a single point is sampled from each link-level delay distribution instead of convolving the full distributions along the path.
1.3. How would you utilize machine learning algorithms to make flow-level network simulation faster and more precise?
- When processing the link-level distributions, apply a machine learning algorithm (perhaps an MLP) to train, for each link, a function that maps workflow information to delay. When aggregating for a given workflow, query this function to obtain the workflow's delay on the corresponding link.
- Improve the clustering algorithm, for instance with K-means.
Task #2 Math formulation:
In section 3, the paper shows how Parsimon decomposes the network simulation into a series of link-level simulations, and how it aggregates the result, please formulate this process. (You can summarize the fast link-level simulation algorithm in section 4.1 as a function without further explanation. You are also allowed to ignore some of the technique details in your formulation)
\(\text{Given network topology as undirected graph }G=(V,E).\)
\(\text{Given workflow routes }WF.\)
2.1. Generating Link-Level Workloads
\(\forall link \in E,(u,v)=link.nodes:\)
\(WL_{<u,v>}=\{wf\mid wf \in WF \wedge (\exists i<j\text{, s.t. }wf.nodes[i]=u\wedge wf.nodes[j]=v)\},\)
\(WL_{<v,u>}=\{wf\mid wf \in WF \wedge (\exists i>j\text{, s.t. }wf.nodes[i]=u\wedge wf.nodes[j]=v)\}.\)
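The two directed workload sets can be computed directly from the routes. A sketch: workflows are represented as ordered node lists, and since a route follows physical links, consecutive node pairs are exactly the directed links the workflow crosses.

```python
from collections import defaultdict

# Each workflow is a route: an ordered list of the nodes it traverses.
workflows = [
    ["h1", "s1", "s2", "h2"],
    ["h2", "s2", "s1", "h1"],
    ["h3", "s1", "s2", "h4"],
]

# WL[(u, v)] = workflows that cross the directed link <u, v>, i.e. visit
# u immediately before v along their route.
WL = defaultdict(list)
for wf in workflows:
    for u, v in zip(wf, wf[1:]):
        WL[(u, v)].append(wf)
```

Note that each direction of a physical link gets its own workload, matching the \(WL_{<u,v>}\) / \(WL_{<v,u>}\) split above.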
2.2. Generating Link-Level Topologies
\(\forall link\in E, (u,v)=link.nodes\)
\(\text{The link-level topology }LinkTopo_{<u,v>}=(V_{<u,v>},E_{<u,v>}).\)
Suppose the traffic through the link \(<u,v>\) originates from sources \(S_{<u,v>}\) and terminates in destinations \(T_{<u,v>}\) .
\(\text{Let }S_{<u,v>}:=\{wf.begin| wf \in WL_{<u,v>}\},wf.begin=wf[0].\)
\(\text{Let }T_{<u,v>}:=\{wf.end|wf \in WL_{<u,v>}\},wf.end=wf[wf.length-1].\)
\(\text{If }u.type=host\wedge v.type=switch,V_{<u,v>}=\{u,v\}\cup T_{<u,v>},E_{<u,v>}=\{newLink(u,v,mainlink)\} \cup \{newLink(v,h,downstream)|h \in T_{<u,v>}\}\text{, function newLink generates a new link object.}\)
\(\text{If }u.type=switch\wedge v.type=switch,V_{<u,v>}=\{u,v\}\cup T_{<u,v>}\cup S_{<u,v>},E_{<u,v>}=\{newLink(u,v,mainlink)\} \cup \{newLink(v,h,downstream)|h \in T_{<u,v>}\}\cup \{ newLink(h,u,upstream)|h\in S_{<u,v>} \}.\)
\(\text{If }u.type=switch\wedge v.type=host,V_{<u,v>}=\{u,v\}\cup S_{<u,v>},E_{<u,v>}=\{newLink(u,v,mainlink)\} \cup \{newLink(h,u,upstream)|h \in S_{<u,v>}\}.\)
\(LinkTopo_{<v,u>} \text{ is generated in the same way as above.}\)
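The three cases above collapse into two checks: a downstream fan-out is added whenever \(v\) is a switch, and an upstream fan-in whenever \(u\) is a switch. A sketch, with links represented as `(src, dst, kind)` tuples rather than objects:

```python
from collections import defaultdict

workflows = [
    ["h1", "s1", "s2", "h2"],
    ["h3", "s1", "s2", "h4"],
]
node_type = {"h1": "host", "h2": "host", "h3": "host", "h4": "host",
             "s1": "switch", "s2": "switch"}

# WL[(u, v)]: workflows crossing the directed link <u, v>.
WL = defaultdict(list)
for wf in workflows:
    for u, v in zip(wf, wf[1:]):
        WL[(u, v)].append(wf)

def build_link_topology(u, v, node_type, WL):
    """Link-level topology for directed link <u, v>, following the three
    cases above (a sketch; links are (src, dst, kind) tuples)."""
    wl = WL.get((u, v), [])
    sources = {wf[0] for wf in wl}    # S_<u,v>
    dests = {wf[-1] for wf in wl}     # T_<u,v>
    nodes = {u, v}
    links = [(u, v, "mainlink")]
    if node_type[v] == "switch":      # traffic fans out after v
        nodes |= dests
        links += [(v, h, "downstream") for h in sorted(dests)]
    if node_type[u] == "switch":      # traffic fans in before u
        nodes |= sources
        links += [(h, u, "upstream") for h in sorted(sources)]
    return nodes, links

nodes, links = build_link_topology("s1", "s2", node_type, WL)
```

For the switch-to-switch link `("s1", "s2")` this yields the main link plus two upstream and two downstream edges, i.e. the at-most-3-hop topology noted in the speedup analysis.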
2.2.1. Modeling Round-Trip Delay
\(\forall link \in E_{<u,v>}.\)
\(\text{Each such link corresponds to a unique workflow }wf\text{, s.t. }wf\in WL_{<u,v>}\wedge (\exists i<j, wf.nodes[i]=u \wedge wf.nodes[j]=v).\)
\(\text{Therefore the latency of the link is }link.latency=\sum_{k=i}^{j-1}getLink(wf.nodes[k],wf.nodes[k+1]).latency,\)
\(\text{where }getLink(u,v):=e\text{ s.t. }e\in E \wedge e.nodes=(u,v).\)
2.2.2. Selecting Link Bandwidths
\(\text{We should increase the bandwidth of downstream links.}\)
\(\forall link \in E_{<u,v>}.\)
\(\text{If }link.type=downstream\text{, then }link.bandwidth=addBandwidth(link)\text{, the function addBandwidth would increase the bandwidth of the link.}\)
2.2.3. Correcting for ACK Traffic
This part is too detailed, so I omitted it.
2.3. Post-Processing Link-Level Results
\(\forall link \in E,(u,v)=link.nodes.\)
\(\forall wf \in WL_{<u,v>}.\)
\(\text{We generate the FCT of the workflow in the link, } FCT_{<u,v>,wf}:=Simulate(LinkTopo_{<u,v>},wf).\)
\(\text{Let ideal FCT of the link for the workflow }IdealFCT_{<u,v>,wf}:=link.latency+wf.size/link.bandwidth.\)
\(\text{Let latency of the link for the workflow }Delay_{<u,v>,wf}:=FCT_{<u,v>,wf}-IdealFCT_{<u,v>,wf}\).
2.3.1. Packet-Normalized Delay
\(\text{Let packet-normalized delay }NormDelay_{<u,v>,wf}:=Delay_{<u,v>,wf}/wf.packetsize.\)
2.3.2. Bucketing Distributions
\(\text{Let set of packet-normalized delay for the link of each workflow }NormDelay_{<u,v>}:=\{NormDelay_{<u,v>,wf}|wf \in WL_{<u,v>}\}.\)
\(\text{Given hyper-parameters x,B}\).
For each bucket \(b\), let \(maxf_b\) and \(minf_b\) be the maximum and minimum flow sizes associated with \(b\), respectively, and let \(n_b\) be the number of elements in \(b\). Each bucket \(b\) apart from the last one is locally subject to two constraints:
\[n_b \ge B \text{ and } maxf_b \ge x*minf_b\]
The buckets and the bucketing distribution are generated greedily with the global parameters \(B\) and \(x\): flows are sorted by size and accumulated into the current bucket, which is closed as soon as both constraints hold; the final bucket is exempt from the constraints.
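A sketch of this greedy bucketing, reconstructed to be consistent with the two constraints above (not the paper's listing verbatim):

```python
def generate_buckets(flows, B, x):
    """Greedy bucketing over (flow_size, norm_delay) pairs: flows are
    sorted by size, and the current bucket is closed once it holds at
    least B flows and its max/min size ratio reaches x. The trailing
    bucket is exempt from both constraints."""
    buckets, current = [], []
    for size, delay in sorted(flows):
        current.append((size, delay))
        sizes = [s for s, _ in current]
        if len(current) >= B and max(sizes) >= x * min(sizes):
            buckets.append(current)
            current = []
    if current:  # the last bucket need not satisfy the constraints
        buckets.append(current)
    return buckets

flows = [(1, 0.1), (1, 0.2), (2, 0.1), (4, 0.3), (4, 0.1), (8, 0.2), (100, 0.5)]
buckets = generate_buckets(flows, B=2, x=2)
```

Each resulting bucket holds the normalized-delay samples of the flows whose sizes fall in its range, which is exactly what the aggregation step below queries.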
\(\text{Then we get the bucketing distribution of packet-normalized delay for the link }BD_{<u,v>}:=\{Bucket_{<u,v>,i}\}=GenerateBuckets(NormDelay_{<u,v>}).\)
2.4. Aggregating Link-Level Estimates
\(\text{Given }size,S,D \text{ as the size, source and destination, Parsimon computes the path from source to destination }Path=computePath(size,S,D,G).\)
\(\forall i, \vec{v_i}=<u_i,v_i> =Path[i].\)
\(\text{By querying the bucketing distribution with } size\text{, we obtain the bucket}\)
\(Bucket_i\in BD_{<u_i,v_i>}\text{, s.t. }minf_{Bucket_i}\le size \le maxf_{Bucket_i}.\)
\(\text{Then randomly sample a point from the bucket }D_i^* \stackrel{}{\leftarrow} Bucket_i.\)
\(\text{The end-to-end absolute delay } D \text{ is computed as }D=P\sum_{i}D_i^*\text{, where } P \text{ is the input flow size in packets.}\)
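The query-and-sample step can be sketched as follows, assuming each bucket stores its flow-size range alongside its normalized-delay samples (the triple layout is an assumption for illustration):

```python
import random

def query_bucket(bd, size):
    """Return the delay samples of the bucket whose size range covers
    `size`; bd is a list of (min_size, max_size, delay_samples) triples."""
    for min_f, max_f, delays in bd:
        if min_f <= size <= max_f:
            return delays
    return bd[-1][2]  # out-of-range sizes fall into the last bucket

def estimate_e2e_delay(size, packets, path, BD, trials=1000):
    """Monte Carlo estimate: per trial, sample one packet-normalized
    delay per link on the path, sum them, and scale by the flow size
    in packets (the factor P above)."""
    samples = []
    for _ in range(trials):
        total = sum(random.choice(query_bucket(BD[link], size))
                    for link in path)
        samples.append(packets * total)
    return samples

# Toy bucketed distributions with a single delay sample per bucket.
BD = {("s1", "s2"): [(0, 100, [0.5])],
      ("s2", "h2"): [(0, 100, [0.25])]}
samples = estimate_e2e_delay(size=10, packets=4,
                             path=[("s1", "s2"), ("s2", "h2")],
                             BD=BD, trials=10)
```

Repeating the trial many times, as here, yields the flow-level delay distribution rather than a single point estimate.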
Task #3 Programming:
Can you implement and test Algorithm 1 with a sample network? You may also refer to Appendix D for some details.
Please refer to the test code in algorithm.ipynb and the source code in clustering.
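The exact details of Algorithm 1 are in the referenced code. As a self-contained illustration of the clustering idea only (the feature choice and the threshold `eps` are assumptions, not the paper's), a greedy feature-based grouping might look like:

```python
def cluster_links(features, eps):
    """Greedy clustering of links by workload features (an illustration,
    not Algorithm 1 verbatim): a link joins the first cluster whose
    representative is within `eps` relative distance on every feature;
    otherwise it founds a new cluster with itself as representative."""
    clusters = []  # list of (representative_features, member_links)
    for link, feat in features.items():
        for rep, members in clusters:
            if all(abs(a - b) <= eps * max(abs(a), abs(b))
                   for a, b in zip(rep, feat)):
                members.append(link)
                break
        else:  # no cluster is close enough: start a new one
            clusters.append((feat, [link]))
    return clusters

# Hypothetical per-link features: (flow count, mean flow size).
features = {"l1": (10, 100.0), "l2": (10, 104.0), "l3": (50, 5000.0)}
clusters = cluster_links(features, eps=0.05)
```

Each cluster's first member serves as its representative, mirroring the simulate-one-link-per-cluster step described in Task #1.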