USENIX ATC '19 – INSIDER: Designing In-Storage Computing System for Emerging High-Performance Drive (talk transcript).
Hi, my name is Zhenyuan, and today I am going to present a work called INSIDER, which is an in-storage computing system for emerging high-performance drives. This is joint work with my collaborators Tong He and Jason Cong from UCLA.

Storage technology has improved significantly in the last decade: we have witnessed a Moore's law of storage drives, where the drive bandwidth doubles every two years. Meanwhile, however, the performance of the host-drive interconnection does not scale as well, as we see in the figure. This performance gap prevents us from leveraging the fast drive and makes the advanced storage technology futile. To combat this interconnection bottleneck, researchers proposed the in-storage computing architecture, in which the host partially offloads computation to the storage drive.
One simple, illustrative workload for this is a SQL query. Here, we can offload the filtering part directly into the storage drive to take advantage of the ample internal drive bandwidth. After filtering, the data volume is reduced, so less data needs to be transferred back to the host, which alleviates the interconnection bottleneck. In-storage computing is thus a promising solution; however, designing a full-stack in-storage computing system is challenging.
Here, we will take a look at every layer of the system stack. Existing work has different limitations at different layers, and I am going to discuss them in a bottom-up direction.

First, in the hardware layer, the in-storage computing unit has either limited performance or limited flexibility. Basically, there are two types of in-storage computing units. The first type is ARM-based: it is fully programmable and can support general computing; however, its computation capability is insufficient to saturate the high internal drive bandwidth, which creates yet another system bottleneck. The second type is ASIC-based, customized hardware designed for specific workloads: it can achieve very high performance on those workloads, but it is not programmable, which means that after it is manufactured it can only support those workloads.

Second, in the system runtime, some crucial system supports are missing, making existing systems less usable in practical scenarios. Specifically, the runtime has to support protection, because we cannot really trust the offloaded program, which may issue arbitrary drive access requests to access or even manipulate unauthorized drive data. In addition, it has to support virtualization, since in practice the storage drive is shared among multiple users or processes. More importantly, a single drive program may not fully utilize the high internal drive bandwidth; therefore, it is preferable to co-locate multiple drive programs simultaneously.

Finally, in the programming layer, existing systems lack an effective abstraction: they expose drive-specific APIs which are not integrated with the existing system interfaces, so they require considerable host code modification.
To address those issues, we propose a new system called INSIDER. Before we dive into the system design, let's take a quick overview of the approach. First, in the hardware layer, INSIDER adopts the FPGA, a reconfigurable architecture, for both performance and flexibility; we will show in the evaluation that INSIDER achieves a major improvement in performance and cost efficiency compared with an ARM-based in-storage computing system. Second, to provide crucial system supports like protection and virtualization, in the system runtime layer we build a separate control plane to enforce important system policies like permission checks and resource scheduling. Finally, in the programming layer, we provide a file-based abstraction for in-storage computing: we expose POSIX-like APIs which are familiar to host users, and we will show in the evaluation that much less programming effort is required.

Okay, after this overview, let's now take a look at the INSIDER system design.
Choosing a proper in-storage computing unit is crucial for both flexibility and efficiency. There are many architecture candidates, like GPU, ARM, x86, ASIC, and FPGA, and it seems hard to make a decision. To guide our selection, we first analyzed the requirements of the in-storage computing unit. First, it has to be programmable, to support general in-storage computing rather than just a few specific workloads. Second, it has to have massive parallelism to saturate the high internal bandwidth; otherwise, it will become the bottleneck of the whole system. Finally, it should have high energy efficiency: storage devices are originally energy-efficient, and we do not want to compromise that significantly.

Based on those requirements, we evaluated multiple architecture candidates, and the results are shown in this table. Compared with the ASIC, the FPGA sacrifices some performance and energy efficiency; however, it has much better programmability to support general in-storage computing. Compared with ARM and x86, the FPGA has better parallelism, hence performance, and higher energy efficiency. Streaming workloads like string matching have plenty of pipeline-level parallelism, which is supported by the FPGA but not by the GPU; in addition, the FPGA has better energy efficiency compared with the GPU. Therefore, in our system we adopt the FPGA as the in-storage computing unit. Now let's apply it to our system.
I will first start with an initial strawman architecture and gradually refine it into the final architecture. First, let's add programmability to the storage drive: we add an FPGA chip into the storage controller, which is able to perform the in-storage computation. To leverage the drive's programming capability, the host first offloads the computational logic, as a drive program, into the FPGA chip. To retrieve the necessary data, the drive program actively issues drive access requests, which contain logical block addresses, or LBAs in short. The drive firmware then translates them into physical block addresses, or PBAs in short, and accesses the storage units accordingly. The response data is then read by the drive program to perform the in-storage computation, and finally the output result is sent back to the host through DMA.
So far, the design seems to be fine. However, some crucial system supports are missing. First, it lacks protection: the offloaded drive program can issue arbitrary drive access requests to access or even manipulate unauthorized drive data, which can be exploited to compromise the system. The root cause is that our initial architecture only contains the data plane, to perform the in-storage computation; it does not have a control plane to enforce system policies. In INSIDER, we build a separate control plane to support those missing features.

First, we make the drive program compute-only: it is no longer able to perform direct drive accesses; instead, these are handled by the newly added control plane. Let's take a look at the refined design. Immediately after the offloading step, the host program provides the file paths of the in-storage computing input. Here, the host file system enforces the file permission check and denies unauthorized accesses. After that, the corresponding logical block addresses are sent to the drive firmware to issue the actual storage I/O requests. These steps are enforced by our trusted INSIDER runtime, which serves as a centralized proxy among multiple host programs.
We then extend the control plane to support virtualization, that is, executing multiple drive programs simultaneously. First and foremost, we need hardware features to support virtualization: we leverage the partial reconfiguration offered by modern FPGA chips to spatially partition the FPGA resources evenly into three pieces, which enables a multi-core FPGA. Second, we take away the authority of host programs to perform offloading directly; it is now performed exclusively by the INSIDER runtime on behalf of the host programs. However, having these techniques alone is not sufficient. Since we now co-locate programs, it is important to enforce drive bandwidth scheduling among the running drive programs, which we call drive processes. A drive process has a different execution rate at different times, so the scheduler should be adaptive. In addition,
the scheduler should be fair, to prevent one process from forcefully occupying the drive bandwidth. However, we cannot simply do this at the host side, inside the INSIDER runtime, because it is too slow: the PCIe round-trip time is about one microsecond, and the slow reaction would stall the drive processes and increase the required buffer size. Therefore, in INSIDER we made the decision to offload this part of the control plane to the FPGA and build a hardware-based scheduler. The scheduler accepts data from the storage chips and dispatches it to the corresponding drive processes; by leveraging this dispatch information, it is able to monitor the drive bandwidth consumption of each drive process in real time. The scheduler then feeds this information back to the firmware to adjust the issue rates of the storage I/O requests accordingly, for fairness.
We designed a scheduling policy similar to deficit round robin in networking QoS; please refer to our paper for further details.

Finally, let's take a look at INSIDER's programming model to see how we use it. The key idea is to abstract in-storage computation as file operations, which are familiar to host programmers. The user can register a virtual file based on a real file and the corresponding in-storage computing program; the data of the virtual file corresponds to the output of processing the real file. After the registration, the user has the illusion that the virtual file actually exists in the file system and can be accessed through POSIX-like APIs. This abstraction is simple but effective. One simple example is machine learning: usually, people first apply a feature selection algorithm to prune the real data and then feed the result into a training algorithm like SVM. With the virtual file abstraction, this is easy to implement in INSIDER: the user registers a virtual file based on the real file and the feature selection program, and after that mostly does not need to change the existing SVM code.
They only need to pass the path of this virtual file into the SVM function, and everything works smoothly.

Now, after introducing the design of INSIDER, let's go through the evaluation. We built the in-storage computing drive using a PCIe-based FPGA board. This table shows the evaluation setup: the drive has 64 GB capacity, 5-microsecond access latency, and 16 GB/s sequential bandwidth. We conducted experiments under both PCIe Gen3 x8 and x16 settings, but in the following I will only show the results with x8. For the host file system, we use XFS.

This table shows the applications used in the evaluation, as well as their development effort. Our code is written in C. On average, each application takes tens of lines of host code and hundreds of lines of drive code. We also show the development time in person-days, which ranges from two to nine across applications. For comparison, we also show the development effort of an existing work (Willow): as shown in the figure, it takes thousands of lines of code and over one person-month even for implementing simple operations. Next,
I am going to compare INSIDER with an ARM-based in-storage computing system, obtained by replacing the Xilinx FPGA chip with an ARM Cortex-A72 while keeping all other system components unchanged. The figure shows the throughput comparison. For the ARM-based solution, we show the performance of using one core, two cores, three cores, and four cores; here, we assume perfect inter-core parallelism, so the multi-core result is projected from the single-core result simply by multiplying by the number of cores. Compared with the four-core case, the orange bar here, INSIDER, the green bar, achieves 12x performance on average. The speedup varies across applications, because different applications have different computational intensities: for the most computation-intensive workload, INSIDER achieves 58x performance, while the SQL query has the lowest computational intensity, so INSIDER does not have any performance gain in that case. In addition,
we show the maximal drive bandwidth using the horizontal red bar, and we can see that INSIDER actually saturates the drive bandwidth in most cases, thereby achieving the optimal performance.

Next, I am going to compare the cost efficiency, which is defined here as throughput per dollar. For a fair comparison, we used wholesale prices in the evaluation: the Xilinx FPGA chip costs thirty-seven dollars, while the ARM Cortex-A72 costs ninety-five dollars. The results are shown in this table: INSIDER achieves a cost-efficiency improvement ranging from 25x to 104x, and 31x on average.

Besides this, our paper offers more details on the evaluation, including INSIDER versus the original host-only architecture, an analysis of the FPGA resource utilization of the implemented applications, the evaluation of INSIDER's bandwidth scheduler, and so on, so definitely check out the paper for further details.
Finally, let me conclude the talk. In this talk, we observed a data movement wall between the host and the drive, which prevents end users from leveraging the advancing storage technology. To cross this wall, we presented a new system called INSIDER. It achieves three properties. First, INSIDER achieves high end-to-end performance and cost efficiency. Second, INSIDER exposes a simple but effective abstraction for in-storage computing, to reduce the host-side programming effort. Finally, its control plane design enables protection and virtualization for a shared, multi-tenant environment. With this,
I am happy to take questions.

Hi, I am from the Technion. Great work. I have two questions, actually. First, where does the difference in the amount of source code between Willow and your work come from? That is number one. And number two: I wonder if you have an evaluation of these workloads compared with regular CPUs,
not in-storage ones.

Yeah, so for the first question: Willow is not open source, so we could not actually get it to implement our workloads; we took the numbers from their paper to make the comparison. Our point is that, even for implementing simple operations, like a simple I/O operation or file appending, Willow still takes a large programming effort. But we cannot really make a head-to-head comparison on our workloads, because we cannot obtain Willow. And for the second question, you are asking about comparing with the host-only architecture? Due to the time limitation we did not include that in the presentation, but the results are in the paper, so you can definitely check the paper for this. Thanks.

Thanks for the great talk; I am from UNIST. You mentioned that you extended the control plane at the host side because the firmware does not have any protection-related features. But maybe you could have tried to offload the permission check to the FPGA as well, which I think would reduce the latency and could improve the performance. Is there any reason why you did not try, or could not, offload the permission check to the FPGA?
Yeah, that is a very good question. If you really want to offload this part to the FPGA, you basically need to decode the file system metadata, and that is difficult to achieve in the FPGA: the implementation would require lots of FPGA resources, and it is also hard to stay compatible across host file systems. So here we made the decision to enforce it on the host side, by leveraging the knowledge of the host file system. And because this is more like a one-time effort, you just check the permission at the beginning of the computation and later on you do not need to do it anymore, so the overhead is actually minimal.

I see, thank you.

Thanks. Let's thank the speaker again.