Hello all,<br><br>I am currently working on building a portable SIMD programming model(SIMD within a register) for various platforms, like x86, ARM and so on. I know that LLVM supports SIMD programming on many architectures, so I am thinking of using LLVM to help build this model but I not quite sure whether LLVM can help.<br>

<br>The following describes the details of the model and also my current 

approach for this model. <br><br><font size="2"><b>Background</b></font><br>Almost modern processor families support SIMD instruction sets but the instruction set designs for each platform have different combinations of operations. The portable SIMD here is to make an uniform system of SIMD operations at all power-of-2 field widths.<br>

For example, for simd_add on SSE2, I want to have all the following operations supported, simd<2^x>::add(a, b) for 2<=2^x<=register_size(register_size equals 128 in SSE2). However, SSE2 only supports simd_add on field widths of 8, 16, 32, or 64. Hence, simulation for simd_add on field widths of 2, 4 and 128 is needed. <b>At mean time, I want the simulation to be as good as possible</b>.<br>

<br><b>An iteration Approach</b><br>My current approach of getting the best simulation is a strategy-based iteration algorithm. For example, simd<4>::add can be implemented based on simd<8>::add in the following manner.<br>

<br>===strategy_add_doubling===<br>1. hiMask = simd<8>::himask() //this is a constant with high 4 bits of each 8-bits field are 1111 and low 4 bits of each 8-bits field are 0000.<br>2. loMask = simd<8>::lowmask() //this is a constant with high 4 bits of each 8-bits field are 0000 and low 4 bits of each 8-bits field are 1111.<br>

3. c = simd<8>::add(simd_and(loMask, a), simd_and(loMask, b))<br>4. d = simd<8>::add(simd_and(hiMask, a), simd_and(hiMask, b))<br>5. return simd_or(d, simd_and(loMask, c)) // get the answer for simd<4>::add(a, b)<br>

<br>Also, it is not hard to see that simd<fw>::add can be implemented based on simd<fw/2>::add. The key is getting the correct carry information and constructing that carry mask.<br>===strategy_add_halving===<br>

1. c = simd<fw/2>::add(a, b)<br>2. get the carry mask d //the mask should look like ...00010000...00000000...00010000...<br>3. return simd<fw/2>::add(d, c)<br><br>As you can see, for a certain simd operation, there are many ways/strategies to simulate it, like simd<fw>::add, it can be simulated based on simd<2*fw>::add and also based on simd<fw/2>::add.<br>

<br>Then, we can construct as many strategies as possible for each simd operation we wanna support. Based on all the strategies we have, an iteration algorithm can be performed to get the best simulation for each simd operation. The criteria here for evaulation is the number of instructions used in a simulation, so the best simulation will have the least number of instructions.<br>

<br><b>Can I use LLVM to improve the above approach or even get a new approach?</b><br><br>Since LLVM supports various architectures and captures the system resources very well, I am wondering whether LLVM could be used to help the project. I went through the LLVM documentation, but I am still stucking here on how to incorporate LLVM into my approach. Below are some of my thoughts.<br>

<br>1. The strategy-based approach doesn't capture the target resources very well in terms of CPU cycles for operations, instruction selection, etc. There might be a case that two different simulations for a certain operation have the same number of instructions, the strategy-based approach can't handle this case very well(It just randomly picks one).<br>

2. The strategies are written by human, so they are not carefully tuned to achieve best performance. There are quite a lot of room for improving the performance.<br>3. Can LLVM be incorporated into the strategy-based approach so that LLVM provides additional information for getting the best simulation?<br>

4. Instead of using the strategy-based approach, can we utilize LLVM to get a new approach to achieve the goal?<br><br>I know that LLVM has auto-vectorization and can generate SIMD instructions for scalar codes. But I don't know the details how LLVM captures the target resources and generates good instructions especially the SIMD instructions on different platforms.<br>

<br>Any suggestions or comments are appreciated.<br><br>I look forward to hearing from you.<br><br>Thanks a lot,<br>Kevin<br>