CUDA __expf intrinsic

I was wondering whether the sigmoid operation could be sped up using the CUDA intrinsic __expf. Before I try to write a custom op, is TensorFlow already doing this? I've spent several hours looking through the source code, but because the functor declarations are separated from their implementations, I've had trouble figuring this one out.
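For concreteness, here is a minimal standalone sketch of the kind of kernel in question. It is not TensorFlow code: the names sigmoid_fast, sigmoid_ref and max_abs_error are made up for illustration, and error checking on the CUDA calls is left out for brevity. It only compares a sigmoid built on __expf against one built on the standard expf.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: sigmoid using the fast, reduced-accuracy __expf intrinsic.
__global__ void sigmoid_fast(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f / (1.0f + __expf(-in[i]));
}

// Reference kernel: same formula with the standard expf.
__global__ void sigmoid_ref(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f / (1.0f + expf(-in[i]));
}

// Runs both kernels on n evenly spaced inputs in [-10, 10] (n >= 2)
// and returns the largest absolute difference between their outputs.
// CUDA error checking is omitted to keep the sketch short.
float max_abs_error(int n) {
    std::vector<float> h_in(n), h_fast(n), h_ref(n);
    for (int i = 0; i < n; ++i) h_in[i] = -10.0f + 20.0f * i / (n - 1);

    size_t bytes = n * sizeof(float);
    float *d_in, *d_fast, *d_ref;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_fast, bytes);
    cudaMalloc(&d_ref, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sigmoid_fast<<<blocks, threads>>>(d_in, d_fast, n);
    sigmoid_ref<<<blocks, threads>>>(d_in, d_ref, n);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaMemcpy(h_fast.data(), d_fast, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ref.data(), d_ref, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_fast);
    cudaFree(d_ref);

    float err = 0.0f;
    for (int i = 0; i < n; ++i) err = fmaxf(err, fabsf(h_fast[i] - h_ref[i]));
    return err;
}

int main() {
    printf("max abs diff: %g\n", max_abs_error(1 << 20));
    return 0;
}
```

Note that compiling with nvcc's -use_fast_math flag already maps expf to __expf, so with that flag both kernels would compute the same thing. Whether the intrinsic gives a measurable speedup for an op like this would need to be timed on the target GPU.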

Hi, I think you can follow the same path for the .cu files as in: