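This example shows how to choose a kernel launch block size that maximizes theoretical occupancy, using the CUDA occupancy API instead of a hand-tuned constant. It is based on the NVIDIA blog post "CUDA Pro Tip: Occupancy API Simplifies Launch Configuration".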
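The technique can be sketched as follows. This is a minimal illustration and not necessarily the code shipped with this example: the kernel `squareKernel`, the helpers `gridSizeFor` and `theoreticalOccupancy`, and the `CHECK` macro are hypothetical names chosen here. The runtime calls `cudaOccupancyMaxPotentialBlockSize` and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` are the occupancy API functions described in the blog post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: squares each element in place.
__global__ void squareKernel(int *array, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) {
        array[idx] *= array[idx];
    }
}

// Number of blocks needed to cover `count` elements (ceiling division).
static int gridSizeFor(int count, int blockSize)
{
    return (count + blockSize - 1) / blockSize;
}

// Ratio of resident warps per SM to the maximum number of warps an SM can hold.
static float theoreticalOccupancy(int activeBlocks, int blockSize,
                                  int warpSize, int maxThreadsPerSM)
{
    int activeWarps = activeBlocks * blockSize / warpSize;
    int maxWarps = maxThreadsPerSM / warpSize;
    return (float)activeWarps / (float)maxWarps;
}

int main()
{
    const int count = 1000000;
    int *host = (int *)malloc(count * sizeof(int));
    for (int i = 0; i < count; ++i) host[i] = i % 100;  // small values, so squares cannot overflow

    int *dev = nullptr;
    CHECK(cudaMalloc(&dev, count * sizeof(int)));
    CHECK(cudaMemcpy(dev, host, count * sizeof(int), cudaMemcpyHostToDevice));

    // Ask the runtime for the block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    CHECK(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, squareKernel, 0, 0));
    int gridSize = gridSizeFor(count, blockSize);
    printf("Grid size is %d, array count is %d, min grid size is %d\n",
           gridSize, count, minGridSize);

    squareKernel<<<gridSize, blockSize>>>(dev, count);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Query how many blocks of this size can be resident on one SM.
    int activeBlocks = 0;
    CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, squareKernel, blockSize, 0));
    int device = 0;
    cudaDeviceProp props;
    CHECK(cudaGetDevice(&device));
    CHECK(cudaGetDeviceProperties(&props, device));
    printf("Launched blocks of size %d. Theoretical occupancy: %f\n", blockSize,
           theoreticalOccupancy(activeBlocks, blockSize, props.warpSize,
                                props.maxThreadsPerMultiProcessor));

    // Verify the result on the host.
    CHECK(cudaMemcpy(host, dev, count * sizeof(int), cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int i = 0; i < count; ++i) {
        int v = i % 100;
        if (host[i] != v * v) { ok = false; break; }
    }
    printf(ok ? "Data is correct\n" : "Data is incorrect\n");

    CHECK(cudaFree(dev));
    free(host);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

A file like this would typically be compiled with `nvcc` (for example `nvcc -o occupancy occupancy.cu`, where the file name is an assumption). Letting the runtime pick the block size keeps the launch configuration valid across GPUs with different register and shared-memory limits.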
Requirements: the CUDA Toolkit and a compatible NVIDIA driver.
Open a terminal and type:
sh run.sh
Typical output looks like this (the exact values depend on the GPU):
Grid size is 977, array count is 1000000, min grid size is 48
Device maxThreadsPerMultiProcessor 2048
Device warpSize 32
Launched blocks of size 1024. Theoretical occupancy: 1.000000
Data is correct
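The numbers in this output can be checked by hand. The worked arithmetic below assumes the program uses ceiling division for the grid size and the occupancy formula from the blog post (resident warps per multiprocessor divided by the maximum warps per multiprocessor). The number of resident blocks per multiprocessor, written B here, is not printed; B = 2 is inferred from the reported occupancy of 1.0.

```latex
\begin{align*}
\text{gridSize} &= \left\lceil \frac{1000000}{1024} \right\rceil = 977, \\
\text{occupancy} &= \frac{B \cdot 1024 / 32}{2048 / 32} = \frac{32\,B}{64} = 1
  \quad \text{for } B = 2 .
\end{align*}
```

In other words, two blocks of 1024 threads fill all 2048 thread slots (64 warps) of a multiprocessor on this device.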