2011-09-26[SCSI] cxgb3i: convert cdev->l2opt to use rcu to prevent NULL dereferenceNeil Horman1-5/+5
This oops was reported recently: d:mon> e cpu 0xd: Vector: 300 (Data Access) at [c0000000fd4c7120] pc: d00000000076f194: .t3_l2t_get+0x44/0x524 [cxgb3] lr: d000000000b02108: .init_act_open+0x150/0x3d4 [cxgb3i] sp: c0000000fd4c73a0 msr: 8000000000009032 dar: 0 dsisr: 40000000 current = 0xc0000000fd640d40 paca = 0xc00000000054ff80 pid = 5085, comm = iscsid d:mon> t [c0000000fd4c7450] d000000000b02108 .init_act_open+0x150/0x3d4 [cxgb3i] [c0000000fd4c7500] d000000000e45378 .cxgbi_ep_connect+0x784/0x8e8 [libcxgbi] [c0000000fd4c7650] d000000000db33f0 .iscsi_if_rx+0x71c/0xb18 [scsi_transport_iscsi2] [c0000000fd4c7740] c000000000370c9c .netlink_data_ready+0x40/0xa4 [c0000000fd4c77c0] c00000000036f010 .netlink_sendskb+0x4c/0x9c [c0000000fd4c7850] c000000000370c18 .netlink_sendmsg+0x358/0x39c [c0000000fd4c7950] c00000000033be24 .sock_sendmsg+0x114/0x1b8 [c0000000fd4c7b50] c00000000033d208 .sys_sendmsg+0x218/0x2ac [c0000000fd4c7d70] c00000000033f55c .sys_socketcall+0x228/0x27c [c0000000fd4c7e30] c0000000000086a4 syscall_exit+0x0/0x40 --- Exception: c01 (System Call) at 00000080da560cfc The root cause was an EEH error, which sent us down the offload_close path in the cxgb3 driver, which in turn sets cdev->l2opt to NULL, without regard for upper layer driver (like the cxgbi drivers) which might have execution contexts in the middle of its use. The result is the oops above, when t3_l2t_get attempts to dereference L2DATA(cdev)->nentries in arp_hash right after the EEH error handler sets it to NULL. The fix is to prevent the setting of the NULL pointer until after there are no further users of it. The t3cdev->l2opt pointer is now converted to be an rcu pointer and the L2DATA macro is now called under the protection of the rcu_read_lock(). 
When the EEH error path: t3_adapter_error->offload_close->cxgb3_offload_deactivate is executed, setting of that l2opt pointer to NULL is now gated on an rcu quiescence point, allowing L2DATA callers to safely check for a NULL pointer without concern that the underlying data will be freed before the pointer is dereferenced. This has been tested by the reporter and shown to fix the reported oops. [nhorman: fix up uninitialised variable reported by Dan Carpenter] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Karen Xie <kxie@chelsio.com> Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-07-26atomic: use <linux/atomic.h>Arun Sharma3-3/+3
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-22Merge branch 'for-linus' of ↵Linus Torvalds32-1042/+629
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (26 commits) IB/qib: Defer HCA error events to tasklet mlx4_core: Bump the driver version to 1.0 RDMA/cxgb4: Use printk_ratelimited() instead of printk_ratelimit() IB/mlx4: Support PMA counters for IBoE IB/mlx4: Use flow counters on IBoE ports IB/pma: Add include file for IBA performance counters definitions mlx4_core: Add network flow counters mlx4_core: Fix location of counter index in QP context struct mlx4_core: Read extended capabilities into the flags field mlx4_core: Extend capability flags to 64 bits IB/mlx4: Generate GID change events in IBoE code IB/core: Add GID change event RDMA/cma: Don't allow IPoIB port space for IBoE RDMA: Allow for NULL .modify_device() and .modify_port() methods IB/qib: Update active link width IB/qib: Fix potential deadlock with link down interrupt IB/qib: Add sysfs interface to read free contexts IB/mthca: Remove unnecessary read of PCI_CAP_ID_EXP IB/qib: Remove double define IB/qib: Remove unnecessary read of PCI_CAP_ID_EXP ...
2011-07-22Merge branches 'cma', 'cxgb4', 'ipath', 'misc', 'mlx4', 'mthca', 'qib' and ↵Roland Dreier32-1042/+629
'srp' into for-next
2011-07-22IB/qib: Defer HCA error events to taskletMike Marciniszyn2-21/+53
With ib_qib options: options ib_qib krcvqs=1 pcie_caps=0x51 rcvhdrcnt=4096 singleport=1 ibmtu=4 a run of ib_write_bw -a yields the following: ------------------------------------------------------------------ #bytes #iterations BW peak[MB/sec] BW average[MB/sec] 1048576 5000 2910.64 229.80 ------------------------------------------------------------------ The top cpu use in a profile is: CPU: Intel Architectural Perfmon, speed 2400.15 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 1002300 Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000 samples % samples % app name symbol name 15237 29.2642 964 17.1195 ib_qib.ko qib_7322intr 12320 23.6618 1040 18.4692 ib_qib.ko handle_7322_errors 4106 7.8860 0 0 vmlinux vsnprintf Analysis of the stats, profile, the code, and the annotated profile indicate: - All of the overflow interrupts (one per packet overflow) are serviced on CPU0 with no mitigation on the frequency. - All of the receive interrupts are being serviced by CPU0. (That is the way truescale.cmds statically allocates the kctx IRQs to CPU) - The code is spending all of its time servicing QIB_I_C_ERROR RcvEgrFullErr interrupts on CPU0, starving the packet receive processing. - The decode_err routine is very inefficient, using a printf variant to format a "%s" and continues to loop when the errs mask has been cleared. - Both qib_7322intr and handle_7322_errors read pci registers, which is very inefficient. The fix does the following: - Adds a tasklet to service QIB_I_C_ERROR - Replaces the very inefficient scnprintf() with a memcpy(). A field is added to qib_hwerror_msgs to save the sizeof("string") at compile time so that a strlen is not needed during err_decode(). - The most frequent errors (Overflows) are serviced first to exit the loop as early as possible. 
- The loop now exits as soon as the errs mask is clear rather than fruitlessly looping through the msp array. With this fix the performance changes to: ------------------------------------------------------------------ #bytes #iterations BW peak[MB/sec] BW average[MB/sec] 1048576 5000 2990.64 2941.35 ------------------------------------------------------------------ During testing of the error handling overflow patch, it was determined that some CPUs were slower when servicing both overflow and receive interrupts on CPU0 with different MSI interrupt vectors. This patch adds an option (krcvq01_no_msi) to not use a dedicated MSI interrupt for kctx's < 2 and to service them on the default interrupt. For some CPUs, the interrupt enter/exit is more costly than the additional PCI read in the default handler. Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
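The decode changes described above (a size cached at compile time, memcpy() instead of a printf variant, most frequent errors first, and an early exit once the mask is clear) can be sketched in userspace C. This is a minimal illustration, not the driver's code: `struct hwerror_msg`, `HWERR_MSG`, `demo_msgs`, the bit positions, and the `OtherErrA`/`OtherErrB` names are assumptions made for the example; only `RcvEgrFullErr` comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: sz caches sizeof("string") at compile time,
 * so no strlen() is needed while decoding. */
struct hwerror_msg {
    uint64_t mask;
    const char *msg;
    size_t sz;              /* includes the trailing NUL */
};

#define HWERR_MSG(m, s) { (m), (s), sizeof(s) }

/* Most frequent error first so the loop can finish early.
 * Bit assignments here are invented for the example. */
static const struct hwerror_msg demo_msgs[] = {
    HWERR_MSG(1ULL << 0, "RcvEgrFullErr"),
    HWERR_MSG(1ULL << 1, "OtherErrA"),
    HWERR_MSG(1ULL << 2, "OtherErrB"),
};

/* Write the names of the set bits in errs into buf, comma separated.
 * Returns the bits that were not decoded (0 if everything was handled). */
static uint64_t decode_err(char *buf, size_t blen, uint64_t errs)
{
    size_t used = 0;
    size_t i;

    assert(blen > 0);
    buf[0] = '\0';

    /* "&& errs" stops the walk as soon as the mask is clear. */
    for (i = 0; i < sizeof(demo_msgs) / sizeof(demo_msgs[0]) && errs; i++) {
        const struct hwerror_msg *m = &demo_msgs[i];
        size_t len = m->sz - 1;         /* text length without the NUL */
        size_t sep = used ? 1 : 0;      /* one ',' between entries */

        if (!(errs & m->mask))
            continue;
        if (used + sep + len + 1 > blen)
            break;                      /* out of room, report leftovers */
        if (sep)
            buf[used++] = ',';
        memcpy(buf + used, m->msg, len);
        used += len;
        buf[used] = '\0';
        errs &= ~m->mask;
    }
    return errs;
}
```

Calling `decode_err(buf, sizeof(buf), errs)` with only the first bit set touches a single table entry and leaves the loop immediately, which is the behavior the commit message is after for overflow storms.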
2011-07-21nes: do vlan cleanupJiri Pirko3-28/+45
- unify vlan and nonvlan rx path - kill nesvnic->vlan_grp and nes_netdev_vlan_rx_register - allow to turn on/off rx/tx vlan accel via ethtool (set_features) Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-18RDMA/cxgb4: Use printk_ratelimited() instead of printk_ratelimit()Manuel Zerpies1-4/+5
Since printk_ratelimit() shouldn't be used anymore (see comment in include/linux/printk.h), replace it with printk_ratelimited(). Signed-off-by: Manuel Zerpies <manuel.f.zerpies@ww.stud.uni-erlangen.de> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/mlx4: Support PMA counters for IBoEOr Gerlitz1-1/+67
Use the per port counter attached to all QPs created on that port to implement port level packets/bytes performance counters a la IB. Derived from a patch by Eli Cohen <eli@mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/mlx4: Use flow counters on IBoE portsOr Gerlitz3-3/+27
Allocate flow counter per Ethernet/IBoE port, and attach this counter to all the QPs created on that port. Based on patch by Eli Cohen <eli@mellanox.co.il>. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/pma: Add include file for IBA performance counters definitionsOr Gerlitz3-344/+75
Move the various definitions and mad structures needed for software implementation of IBA PM agent from the ipath and qib drivers into a single include file, which in turn could be used by more consumers. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/mlx4: Generate GID change events in IBoE codeOr Gerlitz1-1/+1
IBoE doesn't use LIDs. Use the GID change event to update the IB core cache for addition/deletion of GIDs. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18RDMA: Allow for NULL .modify_device() and .modify_port() methodsBart Van Assche4-36/+0
These methods don't make sense for iWARP devices, so rather than forcing them to implement stubs, just return -ENOSYS in the core if the hardware driver doesn't set .modify_device and/or .modify_port. Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/qib: Update active link widthMitko Haralanov1-3/+23
Update the active link width on QLE7220 chips when link goes down if chip width does not match shadowed width. Signed-off-by: Mitko Haralanov <mitko@qlogic.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/qib: Fix potential deadlock with link down interruptRam Vepa1-2/+3
There is a possibility of a deadlock due to the way locks are acquired and released in qib_set_uevent_bits(). The function qib_set_uevent_bits() is called in process context and it uses spin_lock() and spin_unlock(). This same lock is acquired/released in interrupt context which can lead to a deadlock when running on the same cpu. The fix is to replace spin_lock() and spin_unlock() with spin_lock_irqsave() and spin_unlock_irqrestore() respectively in qib_set_uevent_bits(). Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/qib: Add sysfs interface to read free contextsRam Vepa1-0/+14
Indicate the number of free user contexts via the sysfs file /sys/class/infiniband/qib0/nfreectxts as required for PSM. Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/mthca: Remove unnecessary read of PCI_CAP_ID_EXPJon Mason2-2/+2
The PCIE capability offset is saved during PCI bus walking. It will remove an unnecessary search in the PCI configuration space if this value is referenced instead of reacquiring it. Also, pci_is_pcie is a better way of determining if the device is PCIE or not (as it uses the same saved PCIE capability offset). Signed-off-by: Jon Mason <jdmason@kudzu.us> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/qib: Remove double defineEdwin van Vliet1-1/+0
Signed-off-by: Edwin van Vliet <edwin@cheatah.nl> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/qib: Remove unnecessary read of PCI_CAP_ID_EXPJon Mason1-4/+4
The PCIE capability offset is saved during PCI bus walking. It will remove an unnecessary search in the PCI configuration space if this value is referenced instead of reacquiring it. Signed-off-by: Jon Mason <jdmason@kudzu.us> Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-18IB/ipath: Convert old cpumask api into new oneMotohiro KOSAKI1-4/+7
Adapt to new api. We plan to remove old one later. Almost all changes are trivial, but there is one real fix: the following code is unsafe: int ncpus = num_online_cpus() for (i = 0; i < ncpus; i++) { .. } because 1) we don't guarantee last bit of online cpus is equal to num_online_cpus(). some arch assign sparse cpu number. 2) cpu hotplugging may change cpu_online_mask at same time. we need to pin it by get_online_cpus(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
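The first problem named above (sparse CPU numbering) can be shown with a small userspace sketch. `demo_cpumask`, `visit_by_count()` and `visit_by_mask()` are hypothetical stand-ins for the kernel's cpumask API, not real kernel calls, and the sketch does not model the hotplug pinning that get_online_cpus() provides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_online_mask: bit n set means CPU n is online. */
typedef uint64_t demo_cpumask;

/* Count the set bits, analogous to num_online_cpus(). */
static int demo_num_online(demo_cpumask m)
{
    int n = 0;

    while (m) {
        m &= m - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Unsafe pattern: assumes the online CPUs are numbered 0..n-1.
 * Records every CPU id the loop would touch. */
static int visit_by_count(demo_cpumask m, int *out)
{
    int ncpus = demo_num_online(m);
    int i;

    for (i = 0; i < ncpus; i++)
        out[i] = i;     /* may be offline, and later online ids are missed */
    return ncpus;
}

/* Safe pattern: walk the bits that are actually set. */
static int visit_by_mask(demo_cpumask m, int *out)
{
    int n = 0;
    int cpu;

    for (cpu = 0; cpu < 64; cpu++)
        if (m & (1ULL << cpu))
            out[n++] = cpu;
    return n;
}
```

With CPUs 0, 2 and 4 online (mask 0x15), the counting loop visits ids 0, 1 and 2, so it touches offline CPU 1 and never reaches CPU 4, while the mask walk visits exactly 0, 2 and 4.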
2011-07-18IB/qib: Convert old cpumask api into new oneMotohiro KOSAKI1-5/+6
Adapt to use new APIs. We plan to remove old one later and plan to change current->cpus_allowed implementation. No functional change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-07-17net: Abstract dst->neighbour accesses behind helpers.David S. Miller3-26/+30
dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-15IB/mthca: Stop returning separate error and status from FW commandsGoldwyn Rodrigues12-611/+342
Instead of having firmware command functions return an error and also a status, leading to code like: err = mthca_FW_COMMAND(..., &status); if (err) goto out; if (status) { err = -E...; goto out; } all over the place, just handle the FW status inside the FW command handling code (the way mlx4 does it), so we can simply write: err = mthca_FW_COMMAND(...); if (err) goto out; In addition to simplifying the source code, this also saves a healthy chunk of text: add/remove: 0/0 grow/shrink: 10/88 up/down: 510/-3357 (-2847) function old new delta static.trans_table 324 584 +260 mthca_cmd_poll 352 477 +125 mthca_cmd_wait 511 567 +56 mthca_table_put 213 240 +27 mthca_cleanup_db_tab 372 387 +15 __mthca_remove_one 314 323 +9 mthca_cleanup_user_db_tab 275 283 +8 __mthca_init_one 1738 1746 +8 mthca_cleanup 20 21 +1 mthca_MAD_IFC 1081 1082 +1 mthca_MGID_HASH 43 40 -3 mthca_MAP_ICM_AUX 23 20 -3 mthca_MAP_ICM 19 16 -3 mthca_MAP_FA 23 20 -3 mthca_READ_MGM 43 38 -5 mthca_QUERY_SRQ 43 38 -5 mthca_QUERY_QP 59 54 -5 mthca_HW2SW_SRQ 43 38 -5 mthca_HW2SW_MPT 60 55 -5 mthca_HW2SW_EQ 43 38 -5 mthca_HW2SW_CQ 43 38 -5 mthca_free_icm_table 120 114 -6 mthca_query_srq 214 206 -8 mthca_free_qp 662 654 -8 mthca_cmd 38 28 -10 mthca_alloc_db 1321 1311 -10 mthca_setup_hca 1067 1055 -12 mthca_WRITE_MTT 35 22 -13 mthca_WRITE_MGM 40 27 -13 mthca_UNMAP_ICM_AUX 36 23 -13 mthca_UNMAP_FA 36 23 -13 mthca_SYS_DIS 36 23 -13 mthca_SYNC_TPT 36 23 -13 mthca_SW2HW_SRQ 35 22 -13 mthca_SW2HW_MPT 35 22 -13 mthca_SW2HW_EQ 35 22 -13 mthca_SW2HW_CQ 35 22 -13 mthca_RUN_FW 36 23 -13 mthca_DISABLE_LAM 36 23 -13 mthca_CLOSE_IB 36 23 -13 mthca_CLOSE_HCA 38 25 -13 mthca_ARM_SRQ 39 26 -13 mthca_free_icms 178 164 -14 mthca_QUERY_DDR 389 375 -14 mthca_resize_cq 1063 1048 -15 mthca_unmap_eq_icm 123 107 -16 mthca_map_eq_icm 396 380 -16 mthca_cmd_box 90 74 -16 mthca_SET_IB 433 417 -16 mthca_RESIZE_CQ 369 353 -16 mthca_MAP_ICM_page 240 224 -16 mthca_MAP_EQ 183 167 -16 mthca_INIT_IB 473 457 -16 mthca_INIT_HCA 745 729 -16 
mthca_map_user_db 816 798 -18 mthca_SYS_EN 157 139 -18 mthca_cleanup_qp_table 78 59 -19 mthca_cleanup_eq_table 168 149 -19 mthca_UNMAP_ICM 143 121 -22 mthca_modify_srq 172 149 -23 mthca_unmap_fmr 198 174 -24 mthca_query_qp 814 790 -24 mthca_query_pkey 343 319 -24 mthca_SET_ICM_SIZE 34 10 -24 mthca_QUERY_DEV_LIM 1870 1846 -24 mthca_map_cmd 1130 1105 -25 mthca_ENABLE_LAM 401 375 -26 mthca_modify_port 247 220 -27 mthca_query_device 884 850 -34 mthca_NOP 75 41 -34 mthca_table_get 287 249 -38 mthca_init_qp_table 333 293 -40 mthca_MODIFY_QP 348 308 -40 mthca_close_hca 131 89 -42 mthca_free_eq 435 390 -45 mthca_query_port 755 705 -50 mthca_free_cq 581 528 -53 mthca_alloc_icm_table 578 524 -54 mthca_multicast_attach 1041 986 -55 mthca_init_hca 326 271 -55 mthca_query_gid 487 431 -56 mthca_free_srq 524 468 -56 mthca_free_mr 168 111 -57 mthca_create_eq 1560 1501 -59 mthca_multicast_detach 790 728 -62 mthca_write_mtt 918 854 -64 mthca_register_device 1406 1342 -64 mthca_fmr_alloc 947 883 -64 mthca_mr_alloc 652 582 -70 mthca_process_mad 1242 1164 -78 mthca_dev_lim 910 830 -80 find_mgm 482 400 -82 mthca_modify_qp 3852 3753 -99 mthca_init_cq 1281 1181 -100 mthca_alloc_srq 1719 1610 -109 mthca_init_eq_table 1807 1679 -128 mthca_init_tavor 761 491 -270 mthca_init_arbel 2617 2098 -519 Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
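The single-return-value convention described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the mthca code: the status codes, `demo_trans_table`, `demo_post_cmd()` and `demo_fw_command()` are all hypothetical, and the particular status-to-errno mapping is invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical firmware status codes. */
enum demo_fw_status {
    DEMO_FW_OK = 0,
    DEMO_FW_BUSY = 1,
    DEMO_FW_BAD_PARAM = 2,
    DEMO_FW_NO_RES = 3,
};

/* Translation table from firmware status to a negative errno (mapping assumed). */
static const int demo_trans_table[] = {
    [DEMO_FW_OK]        = 0,
    [DEMO_FW_BUSY]      = -EBUSY,
    [DEMO_FW_BAD_PARAM] = -EINVAL,
    [DEMO_FW_NO_RES]    = -ENOMEM,
};

static int demo_status_to_errno(uint8_t status)
{
    if (status >= sizeof(demo_trans_table) / sizeof(demo_trans_table[0]))
        return -EIO;    /* unknown status: fall back to a generic error */
    return demo_trans_table[status];
}

/* Stand-in for posting a command: returns a transport error, or 0 and
 * stores the firmware status. The arguments simulate the outcome. */
static int demo_post_cmd(int transport_err, uint8_t fw_status, uint8_t *status)
{
    if (transport_err)
        return -ETIMEDOUT;
    *status = fw_status;
    return 0;
}

/* New style: the status is folded into the one return value, so callers
 * only write "err = cmd(...); if (err) goto out;". */
static int demo_fw_command(int transport_err, uint8_t fw_status)
{
    uint8_t status = 0;
    int err = demo_post_cmd(transport_err, fw_status, &status);

    if (err)
        return err;
    return demo_status_to_errno(status);
}
```

Keeping the translation in one place is also why a table can grow while nearly every caller shrinks, which matches the shape of the size report above.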
2011-07-05Merge branch 'master' of ↵David S. Miller6-22/+66
2011-06-17Merge branches 'cxgb4' and 'qib' into for-nextRoland Dreier2-8/+23
2011-06-17IB/qib: Ensure that LOS and DFE are being turned offMitko Haralanov2-8/+23
Due to timing, it is possible for the LOS and DFE to remain on. This is due to the link progressing to LinkUP prior to the driver getting the first Status Changed interrupt. By expanding the conditions under which LOS is turned off and DFE timeout is being set, timing is no longer an issue. Signed-off-by: Mitko Haralanov <mitko@qlogic.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-06-17RDMA/cxgb4: Couple of abort fixesSteve Wise2-13/+38
- fix a race where the driver could end up sending a close_con_req after an abort_rpl. In c4iw_ep_disconnect(), send abort or close request with the ep mutex held. - fix a hang where driver fails to wake up when a connection is reset during a normal close. Wake up any waiters in the interrupt path, and correctly cleanup after rdma_fini() failures. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-06-17RDMA/cxgb4: Don't truncate MR lengthsSteve Wise1-1/+1
Remove left-over code from T3 that limited MR sizes to 32b. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-06-17RDMA/cxgb4: Don't exceed hw IQ depth limit for user CQsSteve Wise1-0/+4
Memory allocated for user CQs gets rounded up to the next page boundary. And after rounding, we recalculate the resulting IQ depth and we need to make sure we don't exceed the HW limits. This bug can result in a much smaller CQ being allocated than was expected if the HW size field is exceeded, resulting in CQ overflow failures. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-06-06net: remove interrupt.h inclusion from netdevice.hAlexey Dobriyan1-0/+1
* remove interrupt.h inclusion from netdevice.h -- not needed * fixup fallout, add interrupt.h and hardirq.h back where needed. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-26Merge branch 'for-linus' of ↵Linus Torvalds6-28/+30
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: RDMA/cma: Save PID of ID's owner RDMA/cma: Add support for netlink statistics export RDMA/cma: Pass QP type into rdma_create_id() RDMA: Update exported headers list RDMA/cma: Export enum cma_state in <rdma/rdma_cm.h> RDMA/nes: Add a check for strict_strtoul() RDMA/cxgb3: Don't post zero-byte read if endpoint is going away RDMA/cxgb4: Use completion objects for event blocking IB/srp: Fix integer -> pointer cast warnings IB: Add devnode methods to cm_class and umad_class IB/mad: Return EPROTONOSUPPORT when an RDMA device lacks the QP required IB/uverbs: Add devnode method to set path/mode RDMA/ucma: Add .nodename/.mode to tell userspace where to create device node RDMA: Add netlink infrastructure RDMA: Add error handling to ib_core_init()
2011-05-25Merge branches 'cma', 'cxgb3', 'cxgb4', 'misc', 'nes', 'netlink', 'srp' and ↵Roland Dreier6-28/+30
'uverbs' into for-next
2011-05-24RDMA/nes: Add a check for strict_strtoul()Liu Yuan1-1/+3
It should check if strict_strtoul() succeeds before using 'wqm_quanta_value'. Signed-off-by: Liu Yuan <tailai.ly@taobao.com> [ Convert to kstrtoul() directly while we're here. - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-24RDMA/cxgb3: Don't post zero-byte read if endpoint is going awaySteve Wise3-13/+21
tx_ack() wasn't checking the endpoint state and consequently would attempt to post the p2p 0B read on an endpoint/QP that is closing or aborting. This causes a NULL pointer dereference crash. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-24RDMA/cxgb4: Use completion objects for event blockingSteve Wise1-13/+5
There exists a race condition when using wait_queue_head_t objects that are declared on the stack. This was being done in a few places where we are sending work requests to the FW and awaiting replies, but we don't have an endpoint structure with an embedded c4iw_wr_wait struct. So the code was allocating it locally on the stack. Bad design. The race is: 1) thread on cpuX declares the wait_queue_head_t on the stack, then posts a firmware WR with that wait object ptr as the cookie to be returned in the WR reply. This thread will proceed to block in wait_event_timeout() but before it does: 2) An interrupt runs on cpuY with the WR reply. fw6_msg() handles this and calls c4iw_wake_up(). c4iw_wake_up() sets the condition variable in the c4iw_wr_wait object to TRUE and will call wake_up(), but before it calls wake_up(): 3) The thread on cpuX calls c4iw_wait_for_reply(), which calls wait_event_timeout(). The wait_event_timeout() macro checks the condition variable and returns immediately since it is TRUE. So this thread never blocks/sleeps. The function then returns, effectively deallocating the c4iw_wr_wait object that was on the stack. 4) So at this point cpuY has a pointer to the c4iw_wr_wait object that is no longer valid. Further, it's pointing to a stack frame that might now be in use by some other context/thread. So cpuY continues execution and calls wake_up() on a ptr to a wait object that has been effectively deallocated. This race, when it hits, can cause a crash in wake_up(), which I've seen under heavy stress. It can also corrupt the referenced stack which can cause any number of failures. The fix: Use struct completion, which supports on-stack declarations. Completions use a spinlock around setting the condition to true and the wake up so that steps 2 and 4 above are atomic and step 3 can never happen in-between. Signed-off-by: Steve Wise <swise@opengridcomputing.com>
2011-05-22Add appropriate <linux/prefetch.h> include for prefetch usersPaul Gortmaker1-0/+1
After discovering that wide use of prefetch on modern CPUs could be a net loss instead of a win, net drivers which were relying on the implicit inclusion of prefetch.h via the list headers showed up in the resulting cleanup fallout. Give them an explicit include via the following $0.02 script. ========================================= #!/bin/bash MANUAL="" for i in `git grep -l 'prefetch(.*)' .` ; do grep -q '<linux/prefetch.h>' $i if [ $? = 0 ] ; then continue fi ( echo '?^#include <linux/?a' echo '#include <linux/prefetch.h>' echo . echo w echo q ) | ed -s $i > /dev/null 2>&1 if [ $? != 0 ]; then echo $i needs manual fixup MANUAL="$i $MANUAL" fi done echo ------------------- 8\<---------------------- echo vi $MANUAL ========================================= Signed-off-by: Paul <paul.gortmaker@windriver.com> [ Fixed up some incorrect #include placements, and added some non-network drivers and the fib_trie.c case - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds5-58/+13
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.
2011-05-20RDMA: Add netlink infrastructureRoland Dreier1-1/+1
Add basic RDMA netlink infrastructure that allows for registration of RDMA clients for which data is to be exported and supplies message construction callbacks. Signed-off-by: Nir Muchtar <nirm@voltaire.com> [ Reorganize a few things, add CONFIG_NET dependency. - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-20Merge remote branch 'origin/master' into mergeBenjamin Herrenschmidt10-120/+117
Manual merge of arch/powerpc/kernel/smp.c and add missing scheduler_ipi() call to arch/powerpc/platforms/cell/interrupt.c Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-05-19Merge branch 'merge' into nextBenjamin Herrenschmidt3-3/+3
2011-05-12Merge branches 'cma', 'cxgb4' and 'qib' into for-nextRoland Dreier8-111/+108
2011-05-12IB/qib: Use pci_dev->revisionSergei Shtylyov1-4/+1
The driver reads PCI revision ID from the PCI configuration register while it's already stored by PCI subsystem in the revision field of struct pci_dev. Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09RDMA/iwcm: Get rid of enum iw_cm_event_statusRoland Dreier2-9/+9
The IW_CM_EVENT_STATUS_xxx values were used in only a couple of places; cma.c uses -Exxx values instead, and so do the amso1100, cxgb3 and cxgb4 drivers -- only nes was using the enum values (with the mild consequence that all nes connection failures were treated as generic errors rather than reported as timeouts or rejections). We can fix this confusion by getting rid of enum iw_cm_event_status and using a plain int for struct iw_cm_event.status, and converting nes to use -Exxx as the other iWARP drivers do. This also gets rid of the warning drivers/infiniband/core/cma.c: In function 'cma_iw_handler': drivers/infiniband/core/cma.c:1333:3: warning: case value '4294967185' not in enumerated type 'enum iw_cm_event_status' drivers/infiniband/core/cma.c:1336:3: warning: case value '4294967186' not in enumerated type 'enum iw_cm_event_status' drivers/infiniband/core/cma.c:1332:3: warning: case value '4294967192' not in enumerated type 'enum iw_cm_event_status' Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Faisal Latif <faisal.latif@intel.com>
2011-05-09IB/ipath: Use pci_dev->revision, againSergei Shtylyov1-8/+1
Commit 44c10138fd4b ("PCI: Change all drivers to use pci_device->revision") already converted this driver to using the revision field of struct pci_dev but commit bb9171448deb ("IB/ipath: Misc changes to prepare for IB7220 introduction") later reverted that change for some strange reason. Restore the change. Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09IB/qib: Prevent driver hang with unprogrammed boardsMitko Haralanov1-1/+2
The time limit test now correctly checks against current jiffies to avoid the hang. Signed-off-by: Mitko Haralanov <mitko@qlogic.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09RDMA/cxgb4: EEH errors can hang the driverSteve Wise3-53/+66
A few more EEH fixes: c4iw_wait_for_reply(): detect fatal EEH condition on timeout and return an error. The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH event followed by a ib_register_device() when the device was reinitialized. However, the RDMA core doesn't allow multiple iterations of register/deregister by the provider. See drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where the kobject ref is held until the device is deallocated in ib_deallocate_device(). Calling deregister adds this kobj reference, and then a subsequent register call will generate a WARN_ON() from the kobject subsystem because the kobject is being initialized but is already initialized with the ref held. So the provider must deregister and dealloc when resetting for an EEH event, then alloc/register to re-initialize. To do this, we cannot use the device ptr as our ULD handle since it will change with each reallocation. This commit adds a ULD context struct which is used as the ULD handle, and then contains the device pointer and other state needed. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09RDMA/cxgb4: Reset wait condition atomicallySteve Wise2-30/+26
The driver was never really waiting for RDMA_WR/FINI completions because the condition variable used to determine if the completion happened was never reset, and this condition variable is reused for both connection setup and teardown. This causes various driver crashes under heavy loads due to releasing resources too early. The fix is to use atomic bits to correctly reset the condition immediately after the completion is detected. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09RDMA/cxgb4: Fix missing parenthesesRoel Kluin1-1/+1
Parens are missing: '|' has a higher precedence than '?'. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Acked-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
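The precedence issue can be made concrete with a small sketch. The function names and values below are hypothetical; they only show how C groups an expression of the form `base | cond ? a : b` compared with what such code usually intends.

```c
#include <assert.h>

/* How the compiler groups "base | cond ? a : b":
 * '|' binds tighter than '?:', so the OR becomes the condition. */
static unsigned as_parsed(unsigned base, int cond, unsigned a, unsigned b)
{
    return (base | (unsigned)cond) ? a : b;
}

/* What is usually meant: OR the base flags with the selected value. */
static unsigned as_intended(unsigned base, int cond, unsigned a, unsigned b)
{
    return base | (cond ? a : b);
}
```

With `base = 0x10`, `cond = 0`, `a = 0x1` and `b = 0x2`, the unparenthesized grouping yields `0x1` (the base flags are lost and the wrong arm is chosen), while the parenthesized form yields `0x12`.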
2011-05-09RDMA/cxgb4: Initialization errors can cause crashSteve Wise1-3/+3
c4iw_uld_add() must return ERR_PTR() values instead of NULL on failure. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
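Why NULL is the wrong failure value can be sketched in userspace, assuming the caller tests the result with an IS_ERR()-style check (the commit message implies this but does not state it). The `demo_*` helpers below are simplified re-implementations written for this example, modeled on the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention; they are not the kernel's definitions, and `struct demo_ctx` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static inline void *demo_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
static inline long demo_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True only for pointers in the top DEMO_MAX_ERRNO bytes of the address space. */
static inline int demo_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_ctx { int id; };            /* hypothetical context struct */
static struct demo_ctx demo_storage;

/* Old behavior: NULL on failure. An IS_ERR-style check does not catch it. */
static void *demo_add_returning_null(int fail)
{
    return fail ? NULL : (void *)&demo_storage;
}

/* Fixed behavior: an encoded errno on failure. */
static void *demo_add_returning_err_ptr(int fail)
{
    return fail ? demo_err_ptr(-ENOMEM) : (void *)&demo_storage;
}
```

Because `demo_is_err(NULL)` is false, a caller that only checks for error pointers would treat a NULL return as a valid handle and dereference it later, which is consistent with the crash the subject line describes.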
2011-05-09RDMA/cxgb4: Don't change QP state outside EP lockSteve Wise3-12/+9
Concurrent ingress CLOSE and ULP ABORT operations cause a crash due to a race condition where the close path releases the EP lock and then tries to move the QP state to CLOSED. This must be done inside the EP lock to avoid the race. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-03ipv4: Make caller provide on-stack flow key to ip_route_output_ports().David S. Miller2-2/+4
Signed-off-by: David S. Miller <davem@davemloft.net>

