Friday, 29 July 2011

[13:04:50] Andrew McGregor joins the room
[13:05:13] Wes George joins the room
[13:05:34] <Wes George> agenda
[13:07:01] <Wes George> chairs note that there is a webex session. I do not have the info on it, but slides will be shared there
[13:07:55] <Wes George> discussing
[13:11:17] Benson Schliesser joins the room
[13:12:37] <Wes George> benson, can you please link to the webex in jabber?
[13:14:52] <Wes George> in the last mtg, we had a lot of dc operators, how many present in this mtg?
[13:14:55] <Wes George> a few hands
[13:15:11] <Wes George> history of this wg has been strange - conceived only to study requirements
[13:15:19] <Wes George> scopes have been reduced and tuned down
[13:15:26] <Wes George> are we going to do any work here, or just study the requirements
[13:15:32] <Wes George> benson: depends on how you define work
[13:15:37] <Wes George> requirements is a good bit of work
[13:15:42] <Wes George> protocols? no, wrong area
[13:15:49] <Wes George> q: where are we pursuing solutions
[13:16:05] <Wes George> ron bonica: developing reqs is a necessary first step
[13:16:12] <Wes George> once this is done, protos can't be done in ops area
[13:16:34] <Wes George> if we find protocols that need to be developed, gets pushed to ADs to see if existing WGs can solve problem, otherwise charter new ones
[13:16:46] <Wes George> previous speaker was hamachi (ciena)
[13:17:29] <Wes George> tom: let's not get hung up in process on what we do next
[13:17:37] <Wes George> bottom line - need to figure out what the real problems are
[13:17:50] <Wes George> if we start worrying about whether we're chartered to do work, we'll go in circles
[13:17:59] <Wes George> if you're worried about timeliness, the answer is to start
[13:18:08] <Wes George> if it takes us a year to get reqs out the door, we've blown it
[13:18:15] <Wes George> thomas narten
[13:18:51] <Wes George> next slides actually are posted:
[13:19:20] <Wes George> susan hares presenting
[13:19:30] <Wes George> slide 2
[13:19:41] Bill joins the room
[13:19:52] Bill has set the subject to: Webex:
[13:19:52] <Wes George> asked operators for pain points
[13:20:30] <Wes George> slide 3
[13:20:46] <Wes George> adhost - medium size DC case
[13:21:27] <Wes George> most info available online at (agenda for June meeting)
[13:21:36] <Wes George> slide 4
[13:21:49] <Wes George> context - small, not really a problem
[13:22:00] <Wes George> we know large (goog, y! amazon) is a problem
[13:22:22] <Wes George> we also know medium adhost, etc has a different focus
[13:22:25] <Wes George> slide 5
[13:22:39] <Wes George> why L2 in DC?
[13:23:02] <Wes George> customers come to DC providers, "have existing app, don't want to change it"
[13:23:11] <Wes George> willing to pay money to not change
[13:23:26] <Wes George> may want interconnect @ l2 because that's how it works
[13:23:32] <Wes George> slide 6
[13:23:44] <Wes George> go grid example from their website
[13:23:59] <Wes George> as these grow, it doesn't work - arp starts becoming a problem
[13:24:17] <Wes George> need numbers to quantify, but hard to get/publish numbers
[13:24:20] <Wes George> slide 7
[13:24:24] <Wes George> most DCs have multiple tenants
[13:24:28] <Wes George> tenants need L2
[13:24:40] <Wes George> see the problem, don't have manpower/time to look at the problem, but know it's a problem
[13:24:43] <Wes George> (arp_)
[13:24:58] <Wes George> not trying to convince them the biz model is good or bad
[13:25:09] <Wes George> told us they see a problem, and as much as they know and can disclose about it
[13:25:30] <Wes George> thomas narten:
[13:25:59] <Wes George> make the observation that in my problem statement, ruled things at different sites out of scope
[13:26:10] <Wes George> I'd like to understand the issues of L2 spans between sites/implications
[13:26:22] <Wes George> do we have funny boxes in the middle? is it low bw link? high latency?
[13:26:30] <Wes George> trying to understand what that means to the appl
[13:26:47] <Wes George> susan: absolutely - when I asked adhost, they mostly use this in a single-city area
[13:26:55] <Wes George> rob shakir (cw)
[13:27:13] <Wes George> this is an opsarea - we shouldn't scope something, we should base this on problem statements from operations
[13:27:22] <Wes George> latency problem is understandable, but is that an address resolution issue?
[13:27:35] <Wes George> I build VPLS, my net can go from SF,ca to bangkok
[13:27:57] <Wes George> I'm not sure high latency is an issue we should solve here
[13:28:13] <Wes George> warren - might be useful if you provide a picture of your problem
[13:28:29] <Wes George> rob: problem statement is an interesting discussion, but it really should document topologies
[13:28:39] <Wes George> this way we're solving real problems that already exist, not theoretical problems
[13:28:43] <Wes George> i'll volunteer to write
[13:28:49] <Wes George> I'm an operator from the uk
[13:29:06] <Wes George> if you go to NANOG and ask, I can't explain there because my competitors are there
[13:29:11] <Wes George> not exactly the easiest place to share
[13:29:16] <Wes George> better to share privately and generalize it
[13:29:34] <Wes George> biggest problem I have in getting people to interact with IETF is that they think that what we do is irrelevant, interaction is so important
[13:30:08] <Wes George> benson: don't want to put words in thomas's mouth - when he mentions latency, while it may be a problem for a number of reasons, does addr. resolution behave differently in differing latency?
[13:30:44] <Wes George> ning so from vz: hearing a lot of people trying to picture dc design in mind - always want to fit one size fits all model. understand Goog and Y! design, that's fine for them
[13:30:53] <Wes George> in multi-tenant, clientele are very different
[13:31:02] <Wes George> most often already run private DCs for many years
[13:31:11] <Wes George> instead of expanding, they want to offload some of those capacities
[13:31:19] <Wes George> rackspace/cpu power/servers to a provider
[13:31:23] <Wes George> doesn't mean they'll shut down their own
[13:31:35] <Wes George> will be an indef time period where apps will have to work across both
[13:31:43] <Wes George> this kind of arch from my perspective makes perfect sense
[13:31:47] <Wes George> there's no one size fits all
[13:32:14] <Wes George> andrew mgreedy - can switch vendors hands up?
[13:32:15] <Wes George> a few
[13:32:21] <Wes George> this sort of thing happens all of the time
[13:32:33] <Wes George> for the scope, the only impact for this group is what this does for addr resolution
[13:32:44] <Wes George> the presence of the scenario is definitely something that should be in scope
[13:32:47] <Wes George> happens often
[13:32:49] <Wes George> cathy aronsen
[13:32:55] <Wes George> have a bit of chicken and egg problem
[13:33:05] <Wes George> need to get provider input, but won't talk about it at nanog
[13:33:17] <Wes George> these guys went to nanog to interact with providers, how else do we do it?
[13:33:18] <Wes George> ideas?
[13:33:45] <Wes George> erik kline - clarifying questions - when it says ARP, did it mean arp or address-resolution more generally?
[13:33:51] <Wes George> address-resolution more generally
[13:34:09] <Wes George> randy bush - operators feel they gave significant input at nanog and other places, I think we have an ear resolution problem
[13:34:17] <Wes George> benson: good turnout at nanog meeting, good presentations
[13:34:33] <Wes George> I understand that the operators were intentionally vague about things, understand why
[13:34:42] <Wes George> if people want to provide more private feedback, come talk to the chairs
[13:34:52] <Wes George> agree with the things I'm hearing, but need to be put in text that is representative
[13:35:12] <Wes George> warren kumari: in many cases, this seems like "doctor, doctor it hurts when I do this"
[13:35:26] <Wes George> benson: right, just don't do that if it hurts.
[13:35:39] <Wes George> warren - it's not "don't do that" it's "do that more scalably"
[13:35:54] <Wes George> we tried to build large L2 networks, we stopped because they tend to fail painfully and spectacularly
[13:36:05] <Wes George> there's a reason the internet isn't an L2 network, so perhaps we need to think about this
[13:36:20] <Wes George> benson: "don't do that" might be a bcp, but we have to be able to articulate well
[13:36:32] <Wes George> warren: solutions you can do instead - vm through overlays
[13:36:37] <Wes George> clever re subnetting of your network
[13:36:45] <Wes George> poke your vendors, have them help explain
[13:37:15] <Wes George> ning so: to respond to "don't do that" as a network provider- we've known of scalability issues on the network for years. so we tell them don't do that
[13:37:22] <Wes George> they say - that's what I want, and your competitor does it
[13:37:27] <Wes George> we're forced to do it
[13:37:37] <Wes George> we have to force the vendors to push through their limitation
[13:37:49] <Wes George> referring to bgp route table problem on VPN PE router
[13:38:30] <Wes George> benson: if there are diff DC designs that have diff reqs, we probably need mult scenarios
[13:38:38] <Wes George> warren: saying "don't do that" probably not the way to word it
[13:38:50] <Wes George> this is a way to do it that is more stable, more scalable
[13:39:05] <Wes George> to some extent you do have responsibility to look out for and advise your custs
[13:39:29] <Wes George> randy: datacenters are designed diff, everyone has different IPR - wow, holy shit
[13:39:38] <Wes George> let's actually get down to work, or go get coffee
[13:40:14] <Wes George> slides now available:
[13:40:27] <Wes George> now reviewing
[13:40:34] <Wes George> jim rees presenting
[13:40:35] <Wes George> slide 2
[13:40:53] <Wes George> arp/ND used interchangeably
[13:41:04] <Wes George> dino farinacci - did you also study unknown unicast flooding?
[13:41:24] <Wes George> bridge doesn't know where MAC is, it floods - another class of broadcast
[13:41:38] <Wes George> we do have some spikes in the charts that may be due to that sort of thing
[13:41:54] <Wes George> had some unused blades in the DC, thought we'd generate lots of arp and see what happens
[13:42:01] <Wes George> slide 3
[13:42:26] <Wes George> built an emulator so that we can generate traffic in controlled conditions
[13:42:30] <Wes George> slide 4
[13:42:44] <Wes George> testbed - using 1 enclosure for experiment, others have production traffic
[13:42:54] <Wes George> tap is running tcpdump, looking for arp/nd, count packets
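The counting step described for the tap can be sketched roughly as below. This is not the presenters' tooling; `classify_frame` and `count_frames` are hypothetical names, and the sketch assumes untagged Ethernet II frames (no 802.1Q tag) and IPv6 packets without extension headers. It relies only on well-known constants: EtherType 0x0806 for ARP, 0x8035 for RARP, 0x86DD for IPv6, and ICMPv6 types 133-137 for Neighbor Discovery.

```python
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
ETHERTYPE_IPV6 = 0x86DD
IPPROTO_ICMPV6 = 58
ND_TYPES = range(133, 138)  # RS, RA, NS, NA, Redirect

ETH_HDR_LEN = 14
IPV6_HDR_LEN = 40


def classify_frame(frame: bytes) -> str:
    """Return 'arp', 'rarp', 'nd' or 'other' for one raw Ethernet frame.

    Assumption: untagged Ethernet II, and no IPv6 extension headers
    between the fixed IPv6 header and the ICMPv6 header.
    """
    if len(frame) < ETH_HDR_LEN:
        return "other"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ETHERTYPE_ARP:
        return "arp"
    if ethertype == ETHERTYPE_RARP:
        return "rarp"
    if ethertype == ETHERTYPE_IPV6 and len(frame) > ETH_HDR_LEN + IPV6_HDR_LEN:
        next_header = frame[ETH_HDR_LEN + 6]        # IPv6 Next Header field
        icmp_type = frame[ETH_HDR_LEN + IPV6_HDR_LEN]  # first byte of ICMPv6
        if next_header == IPPROTO_ICMPV6 and icmp_type in ND_TYPES:
            return "nd"
    return "other"


def count_frames(frames):
    """Tally frames by class, the way a tap script might per interval."""
    counts = {"arp": 0, "rarp": 0, "nd": 0, "other": 0}
    for frame in frames:
        counts[classify_frame(frame)] += 1
    return counts
```

Keeping ARP and RARP as separate counters makes it explicit which EtherType a given number refers to.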
[13:42:57] <Wes George> slide 5
[13:43:30] <Wes George> bot controller uses udp broadcasts, doesn't generate any arp traffic on its own
[13:43:43] <Wes George> anoop brocade - what's the protocol between the switches
[13:43:46] <Wes George> is it spanning tree?
[13:43:49] <Wes George> a: yes
[13:44:03] <Wes George> vmware has tricks when it's migrating hosts, but this is just spanning tree
[13:44:08] <Wes George> ok so some of these ports get blocked?
[13:44:30] <Wes George> couldn't hear benson's comment to scribe
[13:44:56] <Wes George> dino - are those links l3 or l2 - they're l3
[13:45:02] <Wes George> slide 6 - traffic gen
[13:45:42] <Wes George> can do full mesh between 500 hosts, or less meshy
[13:45:55] <Wes George> can vary load by varying sleep time
[13:45:59] <Wes George> slide 7
[13:46:16] <Wes George> slide 8
[13:47:25] <Wes George> erik kline - any conjecture as to why cpu load is higher for arp than nd?
[13:47:41] <Wes George> a: you would probably be better at conjecturing than me
[13:47:47] <Wes George> erik - don't know enough about the test
[13:48:11] <Wes George> manish - primary diff between the two is the way that they propagate broadcasts
[13:48:16] <Wes George> arp - propagated everywhere
[13:48:30] <Wes George> nd based on multicast, may not be propagated to our measurement point
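To illustrate the broadcast vs multicast point: an ARP request goes to the Ethernet broadcast address, while an ND Neighbor Solicitation goes to the target's solicited-node multicast group (ff02::1:ffXX:XXXX, built from the low 24 bits of the target address), which maps to a 33:33:xx:xx:xx:xx Ethernet address. A minimal sketch of that mapping follows; the function names are made up for illustration, and whether a given switch actually prunes these groups depends on its configuration.

```python
import ipaddress


def solicited_node_multicast(addr: str) -> ipaddress.IPv6Address:
    """Solicited-node multicast group for a unicast IPv6 address.

    The group is ff02::1:ff00:0/104 with the low 24 bits of the
    unicast address appended.
    """
    ip = ipaddress.IPv6Address(addr)
    low24 = int(ip) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Ethernet address for an IPv6 multicast group: 33:33 + low 32 bits."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


if __name__ == "__main__":
    group = solicited_node_multicast("2001:db8::1:2:3")
    print(group, multicast_mac(group))
```

Only hosts whose addresses share those low 24 bits join the group, so a bridge that tracks multicast membership need not deliver the solicitation to every port, which is one possible reason a measurement point would see fewer ND packets than ARP packets.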
[13:48:49] <Wes George> eric vynce - what is the size of the network?
[13:48:57] <Wes George> a: /22 for v4, v6 /64
[13:49:10] <Wes George> anoop - cpu load on access switch would be lower for
[13:49:20] <Wes George> ND - switch won't pick up ND msgs not destined to it
[13:49:24] <Wes George> a: this is from the outside
[13:50:06] <Wes George> dino: was the pps to the switch same for v4 vs v6?
[13:50:15] <Wes George> probably should be no since multicast vs broadcast
[13:50:49] <Wes George> v6 1000 pps, v4, 1000-1400
[13:51:00] <Wes George> slide 9
[13:51:08] <Wes George> effect of external scans (single)
[13:51:11] <Wes George> slide 10
[13:51:17] <Wes George> effect of number of hosts
[13:51:54] <Wes George> arp traffic as you scale up hosts is fairly linear
[13:51:58] <Wes George> slide 11
[13:52:04] <Wes George> IIb effect of traffic
[13:52:27] <Wes George> vary number of hosts up - you generate more arp for things that are down
[13:52:43] <Wes George> may be seeing limitations of the vmware setup
[13:52:48] <Wes George> slide 12
[13:52:55] <Wes George> IIC
[13:53:21] <Wes George> draft that goes along with this, that has more numbers, fewer graphs
[13:53:25] <Wes George> slide 13
[13:53:32] <Wes George> all 500 hosts up, varying traffic rate
[13:53:35] kfantonio joins the room
[13:53:51] <Wes George> again may be seeing vmware issues
[13:53:55] <Wes George> slide 14
[13:53:59] <Wes George> effects of machine failures
[13:54:12] <Wes George> start with many, start shutting them down
[13:54:32] <Wes George> burst of traffic while everyone finds each other- then climbs a bit as they go away
[13:54:39] <Wes George> slide 15 - effects of vm migrations
[13:54:55] <Wes George> force vms on a blade to move to another blade
[13:55:06] <Wes George> didn't seem to make much difference, may be VMware tricks
[13:55:18] <Wes George> dino - were you counting reverse arps - ethertype 806
[13:55:22] <Wes George> yes
[13:55:31] <Wes George> slide 16 - highest possible load
[13:55:41] <Wes George> turn everything on at once, vm sorta dies
[13:55:55] <Wes George> 500 hosts on 8 blades is oversubscribed
[13:56:26] <Wes George> dino - what cpu on the blade
[13:56:39] <Wes George> manish - dell, 2 quad-core on each blade, I think 2.6
[13:56:49] <Wes George> slide 18 - emulator validation
[13:56:55] <Wes George> skipped 17
[13:57:20] <Wes George> slide 19 - high emulator load
[13:57:55] <Wes George> real test was only able to run for about 30 sec. emulator ran for 30 min,
[13:58:03] <Wes George> slide 20 - emulator topo
[13:58:47] <Wes George> hosts sending/responding to arp traffic, management switch, and a test switch that allows us to use different vendors' switches
[13:58:54] <Wes George> skipped to conclusion/next steps slide
[13:59:02] <Wes George> no idea what number because there's no slide number
[13:59:45] <Wes George> thomas narten - this is interesting work. what papers or documents are available?
[13:59:53] <Wes George> charts from nanog aren't big enough to read
[14:00:09] <Wes George> can you make add'l material available that has higher-resolution charts
[14:00:17] <Wes George> i agree that the draft text-only is a limitation in this case
[14:00:38] <Wes George> these slides are higher resolution
[14:00:59] <Wes George> I expect you'd want to see the raw data, maybe more analysis
[14:01:03] <Wes George> I'd like to make it available
[14:01:13] <Wes George> manish - we could probably take the graphs and post it as a separate pdf
[14:01:22] <Wes George> thomas- we'd be able to ask more specific questions
[14:01:35] <Wes George> hemant
[14:01:38] <Wes George> ciena
[14:01:48] <Wes George> why doesn't vm migration have impact
[14:01:57] <Wes George> a: not familiar with VMware's tricks when migrating hosts
[14:02:08] <Wes George> (from one blade to another w/in cabinet)
[14:02:14] <Wes George> tries to minimize disruption in service
[14:02:29] <Wes George> told that vmware doesn't tell the switch
[14:02:38] <Wes George> list of things to do is to figure out what VMware is doing under the covers
[14:02:48] <Wes George> hemant- movement should create more arp traffic
[14:03:04] <Wes George> one of the reasons I want to switch to the emulator is to eliminate the effects of that
[14:03:10] <Wes George> which vmware config?
[14:03:23] <Wes George> manish - we have to bypass the vswitch and go directly to the bridge
[14:03:34] <Wes George> tried to eliminate as many intermediate switches as could
[14:03:44] <Wes George> core switch, top of rack switch,
[14:03:56] <Wes George> using conditional vswitches, not distributed
[14:04:07] <Bill> [conventional]
[14:06:43] <Wes George> warren - if you were to scale up to 20,000 hosts, what does it extrapolate to
[14:06:51] <Wes George> a: well, that goes above 100% cpu
[14:07:08] <Wes George> warren - that's not the CPU load we're seeing - arp load is in the noise
[14:07:27] <Wes George> if all of my machines were in the 100% load doing arp, I wouldn't be serving traffic and i'd be upset
[14:07:38] <Wes George> my switches have dinky little cpus, they're spending time doing logging, etc
[14:07:58] <Wes George> benson: make sure we're talking about the same thing - you talking about 20k hosts on the same l2 segment?
[14:08:12] <Wes George> warren - saying that this exists, looking at the cpu on the switch, it's nowhere near that
[14:08:19] <Wes George> there is arp load, but not like this
[14:08:30] <Wes George> switches were not exploding from arp
[14:08:42] <Wes George> benson: trying to understand between distributed layer 3
[14:08:51] <Wes George> warren - this was 20k hosts talking to core switches
[14:08:55] <Wes George> host to host, host to switch
[14:09:04] <Wes George> some host to host, mostly out
[14:09:28] <Wes George> manish - one of the factors we tried to isolate - it's not the number of hosts, but traffic pattern
[14:09:33] <Wes George> scales linearly with cpu load
[14:09:35] <Wes George> not with pps
[14:09:40] <Wes George> but pps of arp traffic drives cpu
[14:09:52] <Wes George> if you have 20k hosts but they're not generating arp traffic, you won't see that volume
[14:10:24] <Wes George> randy bush - current trailer runs 20k hosts (not counting VMs)
[14:10:47] <Wes George> you'd like to plug that into the spine - maybe at layer 2, maybe at l3, but you'd like inside the trailer to be l2
[14:10:51] <Wes George> and that's ugly today
[14:11:26] <Wes George> thomas narten - discussion interesting, but leaves me asking, so what? is 20k pps ok or not ok?
[14:11:32] <Wes George> answer probably it depends
[14:11:41] <Wes George> assume homogeneous env, same os, etc
[14:11:50] <Wes George> probably useful to document different representative implementations
[14:12:04] <Wes George> ms has 2 different arp implementations - vista forward, and pre-vista
[14:12:22] <Wes George> post vista, arp and ND are the same implementation, earlier versions have a different algorithm
[14:12:31] <Wes George> we should document - linux, x, solaris, etc
[14:12:35] <Wes George> steady state
[14:12:47] <Wes George> might help us understand dynamics better
[14:13:06] <Wes George> a: unfortunately I don't have the use of this blade encl forever - perhaps randy can bring his trailer?
[14:13:34] <Wes George> thomas - 20k nodes on L2, we can do that today - yes you can do that today, but I wonder if that was something engineered that way, or if the dynamics are changing
[14:13:53] <Wes George> is it less hard to make work because of changing traffic flows, or what are the impacts of arp traffic on scalability
[14:14:12] <Wes George> rob shakir - key point is to look at tests like this with benchmarking against implementations
[14:14:18] <Wes George> only tells us about that specific imple
[14:14:37] <Wes George> I have scaling problems that I show to my vendor, and they say, I didn't optimize for that scenario, we'll do that now
[14:14:51] <Wes George> need to specify which problems are protocol space and which are implementation space
[14:15:01] <Wes George> switch vendors, can you share how you optimize?
[14:15:51] <Wes George> speaker - 4k vpls nodes - 3% arp traffic
[14:15:54] <Wes George> 3-6%
[14:16:02] <Wes George> don't see major cpu impact
[14:16:14] <Wes George> maybe not exactly like DC
[14:16:25] <Wes George> warren- just ran some numbers, with wild assumptions
[14:16:30] <Wes George> 20K hosts, all time out in 30 sec
[14:16:39] <Wes George> every single host talks to each host
[14:16:43] <Wes George> 666 resolutions per sec
[14:17:05] <Wes George> if you can't do that with a reasonable cpu, or even a small one, there is something fundamentally wrong
[14:17:10] <Wes George> this isn't a complex protocol
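The log does not show Warren's working, so the following is only one plausible reading of the figure: a cache of 20,000 entries, each re-resolved once per 30-second timeout, works out to roughly 666 resolutions per second. The function name and the uniform-refresh model are assumptions made for illustration.

```python
def resolutions_per_second(cache_entries: int, timeout_s: float) -> float:
    """Steady-state resolution rate if every cache entry is refreshed
    exactly once per timeout period (an assumed, simplified model)."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return cache_entries / timeout_s


if __name__ == "__main__":
    # 20K entries, 30 s timeout, as in the numbers quoted in the room
    print(int(resolutions_per_second(20_000, 30)))
```

Under the same model, a host that talks to every other host holds about 20,000 entries itself, so the per-host rate is in the same range. Real stacks refresh entries lazily and only for active peers, so this is closer to an upper bound than a prediction.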
[14:17:20] <Wes George> benson - good input on more work to do
[14:17:23] <Wes George> please volunteer
[14:18:04] <Wes George> now discussing:
[14:18:07] <Wes George> slide 2
[14:18:55] <Wes George> oh, my comment from earlier - can we please get VMware and others involved to discuss their implementations so that we stop guessing at which results are due to VMware secret sauce vs real results - can we optimize around their implementations (where there's commonality)
[14:19:55] <Wes George> slide 3
[14:20:06] <Wes George> new idea, not implemented anywhere
[14:20:31] <Wes George> see some issues, but not completely thought out yet
[14:20:32] <Wes George> slide 4
[14:20:36] <Wes George> pretty picture
[14:21:08] <Wes George> different colors indicate different customer VPN
[14:21:15] <Wes George> want the control function to be in the VPN PE router
[14:21:20] <Wes George> also residing at DC GW router
[14:21:30] <Wes George> in that control point, lots of addresses (L2 and L3)
[14:21:47] <Wes George> to control what's happening to the VMs, you have to see the entire network that the vpn can reach so that you make appropriate decisions
[14:21:50] <Wes George> slide 5
[14:22:34] Erik Nordmark joins the room
[14:22:42] <Wes George> slide 6
[14:24:06] <Wes George> slide 7
[14:24:28] <Wes George> vpn pes at each customer private DCs, also provider DC PEs
[14:24:32] <Wes George> tables can get nasty
[14:24:53] <Wes George> dino - in terms of routers and switches - if you have to store 1K entries, doesn't matter if it's routes/macs/vlan combination
[14:25:02] <Wes George> are you saying complicated due to management, or due to scale?
[14:25:27] <Wes George> a: if you have that many vpns and that many routes, the number of route tables associated can grow exponentially
[14:25:42] <Wes George> it's a fairly well-known problem that the edge router size limitations cause issues
[14:25:48] <Wes George> this will make this problem worse
[14:25:54] <Wes George> slide 8
[14:26:32] <Wes George> private addressing conflicts
[14:26:49] <Wes George> slide 9
[14:27:20] <Wes George> thomas narten : is the address conflict the pain point
[14:27:24] <Wes George> a: it could be
[14:27:37] <Wes George> thomas - wearing my narrow scope hat, that's out of scope for address resolution
[14:27:51] <Wes George> important to have the discussion, but it's a separate q as to whether / how we solve it
[14:28:05] <Wes George> ning - having problems fitting problem into WG charter
[14:28:09] <Wes George> understand, apologize
[14:28:15] <Wes George> someone can point me elsewhere
[14:28:32] <Wes George> anoop - see them as 2 sep problems
[14:28:41] <Wes George> clashing addresses can be solved with isolation
[14:28:45] <Wes George> andrew mcgregor
[14:29:01] <Wes George> mitigation techniques can make problem of address resolution worse
[14:29:09] <Wes George> the more complex you have, the worse the issue gets
[14:29:20] <Wes George> so, yes and no - somewhat peripheral but related
[14:29:36] <Wes George> ning- have not thought through solution, may not belong here, but would like to discuss
[14:29:54] <Wes George> manish - armd charter- arp/nd issues caused by implementations of large dcs
[14:30:03] <Wes George> think this is appropriate
[14:30:12] <Wes George> thomas- can we tease out the pain points?
[14:30:26] <Wes George> something inherent to L3vpns and overlapping address space that makes this worse?
[14:30:41] <Wes George> ning - no implementation - this is based on experience in other types of networks with similar problems
[14:31:01] <Wes George> dino - frame a diff way
[14:31:31] <Wes George> if you want 20k servers in an l2 domain, you'll have problems of many types. the idea here is that if you also have 20k vpns, that makes it even worse
[14:31:52] <Wes George> benson - would like to get the list involved in the discussion - pointer and description of what you'd like to work on
[14:32:03] <Wes George> hemant tushan from ciena
[14:32:21] <Wes George> legitimate problem
[14:32:55] <Wes George> rob shakir - I run l3vpn in this context - do you have the problem everywhere? you have a bunch of vpns, but you have a bunch of circuits, etc
[14:33:04] <Wes George> so it's an overall scaling problem
[14:33:12] <Wes George> do I scale up or across- do I add more boxes?
[14:33:22] <Wes George> I don't want to buy more boxes, but I also need to limit failure domains
[14:33:28] <Wes George> look at what's practical
[14:33:34] <Wes George> what's cheap
[14:33:48] <Wes George> ning - many of the vendors would love us to buy more boxes to solve problem
[14:34:01] <Wes George> rob - point is that we should improve efficiency where we can
[14:34:16] <Wes George> but if you have 20k hosts in a massive dc connected to 200 vpns, I don't think you'd connect it to the same box
[14:34:35] <Wes George> ning - I have a solution in mind, buy more and scale horizontally is a nice benefit to vendor
[14:34:50] <Wes George> in our current env, without adding this dimension, we've already had to do that
[14:34:58] <Wes George> so adding this dimension, it worsens the current problem
[14:35:23] <Wes George> anoop - as you grow dc, more customers you want to isolate
[14:35:29] <Wes George> there are issues like power and cooling that limit that
[14:35:35] <Wes George> you want to maximize the amount of sharing
[14:35:41] <Wes George> providing reliability through redundant equipment
[14:35:52] <Wes George> echo what dino says - provides an add'l multiplier
[14:36:00] <Wes George> average router has 100 ports vs 10 ports years ago
[14:36:03] <Wes George> more devices might have 1000
[14:36:18] <Wes George> speaker: key is not number of ports per box
[14:36:37] <Wes George> key is number of ports per broadcast domain
[14:36:43] <Wes George> in vpls, people don't put more than 500 ports
[14:37:20] <Wes George> ning - I see that applied to reduce the pain, but whether that is the silver bullet we don't know
[14:37:34] <Wes George> david - is the cpu the limit?
[14:37:56] <Wes George> speaker - comment is about BCP more than the limit - we're seeing 3-6% arp
[14:38:20] <Wes George> warren - also in many cases, the 5 things in the vpls domain are 5 routers, not 5 workstations. connects offices together
[14:38:29] <Wes George> each office is a subnet, you want each to be able to talk to one another
[14:38:39] <Wes George> everything behind that isn't visible to the provider
[14:38:57] <Wes George> benson - would you apply the description of network you just gave to a DC
[14:39:12] <Wes George> warren - yes
[14:39:28] <Wes George> it's a strange setup to have database in one DC and frontend in another
[14:39:49] <Wes George> VM drives this sort of thing - geographically diverse, but must talk to each other
[14:40:00] <Wes George> DC provider provides hardware, Hypervisor, way to communicate between them
[14:40:21] <Wes George> if you're the person operating the VM host/hypervisor - you can provide L2 mobility by doing stuff on the hypervisor rather than in the network
[14:40:25] <Wes George> stuff is more expensive and slower
[14:40:39] <Wes George> benson - what I'm hearing is "well don't do that" which we touched on
[14:40:48] <Wes George> warren, yes - don't do that, and I'd like a pony
[14:41:13] <Wes George> benson - if you think of the merit research, if we see that it breaks at some point, then we can say "don't do that because" with a hard number
[14:41:20] <Wes George> warren - arp is one of the scaling things
[14:41:46] <Wes George> if you actually run tcpdump on a network like this, arp is less than half of the traffic - and this is an arp-heavy network because it's mostly wireless
[14:42:03] <Wes George> it'd be helpful for people who run DCs like this to capture some stats on arp
[14:42:26] <Wes George> possibly it's not arp resolution, but broadcast suppression and mitigation
[14:43:59] <Wes George> my comment - perhaps those who did the stats can make their tools available so that people can use this to gather stats in their own mid-range dcs
[14:44:24] <Wes George> now discussing
[14:44:26] <Wes George> slide 2
[14:45:33] <Wes George> slide 3 - vulnerability issues 1
[14:48:51] <Wes George> slide- vulnerability issues 2
[14:49:10] <Wes George> narten - my assumption is that when VM1 moves to a new machine, it keeps its mac address
[14:49:17] <Wes George> so which device has the wrong entry
[14:49:20] <Wes George> it's the router
[14:49:28] <Wes George> no, because the IP/mac binding has not changed
[14:49:34] <Wes George> no, it also includes the interface info
[14:49:41] <Wes George> thomas - not sure I agree with that
[14:49:51] <Wes George> are you saying that s1/s2 is in one L2 domain, one subnet
[14:50:05] <Wes George> normally 2 interfaces are 2 subnets if the router is l3
[14:50:17] <Wes George> dino - scenario - if1 and if2 are l2 interfaces, router is a bridge
[14:50:40] <Wes George> warren - arp entry itself maps from ip to mac
[14:50:48] <Wes George> the switching infra determines which port
[14:51:34] <Wes George> it's a layer 2 event. that has nothing to do with arp
[14:51:40] <Wes George> that's a switching infra change
[14:51:57] <Wes George> packet comes out of router, hits switching infra, even if it's part of the same router
[14:52:10] <Wes George> that's a layer 2 function, doesn't matter if it's an arp packet, or shiny daisy packet
[14:52:16] <Wes George> this is a bridge learning problem
[14:52:23] <Wes George> and bridge sees it all as frames
[14:52:38] <Wes George> thomas - arp it's per interface, multiple interfaces, arp separately on each interfaces
[14:53:10] <Wes George> need to be clear about whether it's an arp problem, implementation problem, or a switch learning problem
[14:53:24] <Wes George> speaker- router probably not good word here
[14:54:16] <Wes George> thomas - router has right entry, sends packet, but doesn't get there because switching isn't right
[14:54:29] <Wes George> as soon as VM1 sends something out, then this gets corrected via normal mac learning process, right
[14:54:34] <Wes George> speaker - depends on implementation
[14:54:41] <Wes George> how you update mac entry, arp table entry
[14:54:50] <Wes George> some implementations don't correlate the two
[14:54:58] <Wes George> thomas - this is not an arp problem, it's a switch problem
[14:55:10] <Wes George> forwarding at layer 2, you see packet from a different place, you have to update your table
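The distinction drawn in this exchange (the IP-to-MAC binding is unchanged; only the MAC-to-port mapping is stale) can be sketched with a toy learning bridge. This is a deliberately simplified model with made-up names (`LearningBridge`, port labels `r`, `if1`, `if2`); it has no aging timers or VLANs and does not represent any particular vendor's behavior, which the speakers note varies by implementation.

```python
class LearningBridge:
    """Toy transparent bridge: learns source MACs, floods unknown ones."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the set of egress ports."""
        # Learning step: any frame (ARP or not) updates the source entry.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown unicast (or broadcast): flood to all other ports.
            return self.ports - {in_port}
        if out_port == in_port:
            return set()  # destination is on the ingress segment; filter
        return {out_port}


if __name__ == "__main__":
    bridge = LearningBridge(["r", "if1", "if2"])
    bridge.receive("if1", "vm1", "broadcast")   # VM1 first seen on if1
    print(bridge.receive("r", "rtr", "vm1"))     # {'if1'}
    # VM1 migrates behind if2 but has not sent anything yet:
    print(bridge.receive("r", "rtr", "vm1"))     # still {'if1'} (stale)
    bridge.receive("if2", "vm1", "rtr")          # any frame from VM1
    print(bridge.receive("r", "rtr", "vm1"))     # {'if2'}
```

In this model the router's ARP entry (IP to `vm1`) never changes; the stale state lives entirely in `mac_table`, and any frame sourced by the moved VM repairs it, whether or not that frame is a gratuitous ARP.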
[14:56:03] kfantonio leaves the room
[14:56:35] <Wes George> erik nordmark - source of confusion - using gratuitous arps when all they want to do is update the learning tables
[14:56:41] <Wes George> doesn't have to be arp
[14:56:58] <Wes George> hemant - agree that this isn't a migration issue for arp, different mechanism for bridging and mac learning
[14:57:00] <Wes George> not correct
[14:57:19] <Wes George> andrew mcgreedy -a gre that this isn't an arp issue, but gratuitous arp isn't the right thing to use
[14:57:44] <Wes George> david - gratuitous arp has a histpory prior to vm migration - used for mac address takeover/failover
[14:57:50] <Wes George> warren - vrrp example
[14:57:58] <Wes George> nicety, not a requirement ofr arp
[14:58:04] <Wes George> they use grat. arp because it's easy to do
[14:58:13] <Wes George> if you're doing arp snooping, that's interesting
[14:58:33] <Wes George> david - sending broadcast packets so you hit the tables. arp is nice because it's fairly benign
[14:58:44] <Wes George> slide - vulnerability issues (2) - DAD
[14:59:37] <Wes George> slide - feedback
[15:00:13] <Wes George> thomas - discussion has been helpful
[15:00:31] <Wes George> we start with "the problem is a" but as we tease it out we discover "the problem is z"
[15:00:37] <Wes George> the problem is really about updating bridge tables
[15:00:41] <Wes George> when doing a move
[15:00:54] <Wes George> arp is used by convention, it's benign, works generally well
[15:01:14] <Wes George> regular traffic also generally fixes this
[15:01:22] <Wes George> if the server is sending traffic, this will get updated quickly
[15:01:33] <Wes George> so we can have that conversation - how big is this problem in practice?
[15:01:38] <Wes George> is it worth spending a lot of time on
[15:01:49] <Wes George> on the DAD, how widely is DAD implemented in V4?
[15:01:59] <Wes George> is a client using DHCPv4 required to do it?
[15:02:05] <Wes George> it was a late addition to IPv4
[15:02:29] <Wes George> warren - generally a dhcp server pings the address before it hands out the lease
[15:02:33] <Wes George> it should know what it's handed out
[15:02:44] <Wes George> only reason it happens is if someone has manually configured
[15:02:57] <Wes George> or you have a dhcp server that blows up and loses its lease info
[15:03:18] <Wes George> benson - first problem - debate around arp or bridge learning
[15:03:24] <Wes George> problem that got looked over between arp and ND
[15:03:34] <Wes George> gracefulness with which they fail is different
[15:03:37] <Wes George> might possibly break ND?
[15:03:42] <Wes George> thomas - not clear to me
[15:03:55] <Wes George> benson - gratuitous arp to signal a move
[15:04:21] <Wes George> congestion - you lose an arp packet and things break worse
[15:04:28] <Wes George> in ND you keep having address resolution
[15:04:44] <Wes George> thomas - are you asking is arp better than nd in terms of robustness
[15:05:16] <Wes George> IPv6 detects failure within 30 sec, this only recently started happening in v4 in some MS implementations
[15:05:36] <Wes George> dino - experience with building routers- if arp can be used at controlplane, which ND wants to provide
[15:05:43] <Wes George> you can pre-populate the table
[15:05:59] <Wes George> the needle-in-haystack problem - packets do get lost between router and control plane
[15:06:09] <Wes George> if the gratuitous arps work, so fib can be populated
[15:06:15] <Wes George> therefore it's a control plane thing
[15:06:30] <Wes George> different from arp cache miss where you have packets directed towards a host with no entry
[15:06:42] <Wes George> nd helps because it does dad so things are populated before traffic needs to be sent
[15:06:57] <Wes George> having to populate mac-level info at layer 3, it complicates with a switch at the bottom
[15:07:06] <Wes George> warren - the router doesn't need to process the arp
[15:07:13] <Wes George> all you need is for the cam table to see the packet
[15:07:32] <Wes George> dino - yes the router wants to see the arp because it might not be in the cache
[15:07:54] <Wes George> warren - if the router doesn't have the arp in the cache, whether the VM is moving, doesn't matter
[15:08:02] <Wes George> warren - yes that has nothing to do with the mobility
[15:08:25] <Wes George> erik nordmark - gratuitous arp can be used for a few things, when ip/address map changes
[15:08:28] <Wes George> also for topology change
[15:08:34] <Wes George> router doesn't know which,
[15:08:39] <Wes George> has to look at it
[15:09:02] <Wes George> if you want to implement arp learning proxies, then it's good that they're in the same place
[15:09:09] <Wes George> benson - the slide is about vm's moving
[15:09:13] <Wes George> and bridge learning
[15:09:21] <Wes George> the comment on congestion deserves its own conversation
[15:09:30] <Wes George> if the arp cache entry times out and disappears you have problems
[15:09:35] <Wes George> in nd's case, you won't have that time out
[15:09:40] <Wes George> trying to separate the two
[15:09:56] <Wes George> hemant - ciena - our problem is within the subnet
[15:10:02] <Wes George> some of this discussion may not make sense
[15:10:28] <Wes George> moving within the subnet, the host has to generate something to update the bridge table to avoid blackhole
[15:10:37] <Wes George> is this group going to provide guidance to host applications?
[15:10:42] <Wes George> they moved, they have to generate the traffic?
[15:11:02] <Wes George> benson - wg scope is DC, which we hope includes hosts
[15:11:23] <Wes George> hemant - previous presentation said "migration not an issue, magic"
[15:11:35] <Wes George> puzzling to me as to why vm migration had no impact
[15:11:54] <Wes George> still have not gotten this speaker's name
[15:12:22] <Wes George> much more complex network with another bridge - way to ensure that you don't depend on the server app to send out packets to update the switching network
[15:12:40] <Wes George> other environments used gratuitous arp
[15:12:58] <Wes George> if you have a critical app, want to ensure getting packets after migration, need a better solution to populate the switching tables
[15:13:29] <Wes George> warren - you have a vm on the left, you decide you'd like to migrate to the host on the right (planned)
[15:13:42] <Wes George> if this results in blackhole, you have other issues
[15:14:03] <Wes George> the solution is that host on the left knows it has moved, why not tunnel to the host it *knows* it's moved to
[15:14:36] <Wes George> benson - interesting, but off scope
[15:15:08] <Wes George> linda - redirect
[15:15:17] <Wes George> top of rack switch is aware of change and can do some sort of redirect
[15:15:41] <Wes George> warren - hypervisor has knowledge to virtualize this stuff
[15:16:03] <Wes George> would be trivial to have it receive the packet, and to copy the packets to the other host for the few seconds while the network updates
[15:16:12] <Wes George> you could even buffer during suspend/init
[15:16:22] <Wes George> linda - solution to this problem
[15:16:31] Erik Nordmark leaves the room
[15:16:38] <Wes George> thomas - is the host in scope? per the charter, modifying the generic host is out of scope
[15:16:45] <Wes George> however the hypervisor is a potential opportunity
[15:16:52] <Wes George> small number of hyp, interest in making it work
[15:17:13] <Wes George> one solution is tunneling, other might be trying to prime the pump
[15:17:35] <Wes George> benson - what I mean is that it's in scope in terms of understanding the problem, since we're not talking about solutions
[15:17:49] <Wes George> other questions?
[15:17:55] <Wes George> ___ session ends ____
[15:17:57] Wes George leaves the room
[15:18:10] Andrew McGregor leaves the room
[15:22:15] Bill leaves the room: Computer went to sleep
[15:23:58] Andrew McGregor joins the room
[15:24:19] Andrew McGregor leaves the room
[15:27:02] Benson Schliesser leaves the room
[17:18:55] Bill joins the room
[17:27:03] Bill leaves the room
[18:48:52] Andrew McGregor joins the room
[19:04:13] Andrew McGregor leaves the room