[clug-talk] nasty problem

Robert Lewko lewkor at gmail.com
Thu May 18 23:40:18 PDT 2006

Well, I think that I got lucky.  One thing that I noticed is that I had some
service.  What happened today is that I started getting UDP packets again.
Someone who has a lot more money than me or my client must have raised the
problem to a level where it got fixed.  There are lots of places where you
just can't use TCP.

Question: did the explanation I gave shed light on the thing or confuse the
hell out of everyone?

On 5/18/06, Gustin Johnson <gustin at echostar.ca> wrote:
> Robert Lewko wrote:
> >
> >
> > On 5/18/06, Gustin Johnson <gustin at echostar.ca> wrote:
> >
> >     Why are you using UDP instead of TCP?  Since your app is going over
> >     satellite, and latency is not as much of an issue as reliability,
> >     TCP seems better suited.  If a UDP packet is lost on the wire, there
> >     is no mechanism to help you isolate the problem.  It is possible
> >     that your app is being filtered.  You could try to change the port,
> >     but I would seriously consider TCP.  Keep in mind I am a network
> >     admin and not a developer; you may have a valid reason for using
> >     UDP, it is just S.E.P. from my point of view.
> >
> >
> > The reason that I used UDP is that no fork/exec/new socket is needed.
> > First you have to understand it's not the latency that makes satellite
> > communication hard, although without altering any socket options TCP
> > will detect the latency and start to back off.
> >
> > What happens is that you have two things to worry about.  On the client
> > end you have to worry about losing the connection and reconnecting, i.e.
> > one satellite passes out of LOS (line of sight) and it can be up to
> > 20-25 minutes before the next one is in sight, but the average is 10
> > minutes without service.  When the next satellite arrives you have a 2-3
> > minute period when you may have very sporadic network availability.  You
> > may have a 6 second period with network availability, just enough time
> > to dial and get a connection without getting data through.  Once the
> > next satellite is fully in sight you can have 90 minutes to 2 hours with
> > good service.
> This almost sounds like the old Iridium satellite phone, where the
> actual satellites flew around in a LEO, with no handoff between the
> satellites.
> >
> > So let's consider what it would look like if we used TCP.  This
> > application gets a file in a directory (not my design in this part)
> > every 5 minutes.  It parses the file, puts it in packets then sends it
> > to the server.  So it does a write to a TCP socket and gets an error:
> > "No route to host".  So it closes the socket and calls the windoze shit
> > that dials the network through the modem.  Great!  Now you have a
> > connection.  OK, you are in one of those 6 second spots of connectivity
> > at the start of getting a new satellite.  So you start to send a 4k
> > packet at 9600 baud.  Do you see a problem?  You won't get your packet
> > sent before the network goes down.  Remember the 3 way handshake that
> > TCP needs to complete before a connection is made?  Well, that uses up
> > about 3 seconds right there.  Using UDP you can actually get two 2k
> > packets through with their acks returned in 6 seconds.  BTW I have
> > restricted myself to a 2k packet size in my program.
> >
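The timing argument above can be checked with some rough arithmetic.  This is
only a back-of-the-envelope sketch in Python: it assumes 1 baud = 1 bit/s,
8 bits per byte, no header or framing overhead, and negligible time for the
acks.  The 9600 baud, 6 second and 3 second figures are the ones quoted above;
the names are made up for illustration.

```python
# Rough timing for the 6-second connectivity window.
# Assumptions: 1 baud = 1 bit/s, 8 bits per byte, no header/framing
# overhead, ack time ignored.

LINK_BPS = 9600      # modem speed quoted in the thread
WINDOW_S = 6.0       # short connectivity window quoted in the thread
HANDSHAKE_S = 3.0    # TCP three-way handshake estimate quoted in the thread


def transmit_seconds(num_bytes, bps=LINK_BPS):
    """Seconds needed to clock num_bytes onto a link running at bps bits/s."""
    return num_bytes * 8 / bps


tcp_total = HANDSHAKE_S + transmit_seconds(4096)  # handshake plus one 4k write
udp_total = 2 * transmit_seconds(2048)            # two 2k datagrams, no handshake

print(f"TCP 4k:     {tcp_total:.2f} s, fits in window: {tcp_total <= WINDOW_S}")
print(f"UDP 2 x 2k: {udp_total:.2f} s, fits in window: {udp_total <= WINDOW_S}")
```

Under these assumptions the TCP case needs about 6.4 seconds and misses the
window, while the two UDP datagrams need about 3.4 seconds, which leaves some
room for the acks.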
> > What's happening on the server?  There are two things you could do:
> > construct a single threaded server or one that uses fork/exec.  They
> > each exhibit a different form of the same problem.  The server accepts a
> > new connection.  The accept system call takes the new connection off the
> > listening socket's queue and returns a new fd for it (the new socket
> > keeps the listener's local port; no extra port is used).
> >
> > The single threaded server will use fd's and the fork/exec server will
> > use slots in the process table.  To make that clearer the single
> > threaded server will get activity that indicates a new connection, call
> > accept to get a new fd to communicate with the new connection and manage
> > that new connection in the next select call.  The concurrent server (one
> > process per connection) will wait for accept to return a new fd for a
> > new connection, then it will fork/exec to make a new process to handle
> > that connection.
> >
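For reference, the single threaded design described above looks roughly like
this in Python.  It is a minimal sketch using the standard selectors module;
the function names and the 4096 byte read size are assumptions for
illustration, not anything from the actual application.

```python
import selectors
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening TCP socket (port 0 picks a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    listener.setblocking(False)
    return listener


def poll_once(sel, listener, received, timeout=1.0):
    """Run one pass of the select loop; return the number of open client fds.

    The listener must already be registered with sel for EVENT_READ.
    """
    for key, _events in sel.select(timeout):
        if key.fileobj is listener:
            conn, _addr = listener.accept()   # one new fd per connection
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            try:
                data = key.fileobj.recv(4096)
            except ConnectionError:
                data = b""                    # treat a reset like a close
            if data:
                received.append(data)
            else:                             # peer closed the connection
                sel.unregister(key.fileobj)
                key.fileobj.close()
    return len(sel.get_map()) - 1             # do not count the listener
```

Note that a client fd is only released here when the peer closes or resets
the connection.  A client that simply vanishes never triggers either event,
so its fd stays registered.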
> > OK so data comes into that fd for a while until the client gets a broken
> > connection.  At that point the server socket will wait for hours,
> > literally indefinitely, for more data on that fd.  So now you have to put
> > a timer on each fd/process so you can detect when no data has been
> > received for whatever timeout period you decide to use (what do you
> > use for the timeout period?).  Keep in mind that 2-3 minute period can
> > generate 10-12 broken connections.  So depending on which server design
> > is used there will be either 10-12 unused fd's that have no client or
> > 10-12 processes that are there listening with no client to give them
> > data.  Also know that there are possibly 10 to 12 mobile systems doing
> > that, and now you have the possibility of 100 to 150 unused resources
> > that need to be cleaned up.  There is a maximum number of fd's per
> > process and a maximum number of processes that can run.  So, what if
> > they bought another company with a similar number of trucks, or another
> > company 10 times larger bought them 'cause of the way that they do
> > "real time" testing?  Instant problem!
> >
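The per-fd (or per-process) timer described above could be kept as a simple
table of last-activity times.  This is a minimal Python sketch, assuming the
server records a timestamp whenever data arrives; the names and the timeout
value in the usage are hypothetical, and picking a real timeout is exactly the
open question raised above.

```python
def stale_connections(last_seen, now, idle_timeout):
    """Return the ids whose last activity is older than idle_timeout seconds."""
    return [cid for cid, seen in last_seen.items() if now - seen > idle_timeout]


def reap(last_seen, now, idle_timeout, close):
    """Close and forget every stale connection; return how many were reaped.

    close is whatever releases the resource: closing the fd in the single
    threaded server, or signalling the child in the fork/exec server.
    """
    stale = stale_connections(last_seen, now, idle_timeout)
    for cid in stale:
        close(cid)
        del last_seen[cid]
    return len(stale)


# Worst case from the figures quoted above: 12 broken connections per
# satellite changeover, times 12 mobile systems.
worst_case = 12 * 12
print(worst_case)   # 144, inside the quoted 100 to 150 range
```

Something like reap(last_seen, time.time(), 600, close_fd) would then run once
per pass of the main loop, where 600 seconds and close_fd are placeholders.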
> > This whole discussion assumes that when the client gets a broken
> > connection while sending data, there is no way to tell the socket that
> > it can try again.  If someone knows how to do that and can point me to
> > docs then I will be glad of the info.  In my reading of Stevens I
> > didn't see how to recover from a broken connection.
> >
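On the question above: a TCP socket that has reported a broken connection
generally cannot be revived in place.  The usual pattern is to close it,
connect a fresh socket and resend.  A minimal Python sketch of that pattern
follows; the function name and the 10 second connect timeout are assumptions
for illustration.

```python
import socket


def send_with_reconnect(sock, server_addr, payload, connect_timeout=10.0):
    """Send payload; if the socket is dead, replace it and resend from the start.

    Returns the socket that ended up carrying the data (possibly a new one).
    """
    try:
        sock.sendall(payload)
        return sock
    except OSError:
        sock.close()   # the failed socket is discarded, not retried
        fresh = socket.create_connection(server_addr, timeout=connect_timeout)
        fresh.sendall(payload)
        return fresh
```

This does not remove the costs described above.  The new connection still
pays for a full handshake, and a successful sendall only means the data
reached the local TCP stack, so an application-level ack is still needed to
know that the server received it.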
> > Using UDP just sidesteps these issues.  You put the responsibility for
> > the communication on the client.  The client is the one that detects
> > when a packet has not been delivered, by putting a sequence/timestamp in
> > each packet and comparing that tuple to each packet that is returned.
> > If the seq/ts does not match the one you are looking for then dump it.
> > When it does match you can process the packet and transmit the next one
> > (there are more efficient ways of handling multiple packets, but I'm
> > keeping it simple).
> >
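The client-side scheme described above amounts to a stop-and-wait loop.  Here
is a minimal Python sketch of it.  The header layout (a 4 byte sequence number
plus an 8 byte timestamp), the retransmit delay and the retry count are all
assumptions for illustration; the real program's packet format is not given
in this thread.

```python
import socket
import struct
import time

# Hypothetical header: uint32 sequence number + float64 timestamp (12 bytes).
HEADER = struct.Struct("!Id")


def send_reliably(sock, server_addr, seq, payload,
                  retransmit_delay=2.0, max_tries=5):
    """Send one datagram and wait for an ack echoing the same seq/timestamp.

    Returns True once a matching ack arrives, False after max_tries attempts.
    """
    for _attempt in range(max_tries):
        header = HEADER.pack(seq, time.time())   # fresh timestamp per attempt
        sock.sendto(header + payload, server_addr)
        deadline = time.monotonic() + retransmit_delay
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                            # time to retransmit
            sock.settimeout(remaining)
            try:
                reply, _addr = sock.recvfrom(4096)
            except socket.timeout:
                break                            # time to retransmit
            if reply[:HEADER.size] == header:
                return True                      # seq/ts matches: delivered
            # Anything else is a stale or foreign packet: dump it and keep
            # waiting until the deadline.
    return False
```

Because each attempt carries a fresh timestamp, a late ack for an earlier
attempt is dumped.  The server may therefore see the same sequence number more
than once and needs to tolerate duplicates.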
> > The UDP server can be MUCH simpler by handling each packet as a self
> > contained entity.  It gets a packet from the client, processes that
> > packet, then uses the source address as the destination for the return
> > packet.  No fork, no exec, no accept/new fd, no cleanup.  With UDP you
> > can have one process that deals with one packet at a time.  What you
> > have to ensure is that the server does not get so busy that clients
> > fail to get their response before the end of the retransmit delay.
> It sounds like you are between a rock and a hard place, since tracking
> down UDP packet loss is not fun.  Personally, if I was asked to support
> this environment, I would push to replace Layers 1&2 with something more
> robust.  I am assuming that this is not an option.  I am also going to
> guess that GPRS and TDMA (cell data networks) are not viable options?  I
> do not envy the work you have in front of you.  You really were not
> being overly dramatic with the subject line.