IRC logs of #boinc for Monday, 2011-10-31

00:01 <MTughan> What would the symptoms be if it couldn't create the GUI RPC socket?

00:01 <PovAddict> it should say so in the log

00:03 <MTughan> Nothing in the stdout from running the client manually.

00:03 <MTughan> None of the stdout*.txt or stderr*.txt files are being updated either.

00:03 <MTughan> And of course, the permissions appear to be wrong on those files too...

00:04 <PovAddict> the log files are only used if you run the client with --redirectio or --daemon, otherwise it just logs to stdout/stderr

00:04 <PovAddict> also, I just now realized what redirectio means o.O

00:05 <MTughan> lol

00:05 <PovAddict> I always wondered why the n was missing from redirection, if it was a typo or a technical limitation on option length

00:05 <MTughan> Not sure if this is indicative of a problem either, but a project I had recently added doesn't have files in the notices folder. Could it be causing a problem?

00:05 <PovAddict> I just now noticed it means 'redirect I/O'

00:06 * PovAddict feels enlightened

00:09 <MTughan> Nuts... Something messed up all the permissions on the subfolders in projects too!

00:11 <MTughan> Recursive permissions fix worked.

00:12 <MTughan> Don't ask what happened, I have no clue myself.

00:12 <MTughan> Now I need more RAM for my computer... BURP apparently needs 12.5GB or so, and I only have 8.

00:15 <MTughan> Anyone know how the manager knows where to start the client? I'd like to move the data folder onto my RAID array. An 80GB SSD is a little limiting, especially with only 20GB left.

00:15 <MTughan> And that way, I shouldn't have to restore the folder from a backup or a failing hard drive again.

00:25 *** MTughan has quit IRC

00:25 *** MTughan has joined #boinc

00:25 <MTughan> Well, so that was a colossal failure.

00:25 <MTughan> I found all instances of C:\ProgramData\BOINC in the registry, and changed it to my RAID drive letter. Copied the data folder over, started the manager. Fine, picked everything up.

00:26 <MTughan> Start computation, and instant BSOD.

00:26 *** licensed has quit IRC

01:40 *** efc has joined #boinc

01:51 *** Haldrik has joined #boinc

02:19 *** quail has quit IRC

02:29 *** bennym has joined #boinc

02:31 <bennym> Are there any docs on the BOINC client/manager protocol? I want to write a manager client for my phone (to at least view the tasks) and a status webpage (serverside generated).

02:39 <efc> Manager app sounds like fun.. not sure about the protocol docs, don't know anything you won't think of yourself.

02:39 <efc> we do have another channel for that kind of thing, no idea if anybody is on it

02:39 <efc> #boinc_dev
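For bennym's protocol question, here is a rough sketch based on our understanding of the GUI RPC interface. It is not taken from the log. The client is understood to listen on TCP port 31416 and to exchange XML messages. Each request is wrapped in `<boinc_gui_rpc_request>`, each reply in `<boinc_gui_rpc_reply>`, and every message ends with a `\003` byte. The code below only builds requests and unframes replies, with no socket handling. The `get_results` request name, the `<result>/<name>` reply layout and the sample task names are assumptions to check against the GUI RPC docs. The authentication handshake that most requests need first is not shown.

```python
import xml.etree.ElementTree as ET

GUI_RPC_PORT = 31416      # assumed default client port; check your client config
END_OF_MESSAGE = b"\x03"  # assumed message terminator (ASCII ETX)

def frame_request(op: str) -> bytes:
    """Wrap one empty-element request, e.g. op='get_results' (assumed name)."""
    body = f"<boinc_gui_rpc_request>\n<{op}/>\n</boinc_gui_rpc_request>\n"
    return body.encode("utf-8") + END_OF_MESSAGE

def split_replies(buffer: bytes) -> tuple[list[bytes], bytes]:
    """Split a receive buffer into complete messages plus an unfinished tail."""
    *complete, rest = buffer.split(END_OF_MESSAGE)
    return complete, rest

def task_names(reply: bytes) -> list[str]:
    """Collect <name> from every <result> element (assumed reply layout)."""
    root = ET.fromstring(reply.decode("utf-8"))
    return [r.findtext("name", default="") for r in root.iter("result")]

# Usage with a made-up reply; the trailing bytes are a partial next message.
sample = (b"<boinc_gui_rpc_reply><results>"
          b"<result><name>wu_1_0</name></result>"
          b"<result><name>wu_2_1</name></result>"
          b"</results></boinc_gui_rpc_reply>\x03<boinc_gui")
messages, leftover = split_replies(sample)
print(task_names(messages[0]))  # ['wu_1_0', 'wu_2_1']
```

Keeping the framing separate from the socket code lets a phone app or a server-side status page reuse the same parser.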

02:44 *** bennym_ has joined #boinc

02:45 *** bennym has quit IRC

02:45 *** bennym_ is now known as bennym

03:55 *** desti has quit IRC

04:12 *** desti has joined #boinc

04:33 *** Caterpillar has joined #boinc

04:40 *** quail has joined #boinc

05:43 *** Marcofe has joined #boinc

06:05 *** Aeternus has joined #boinc

06:05 *** Aeternus has quit IRC

06:05 *** Aeternus has joined #boinc

07:07 *** Haldrik has quit IRC

07:22 *** Haldrik has joined #boinc

08:00 *** trogdog has joined #boinc

08:05 *** bennym has quit IRC

09:23 *** desti has quit IRC

09:55 *** desti has joined #boinc

10:15 *** efc has quit IRC

10:21 *** trogdog has quit IRC

10:50 *** Marcofe has quit IRC

10:53 *** Marcofe has joined #boinc

10:53 *** Marcofe has quit IRC

10:53 *** Marcofe has joined #boinc

11:12 *** mweltin has joined #boinc

11:51 *** licensed has joined #boinc

11:51 *** licensed has quit IRC

11:51 *** licensed has joined #boinc

12:07 *** Marcofe^ has joined #boinc

12:07 *** Marcofe^ has quit IRC

12:32 <Caterpillar> seriously, why is ibercivis no longer giving out work units?

12:50 <PovAddict> MTughan: CUDA or just CPU?

12:57 *** RoyK has joined #boinc

13:02 *** [B^S]renemayer has joined #boinc

13:03 <RoyK> with boinc, can I setup, say, 1000 jobs to be scheduled, and if some run too slowly, to save their contents and move them to another node? that is - what sort of scheduling control is there?

13:06 *** dp has joined #boinc

13:10 *** [B^S]renemayer has quit IRC

13:10 <RoyK> thinking of stuff like Condor is supporting, like migrating tasks http://www.cs.wisc.edu/condor/description.html

13:10 <Romulus> Title: What is Condor? (at www.cs.wisc.edu)

13:10 <PovAddict> RoyK: you can't move tasks between computers, the server may have sent you a task *because* of the properties of that one computer

13:11 <PovAddict> for example it may have sent you a task because your CPU model matches the CPU model of another computer that ran the same task, to ensure the results of both can be compared properly

13:12 <RoyK> hm.. maybe condor might be better for this, then

13:14 <PovAddict> ideally the client machine shouldn't be downloading more tasks than it can handle to begin with

13:14 *** kamiccolo has joined #boinc

13:14 <RoyK> yeah, but AFAICS condor solves this by migrating tasks if a client needs its resources for some reason

13:15 <PovAddict> in BOINC your only alternatives for that are:

13:15 <PovAddict> - send the task to multiple computers since the beginning; if a machine goes down or can't finish in time, some other computer will already have it in its queue too

13:16 <PovAddict> disadvantage: if nothing goes down, you do work twice or more

13:16 <PovAddict> but with untrusted machines, work duplication is usually done anyway, to verify the results

13:17 <RoyK> I see

13:17 <PovAddict> - if a client needs to suspend its BOINC duties, have it *abort* all tasks and report them as aborted to the server, so the server will re-send it to someone else
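PovAddict describes two alternatives: send copies to several hosts from the start, or have a host abort its tasks so the server re-sends them. Both come down to the same server-side bookkeeping. The sketch below shows it under a few assumptions. The `min_quorum` name only loosely echoes BOINC's workunit settings. Plain equality stands in for a project's own validator. `IN_PROGRESS` and `FAILED` are hypothetical markers. It is not BOINC's actual transitioner or validator code.

```python
from collections import Counter

IN_PROGRESS = object()  # hypothetical marker: copy sent, no answer yet
FAILED = None           # hypothetical marker: aborted by the user or deadline passed

def finished_values(outcomes):
    """Result values actually reported back (ignores pending and failed copies)."""
    return [v for v in outcomes.values() if v is not IN_PROGRESS and v is not FAILED]

def canonical_result(outcomes, min_quorum):
    """A value reported identically by at least min_quorum hosts, else None.

    Plain equality stands in for the project-specific validator.
    """
    top = Counter(finished_values(outcomes)).most_common(1)
    if top and top[0][1] >= min_quorum:
        return top[0][0]
    return None

def replicas_to_send(outcomes, min_quorum):
    """How many new copies to send so that a quorum is still reachable."""
    if canonical_result(outcomes, min_quorum) is not None:
        return 0
    best = max(Counter(finished_values(outcomes)).values(), default=0)
    pending = sum(1 for v in outcomes.values() if v is IN_PROGRESS)
    return max(0, min_quorum - best - pending)

# hostA answered, hostB is still working, hostC aborted: nothing new to send yet.
outcomes = {"hostA": "0xbeef", "hostB": IN_PROGRESS, "hostC": FAILED}
print(replicas_to_send(outcomes, min_quorum=2))  # 0
```

An aborted copy simply stops counting toward the quorum. The next call then asks for a replacement, which is the "re-send it to someone else" step.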

13:18 <RoyK> so for an in-house installation, boinc may not be as suitable as condor?

13:18 <PovAddict> I don't know anything about condor

13:18 <PovAddict> BOINC has no problems with "in-house installations"; the problem is that BOINC is designed for high throughput, not low latency

13:18 <RoyK> I hardly know any of them

13:20 <RoyK> the jobs we'll probably be distributing will, on average, take anything from a few hours to a couple of weeks to finish on a fairly modern cpu

13:21 <RoyK> we currently have six machines with a total of 136 cores with another few 64-core machines in the queue

13:22 <PovAddict> most public projects (especially SETI@Home, the original) have thousands or millions of tasks to do, and it *might* happen that a single short task ends up taking months

13:22 <PovAddict> for example, it may get sent to someone who is trying BOINC for the first time, and uninstalls it within an hour, so the task isn't ever run or returned

13:22 <PovAddict> task deadline arrives, task gets sent to someone else

13:23 <RoyK> yeah, but if I understand BOINC correctly, its main strength is for long-term projects, not if you want to start a simulation and want it to finish within the week - am I right?

13:23 <PovAddict> who finishes it quickly but doesn't connect to the internet again for two more days

13:23 <PovAddict> then he returns it and it turns out to be invalid data (user has bad RAM), so it has to be re-sent *again*...

13:23 <desti> you might work around it if you send trickles during the calculation and then create a new workunit from them

13:23 <PovAddict> but it doesn't really matter that that one task takes months, because meanwhile a ton of other tasks got processed

13:24 <PovAddict> like I said, high throughput, not low latency

13:24 <PovAddict> RoyK: that's correct

13:24 <RoyK> thanks for the info :)

13:24 * RoyK reads more about condor...

13:25 <PovAddict> if you want to start a simulation and want it done within a week, a very high proportion of the tasks will get done, and then you have the 'long tail' of tasks that take longer due to bad luck and a chain of problems :P

13:25 <RoyK> that was what I feared...

13:26 <PovAddict> a workaround may be that once you have 1% of tasks remaining (and probably a lot of idle nodes), you replicate them like crazy

13:26 <PovAddict> for each task, *some* computer will finish it quickly
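PovAddict's workaround is to replicate the last 1% of tasks onto idle nodes. It can be sketched as a small planning step. This is his suggestion, not a BOINC feature we know of. The `tail_replicas` name, the 1% default and the round-robin spread are illustrative choices.

```python
def tail_replicas(remaining, total, idle_hosts, threshold=0.01):
    """Plan extra copies of unfinished tasks once only a small tail is left.

    remaining:  names of tasks not yet completed
    total:      number of tasks in the whole batch
    idle_hosts: hosts with nothing else to run
    Returns (host, task) pairs; empty until the tail threshold is reached.
    """
    if not remaining or total == 0 or len(remaining) / total > threshold:
        return []
    # Spread idle hosts round-robin so every straggler gets extra copies.
    return [(host, remaining[i % len(remaining)]) for i, host in enumerate(idle_hosts)]

# 5 of 1000 tasks left and 12 idle hosts: each task gets 2 or 3 extra copies.
plan = tail_replicas([f"t{i}" for i in range(1, 6)], 1000, [f"h{i}" for i in range(12)])
print(len(plan))  # 12
```

Whichever copy finishes first wins, and the server can cancel the rest. That trades some duplicated work at the very end for a shorter long tail.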

13:28 <RoyK> I like the Condor approach better - checkpoint+migrate the job if needed

13:28 <PovAddict> migrating a task while it's halfway done?

13:28 <RoyK> yeah

13:29 <PovAddict> that's in general a hard problem, but one reason why BOINC has absolutely nothing like that is that it's usually behind low-bandwidth links, not a LAN

13:29 <RoyK> but then, we're on a LAN :P

13:29 <PovAddict> it's impractical to have somebody's PC upload 1GB of intermediate data to be downloaded on another PC to resume the task :)

13:29 <RoyK> indeed

13:29 <PovAddict> and there's the problem of portability of that intermediate data

13:30 <PovAddict> none of that is a problem if you have a LAN network of homogeneous nodes

13:30 <RoyK> so long as either the code is written well enough, or at least as long as you have the same endianness, it shouldn't be a big issue

13:31 <RoyK> I mean, what can possibly go wrong? (famous last words...)
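RoyK's "written well enough" case mostly means a checkpoint format that pins its byte order. The sketch below does that with Python's `struct` module, so the file decodes the same way on little- and big-endian hosts. The file layout (one step counter plus one double) and the `run` loop are hypothetical stand-ins for a real simulation. Byte order is only half of the portability problem. As the next lines of the log point out, the floating-point results themselves can differ between CPUs.

```python
import struct

# "<" forces little-endian byte order with standard sizes and no padding,
# so the file decodes identically whatever the host CPU's native order is.
CHECKPOINT_FORMAT = "<Qd"  # step counter (unsigned 64-bit), accumulator (IEEE 754 double)

def save_checkpoint(path, step, acc):
    with open(path, "wb") as f:
        f.write(struct.pack(CHECKPOINT_FORMAT, step, acc))

def load_checkpoint(path):
    with open(path, "rb") as f:
        data = f.read(struct.calcsize(CHECKPOINT_FORMAT))
    return struct.unpack(CHECKPOINT_FORMAT, data)

def run(path, total_steps, every=1000):
    """Resume from the checkpoint if one exists, then keep checkpointing."""
    try:
        step, acc = load_checkpoint(path)
    except FileNotFoundError:
        step, acc = 0, 0.0
    while step < total_steps:
        acc += 1.0 / (step + 1)  # stand-in for one unit of real work
        step += 1
        if step % every == 0:
            save_checkpoint(path, step, acc)
    return acc
```

Copying the checkpoint file to another node and calling `run` there picks up from the last saved step, which is the migrate-after-checkpoint idea in miniature. A real implementation would also write to a temporary file and rename it, so that a crash mid-write cannot leave a truncated checkpoint.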

13:31 <PovAddict> many real projects are affected by different floating point rounding between AMD and Intel! :)

13:31 <RoyK> ouch

13:31 <PovAddict> for example, they can't run one copy of a task in AMD, another on Intel, and then check if the results are identical to tell if they are valid
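Why bit-for-bit comparison is fragile can be seen on a single machine. The AMD/Intel differences PovAddict mentions come from hardware and math-library details. The snippet below does not reproduce that mechanism. It only shows how changing the evaluation order of the same three additions already changes the last bits. The two helper names are hypothetical. The tolerance check is shown as the usual alternative to forcing identical hardware, not as what any particular project does.

```python
import math

def same_result_bitwise(x: float, y: float) -> bool:
    """Exact comparison: what a validator needs when copies must match bit for bit."""
    return x == y

def same_result_fuzzy(x: float, y: float, rel_tol: float = 1e-9) -> bool:
    """Tolerance-based comparison, the usual alternative when exact matches fail."""
    return math.isclose(x, y, rel_tol=rel_tol)

# Same three numbers, two evaluation orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right)                       # 0.6000000000000001 0.6
print(same_result_bitwise(left, right))  # False
print(same_result_fuzzy(left, right))    # True
```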

13:31 <RoyK> got a link with info on that?

13:32 <PovAddict> so there's a feature on BOINC so that when a copy of a task is sent to an AMD machine, all other copies of that task are also sent to AMD, so that they can be compared

13:32 <RoyK> all our current compute nodes are AMD, but then, if using boinc or condor, the whole point is to use different systems...

13:32 <PovAddict> there's a lot of flexibility to let the project customize what pairs of platforms are equivalent

13:33 <RoyK> anyway - bbl - I need to read up on this condor....

13:33 <PovAddict> http://boinc.berkeley.edu/trac/wiki/HomogeneousRedundancy

13:33 <Romulus> <http://tinyurl.com/7h45uh> (at boinc.berkeley.edu)

13:34 <RoyK> thanks
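The homogeneous redundancy idea in the linked page can be sketched as "pin a workunit to the platform class of its first host". Per the log, BOINC lets each project decide which platforms count as equivalent. The `hr_class` function below (OS plus CPU vendor) is a hypothetical stand-in for that project-defined mapping. `Host` and `pick_hosts` are illustrative names, not BOINC's scheduler code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Host:
    name: str
    os: str
    cpu_vendor: str

def hr_class(host: Host) -> tuple[str, str]:
    """Hypothetical equivalence class: hosts in one class are assumed to agree bit for bit."""
    return (host.os, host.cpu_vendor)

def pick_hosts(candidates, copies):
    """Choose up to `copies` hosts for one workunit, all from a single HR class.

    The first host chosen pins the class; later hosts must match it.
    """
    chosen, pinned = [], None
    for host in candidates:
        if pinned is None or hr_class(host) == pinned:
            chosen.append(host)
            pinned = hr_class(host)
            if len(chosen) == copies:
                break
    return chosen

hosts = [Host("a", "Windows", "AMD"), Host("b", "Linux", "Intel"),
         Host("c", "Windows", "Intel"), Host("d", "Windows", "AMD")]
print([h.name for h in pick_hosts(hosts, 2)])  # ['a', 'd']
```

The cost is scheduling flexibility. A workunit pinned to a rare class may wait longer for a matching host, which fits the log's point that BOINC favors throughput over latency.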

13:38 <RoyK> some references to Eric McIntosh - I guess the guys changed his name after falling in love with apple? :)

13:45 *** Marcofe has quit IRC

14:06 <RoyK> reading up on Condor, it seems to be a rather good solution given a good network, not home users.....

14:08 *** dp has left #boinc

15:36 *** PovAddict has quit IRC

15:37 *** PovAddict has joined #boinc

16:34 *** whynot has joined #boinc

16:54 *** licensed has quit IRC

17:10 *** licensed has joined #boinc

17:10 *** licensed has quit IRC

17:10 *** licensed has joined #boinc

18:15 <CoderForLife> I have had some experience with Condor, RoyK.  We used it at work for a dedicated application, not a general purpose, public platform.

18:15 <PovAddict> I don't think Condor is ever used for public stuff

18:16 <PovAddict> http://mb.walter.silvergeeks.com/wp-content/gallery/xyz/gesetz_der_serie.jpg this is awesome

18:16 <Romulus> <http://tinyurl.com/3uq4u3v> (at mb.walter.silvergeeks.com)

18:20 *** MTughan has quit IRC

18:21 *** MTughan has joined #boinc

18:25 <RoyK> CoderForLife: seems a good platform if you have sufficient bandwidth

18:26 <RoyK> CoderForLife: or am I wrong?

18:26 *** MTughan has quit IRC

18:28 <PovAddict> in a corporate environment, anything is sufficient bandwidth

18:28 <PovAddict> BOINC was made in dialup days

18:29 <RoyK> that was some time ago....

18:30 <PovAddict> SETI@Home started in 1999

18:32 <PovAddict> BOINC started in 2002, not sure when SETI@Home for BOINC officially opened to the public, but they closed the old ("classic") pre-BOINC system in 2005

18:33 <RoyK> yes, and CP/M was a good system at the time, but things moved forward a bit

18:34 <PovAddict> heh

18:34 <PovAddict> well, I won't tell you much about the internals of BOINC

18:34 <PovAddict> don't want to discourage you :P

18:34 *** Haldrik has quit IRC

18:35 <RoyK> PovAddict: it's open source, isn't it ;)

18:35 <PovAddict> yes it is

18:35 <RoyK> so it shouldn't be hard to read....

18:36 <PovAddict> reading code is harder than writing code, that's a known and proven fact :)

18:36 <RoyK> but then, I think something like Condor or perhaps Celery might be better for an in-house project

18:37 <RoyK> PovAddict: I've done my time reading Asterisk PBX source and having been through that, I think I can decipher most other projects :P

18:37 <CoderForLife> we were running it on a switched 1Gb LAN - bandwidth wasn't an issue

18:38 *** whynot has quit IRC

18:39 <RoyK> CoderForLife: did you find any bad issues with it?

18:39 <PovAddict> CoderForLife: condor or boinc?

18:40 <CoderForLife> We thought that it would work if we had the need for a home-grown solution.  In the end, we adopted Microsoft HPC Server 2008, which our software vendors were all going to support.

18:40 <PovAddict> eww :P

18:41 <CoderForLife> We have 1000 cores in HP blade centers

18:41 <RoyK> with mickysoft?

18:41 <CoderForLife> used for stochastic modeling of financial reserves required to support annuity products

18:42 <CoderForLife> yes HPC

18:42 <CoderForLife> we are planning to add the capability to burst to Azure next year when that's supported

18:43 <RoyK> I guess I'd lose 95% of my kudos at work if I were ever to suggest an MS solution

18:43 <PovAddict> that's the spirit

18:43 <PovAddict> RoyK: see my link above

18:43 <CoderForLife> also looking into using only Azure for disaster recovery

18:44 <CoderForLife> it was a common platform for all the packages we run

18:44 <CoderForLife> supported by MG-ALFA, MoSES and Polysystems - one grid, a bunch of uses

18:45 <RoyK> I think using Condor with a checkpoint server with ZFS would be a rather good idea

18:45 <CoderForLife> pretty much mandated by regulations/laws

18:45 <CoderForLife> Windows Server 2008 is out standard server OS

18:45 <CoderForLife> er our

18:46 <CoderForLife> we are hosting that and WebSphere under VMware

18:46 <RoyK> no offence, but windows servers are dead heavy

18:46 <CoderForLife> regulators love them

18:46 <RoyK> I don't

18:47 <PovAddict> RoyK: CoderForLife is in the insurance industry; it moves slowly and is chained to laws and regulations

18:47 <CoderForLife> when you become a regulator, then I'll do things your way =)

18:47 <RoyK> PovAddict: meaning they should use HP/UX or WinNT 4? ;)

18:47 <CoderForLife> thanks for summing it up so eloquently

18:48 <PovAddict> RoyK: no, meaning a lot of their software on those shiny Windows 2008 servers is really a frontend to *mainframes*

18:48 <CoderForLife> we only have 2 PCs running NT 4 SP 4a

18:48 <CoderForLife> er 6a

18:48 <PovAddict> CoderForLife: I think that's slightly unsupported by now :)

18:48 <CoderForLife> only 1 mainframe

18:49 <CoderForLife> tell my users who won't let go of them

18:50 <CoderForLife> something about the need to print insurance policies - go figure

18:50 <CoderForLife> I have bought the replacement system - trying to get them to adopt the new thing

18:51 <RoyK> btw, if anyone were to think of trying linux on Hyper-V.....

18:51 <RoyK> that sucks rather badly

18:51 <CoderForLife> http://www.thunderhead.com/ is the new thing

18:51 <Romulus> Title: Enterprise Communication Platform (at www.thunderhead.com)

18:52 <CoderForLife> going to check on the trick-or-treaters

18:53 <RoyK> "Have you tried turning it off and on again?"

19:03 *** klaas has quit IRC

19:05 *** klaas has joined #boinc

21:07 *** Aeternus has quit IRC

22:27 *** Caterpillar has quit IRC

23:16 *** licensed has quit IRC

23:29 *** licensed has joined #boinc

23:29 *** licensed has joined #boinc

Generated by irclog2html.py 2.4 by Marius Gedminas - find it at mg.pov.lt!