alexseanchai: Katsuki Yuuri wearing a blue jacket and his glasses and holding a poodle, in front of the asexual pride flag with a rainbow heart inset. (Supernatural music)
let me hear your voice tonight ([personal profile] alexseanchai) wrote 2012-08-23 12:42 am

this is crazy, y/y?

I want to make a social-justice-aware random-character generator. Preferably web-based rather than something to download from Sourceforge, ideally using numbers from random.org. Tell it whether you want a single character or a hundred, and if a hundred, whether you want them to be a microcosm of the community or a random selection of individuals from the community. Difference being, if the sex ratio in the community is 1:1 exactly, the microcosm would have fifty of each, while the random selection would be close to fifty each but not necessarily right on the button. A microcosm of the US would have eleven or twelve Californians and three or four individuals from as many of the sixteen US states and populated territories that each have less than half a percent of the US population; a random selection of individuals from the US might have more than twelve or fewer than eleven Californians, and might have representatives from more than four or fewer than three of those sixteen places and/or more than one representative from any one of those sixteen places.
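A minimal Python sketch of the microcosm/random-selection difference, for illustration only. The function names are made up here, and the largest-remainder rounding used for the microcosm is an assumption (one reasonable way to settle "eleven or twelve"), not something the post specifies.

```python
import random
from collections import Counter


def microcosm(shares, n):
    """Allocate n characters so the counts track the shares as closely as
    whole numbers allow (largest-remainder rounding, an assumed choice)."""
    total = sum(shares.values())
    quotas = {label: n * share / total for label, share in shares.items()}
    counts = {label: int(quota) for label, quota in quotas.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining places to the labels with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda k: quotas[k] - counts[k], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1
    return counts


def random_selection(shares, n, rng=random):
    """Draw n independent characters; counts only approximate the shares."""
    labels = list(shares)
    picks = rng.choices(labels, weights=[shares[k] for k in labels], k=n)
    return Counter(picks)


sex_ratio = {"female": 50, "male": 50}
print(microcosm(sex_ratio, 100))         # exactly fifty of each
print(random_selection(sex_ratio, 100))  # near fifty each, varies per run
```

The microcosm is deterministic for a given set of shares, while the random selection changes every time it runs.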

Then go through a series of screens: sex assigned at birth, gender identity, sexual orientation, race, ethnicity, religion, physical disability, mental illness, household income, age, height, weight, place of origin, urban vs suburban vs rural; I'll probably think of more later. Default input values assume the community is the United States as of the most recent data I can find, but the user should be able to change the labels and the percentages to suit (and the page should throw an error if the percentages don't total 100% exactly, or else the last input should be unchangeably labeled 'Rounding Error' and automatically change to ensure that the inputs total 100%, and in that second case the page should throw an error if Rounding Error is negative or too high); that way the user can get a hundred representative Alaskans or Hawaiians or Londoners or Fictionalplaceians as easily as a hundred representative USAians, provided the user has access to demographic data for the community in question.
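The two validation options could be sketched roughly like this in Python. The function name, the `strict` flag, and the `tolerance` default for "too high" are all hypothetical; the post does not say how large a Rounding Error is acceptable.

```python
def normalize_shares(shares, strict=False, tolerance=0.5):
    """Check that percentages total 100.

    strict=True: raise unless the total is exactly 100.
    strict=False: append a 'Rounding Error' entry for the shortfall, and
    raise if that entry would be negative or above `tolerance`
    (an assumed threshold).
    """
    for label, pct in shares.items():
        if pct < 0:
            raise ValueError(f"negative percentage for {label!r}")
    # Round to dodge floating-point noise such as 99.69999999999999.
    remainder = round(100 - sum(shares.values()), 6)
    if strict and remainder != 0:
        raise ValueError(f"percentages total {100 - remainder}, not 100")
    if remainder < 0:
        raise ValueError("percentages total more than 100")
    if remainder > tolerance:
        raise ValueError(f"Rounding Error of {remainder} is too high")
    result = dict(shares)
    if remainder > 0:
        result["Rounding Error"] = remainder
    return result


print(normalize_shares({"a": 33.3, "b": 33.3, "c": 33.1}))
```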

And, though I'm not sure I can pull this off (I don't know if the raw data is out there—obviously Miguel is a much more frequent name among Hispanic folk than Michael is, and Michael much more frequent among non-Hispanic folk than Miguel, but how often does each name occur in each population?), I want to cross-reference at least sex and race, perhaps also ethnicity and age, with US census name-frequency data in order to give the characters names. I don't think I'd be able to provide that for any community other than the US as a whole, though, so it might be better to let the user find appropriate names for the randomly generated characters.

The way this would work is, each page would compare a series of random numbers to the input values. For instance, say the input values are heterosexual 65%, bisexual 15%, homosexual 15%, asexual 5%. First random number is 12, which is between 1 and 65 inclusive, therefore character 1 is straight. Second number is 88, which is between 81 and 95 inclusive, therefore character 2 is gay. Ideally the page would get the numbers straight from random.org (whether a list, which allows repeats, or a sequence, which does not, would depend on whether the user was looking for a random selection or a microcosm), rather than relying on the programming language's random-number function and rather than relying on the user to follow instructions re generating a list or sequence as appropriate on random.org and copying said numbers into an input field on the page. End result would be a database or a delimited text file that the user could open in or copy to a spreadsheet program, sort by whatever column if desired, and pick a line where the character's demographic characteristics suit the user.
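Here is a rough Python sketch of that comparison step, using the orientation percentages above. `pick` is a made-up name, and Python's built-in `random` module stands in for random.org purely so the example runs on its own: `random.sample` plays the part of a random.org sequence (no repeats) and `random.randint` the part of a list (repeats allowed).

```python
import csv
import io
import random
from bisect import bisect_left
from itertools import accumulate


def pick(shares, roll):
    """Map a roll in 1..100 to a label via cumulative upper bounds."""
    labels = list(shares)
    upper = list(accumulate(shares.values()))  # 65, 80, 95, 100 here
    if not 1 <= roll <= upper[-1]:
        raise ValueError(f"roll {roll} is out of range")
    return labels[bisect_left(upper, roll)]


shares = {"heterosexual": 65, "bisexual": 15, "homosexual": 15, "asexual": 5}
print(pick(shares, 12))  # heterosexual (1 to 65)
print(pick(shares, 88))  # homosexual (81 to 95)

# Sequence: each of 1..100 exactly once, so the result is a microcosm.
sequence = random.sample(range(1, 101), 100)
# List: repeats allowed, so the result is a random selection.
rolls = [random.randint(1, 100) for _ in range(100)]

# Write the microcosm as delimited text that a spreadsheet can open.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["character", "orientation"])
for number, roll in enumerate(sequence, start=1):
    writer.writerow([number, pick(shares, roll)])
print(out.getvalue()[:60])
```

Because the sequence uses every number from 1 to 100 once, exactly 65 of those characters come out heterosexual; with `rolls` the count only tends toward 65.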

Does this sound feasible? If so, what programming language do I need to learn to do it, how much time/effort is it likely to take me, and what would I need to do in order to make it publicly accessible?
ysabetwordsmith: Cartoon of me in Wordsmith persona (Default)

SQUEE!!

[personal profile] ysabetwordsmith 2012-08-23 07:19 am (UTC)(link)
I want this so very much. I basically did something similar to generate details for most of the Schrodinger's Heroes characters and I loved that.

If you need money for the project, try crowdfunding, there are probably other folks who would help. I could boost the signal if you need that at any point. I'd love to hear more about this.
ysabetwordsmith: Cartoon of me in Wordsmith persona (Default)

Re: SQUEE!!

[personal profile] ysabetwordsmith 2012-08-23 07:49 am (UTC)(link)
Well, part of it is simple. You have a set number of fields, each of which needs to process random number input; and some processes for making those interact to derive a desired result. Computers can do that. It's just a matter of whether you can figure out the code. I'd say start with that, build it first.

Next you have the demographic data, so it can do things like pick names, or recommend known parameters for a given culture/timeframe. That's going to be harder. Names you could probably pull from a database somewhere, the web is full of stuff like that because it's popular info. More detailed demographics you'll have to hunt down one at a time. Maybe consider asking people which things they would use most often, and go in order of popularity.

Some of that will be easy. We have pretty good data supporting that about 10% of the population is homosexual, and some data suggesting that about 1% is asexual. List those, link some sources, and it's usable. Disabilities vary; some are much more common in certain populations. But you could rough-sort to categories like "impaired vision, impaired hearing, limited mobility" etc. It might be worth adding in some statistics for veteran injuries; there are studies tracking the signature injuries of different wars and what a typical sample of 100 nonfatal casualties would be. You could probably pull racial demographics from census records, but that's tricky because those categories change in politically-inspired freaky ways over time.

Anyhow, do the core first if you can figure out the coding, and expand with fancier stuff as you go along.
twisted_times: A grey yin-yang like symbol, but with a pentagon and a golden apple inscribed with the word "Kallisti" replacing circles. (Sacred Chao)

Re: SQUEE!!

[personal profile] twisted_times 2012-08-23 10:46 am (UTC)(link)

I was going to say so myself after reading over the OP: the coding isn't going to be too hard per se, but getting the demographic data correct is going to be hard, especially if you want it to be right not just for one single timeframe and place/society, but across multiple aspects of all of the above.

As an aside, I do software testing (primarily of website functionality and UI) as my paying job, so feel free to ping me once you get it up-and-running to have someone put it through its paces. :)

twisted_times: Animated icon saying "Sing like nobody's listening, live like you'll die tomorrow, dance like nobody's watching..." etc (dance)

Re: SQUEE!!

[personal profile] twisted_times 2012-08-25 05:50 pm (UTC)(link)

You're welcome. Be warned - I've been told I'm a natural at breaking web pages. ;p

ysabetwordsmith: Cartoon of me in Wordsmith persona (Default)

Re: SQUEE!!

[personal profile] ysabetwordsmith 2012-08-24 06:24 am (UTC)(link)
Oh ... yeah, I should mention that too. I do the kind of beta-testing where someone wants me to poke at their website, try to break it, and point out anything that is less than perfectly clear. Sort of crash-test it.
avia: A cute cygnet with a happy and blushing expression, drawn in a dramatic cartoon style. (happy cygnet)

[personal profile] avia 2012-08-23 07:29 am (UTC)(link)
I don't know ANYTHING about programming, but yes yes yes yes I want this~!
synecdochic: torso of a man wearing jeans, hands bound with belt (Default)

[personal profile] synecdochic 2012-08-23 07:57 am (UTC)(link)
Does this sound feasible? If so, what programming language do I need to learn to do it, how much time/effort is it likely to take me, and what would I need to do in order to make it publicly accessible?

1) Not only feasible but fairly easy (your big problem would be maintaining the tables of proportions, etc, and I would counsel you to make those MySQL-stored and editable on-site with an admin tool rather than hardcoding them as constants in the source code);

2) PHP or Ruby are the common ones, with slight preference to Ruby (PHP is no longer the default My First Webdev Language); I'd personally probably just hack something together in Perl, but that's because Perl is the Swiss Army Chainsaw of programming languages and I already know what I'm doing with it. Ruby on Rails is likely your best bet.

3) It's not a complex system as you spec it. Honestly, a good half of your complexity as specced would be in the call-outs to random.org because they don't have an API to the best of my knowledge and you'd have to screenscrape and parse for zero reason; use the random functions your language comes with instead. Depending on how you did it you'd need to learn some basic MySQL syntax, enough to create, access, and manipulate the db tables at least, but Ruby on Rails makes that easy. Your big issues while coding would likely be getting the math right, not any implementation details.

4) Web-accessible server running whatever your programming language of choice is + MySQL + Apache, pretty much your bog-standard LAMR (linux + apache + mysql + ruby) or LAMP (s/ruby/p[erl|hp|ython]) stack. Just about any hosting package would qualify. See if somebody has a Dreamhost promo code kicking around or something.
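To make points 1 and 3 concrete, a small sketch of proportions kept in a database table and sampled with the language's own random functions. It is written in Python with the standard-library `sqlite3` module standing in for MySQL so it runs without a server; the table layout, column names, and `draw` function are all assumptions for illustration.

```python
import random
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL table of proportions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (trait TEXT, label TEXT, pct REAL)")
conn.executemany(
    "INSERT INTO shares (trait, label, pct) VALUES (?, ?, ?)",
    [
        ("orientation", "heterosexual", 65),
        ("orientation", "bisexual", 15),
        ("orientation", "homosexual", 15),
        ("orientation", "asexual", 5),
    ],
)


def draw(conn, trait, rng=random):
    """Pick one label for a trait, weighted by the stored percentages."""
    rows = conn.execute(
        "SELECT label, pct FROM shares WHERE trait = ?", (trait,)
    ).fetchall()
    if not rows:
        raise KeyError(trait)
    labels, weights = zip(*rows)
    return rng.choices(labels, weights=weights, k=1)[0]


print(draw(conn, "orientation"))
```

Changing a percentage then means updating a row through an admin page rather than editing and redeploying the source.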
synecdochic: torso of a man wearing jeans, hands bound with belt (Default)

[personal profile] synecdochic 2012-08-23 07:59 am (UTC)(link)
oh, and: whatever book you get, make sure it contains a chapter on sanitizing user input before trusting it for ANY PURPOSE, AT ALL, EVER, PERIOD. Like 90% of security problems stem from failure to sanitize user input. Just because you tell people to input percentages doesn't mean they won't introduce you to little Bobby Tables.
synecdochic: torso of a man wearing jeans, hands bound with belt (Default)

[personal profile] synecdochic 2012-08-23 08:07 am (UTC)(link)
data should be sanitized:

* when you accept it from the browser
* immediately before you manipulate it in any fashion
* before you write it to the DB
* upon reading it from the DB
* before sending it to the browser

You can pare that down in many contexts but as you work your default attitude should be: sanitize any data that could have come from the user in any way, every time you are going to work with it. once something has touched user-supplied data, it is tainted until you clean it, and everything it touches is tainted until you clean THAT. don't pass around tainted data, don't rely on tainted data, don't pass tainted data ... you get the picture.
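As a rough illustration of three of those checkpoints (on accepting input, before writing to the DB, before sending to the browser), here is a Python sketch. `parse_percentage` and the table are hypothetical, and `sqlite3` is used only so the example is self-contained; the same placeholder idea applies to MySQL drivers.

```python
import html
import sqlite3


def parse_percentage(raw):
    """Accept only a number from 0 to 100; reject everything else."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"not a number: {raw!r}") from None
    if not 0 <= value <= 100:  # this comparison also rejects NaN
        raise ValueError(f"out of range: {raw!r}")
    return value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (label TEXT, pct REAL)")

# A label in the spirit of little Bobby Tables.
label = "Robert'); DROP TABLE shares;--"

# Placeholders keep user text as data; it is never spliced into the SQL.
conn.execute(
    "INSERT INTO shares (label, pct) VALUES (?, ?)",
    (label, parse_percentage("5")),
)

stored = conn.execute("SELECT label FROM shares").fetchone()[0]
# Escape again on the way out, before the text reaches a browser.
print(html.escape(stored))
```

The table survives the insert, and the hostile label comes back out as plain text.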

Some further reading to introduce you to the basic concepts and the ways in which people disagree:

http://coding.smashingmagazine.com/2011/01/11/keeping-web-users-safe-by-sanitizing-input-data/
http://diovo.com/2008/09/sanitizing-user-data-how-and-where-to-do-it/
http://stackoverflow.com/questions/34896/when-is-it-best-to-sanitize-user-input
Edited (clarify a bit) 2012-08-23 08:08 (UTC)
redsixwing: A red knotwork emblem. (Default)

[personal profile] redsixwing 2012-08-23 02:06 pm (UTC)(link)
THIS.

So much.

Also, I am a software tester. Need someone to find holes in your logic/bugs in your webpage? I can be that person. :D
twisted_times: Animated icon saying "Sing like nobody's listening, live like you'll die tomorrow, dance like nobody's watching..." etc (dance)

[personal profile] twisted_times 2012-08-23 10:42 am (UTC)(link)

*sniggers*

I was waiting for Little Bobby Tables to make an appearance. :D

synecdochic: torso of a man wearing jeans, hands bound with belt (Default)

[personal profile] synecdochic 2012-08-23 10:46 am (UTC)(link)
Anyone who accepts user input should be well acquainted with little Bobby, yes. :)
twisted_times: Black on white image of a tiger seen from head on, walking directly towards the viewer. (Tiger)

[personal profile] twisted_times 2012-08-25 05:48 pm (UTC)(link)

Any user interface that accepts data via a user input that contains any sort of character set should be prepared to meet Little Bobby Tables at some point.

ysabetwordsmith: Cartoon of me in Wordsmith persona (Default)

*laugh*

[personal profile] ysabetwordsmith 2012-08-24 06:27 am (UTC)(link)
ROTFL at cartoon.
davv: (Seemingly) technical contradiction (Code)

[personal profile] davv 2012-08-23 05:06 pm (UTC)(link)
I seem to recall that the microcosm version is very hard if you want the conditional probabilities to be right. That is, you can find out how far a departure from probabilities you can get with the best possible allocation, but if you also take combined probabilities like "fraction of population having both properties X and Y", "fraction of population having neither", into account, then finding the optimal (most proportional) assignment of properties to characters is NP-hard - one can reduce set cover to the decision version of the problem.

I could be wrong, but I seem to remember that's the case. If you need only a few characters, you could probably use an integer programming solver to get the best microcosm.
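For a sense of what "best microcosm" means once combined probabilities are included, here is a toy Python sketch for just two yes/no properties. It uses plain exhaustive search as a stand-in for an integer programming solver, the percentages are invented for the example, and the scoring rule (sum of absolute deviations) is an assumption; it says nothing about the hardness of the general problem.

```python
from itertools import product


def best_microcosm(n, pct_x, pct_y, pct_both):
    """Return (both, x_only, y_only, neither) counts for n characters that
    minimise the total deviation from the target percentages.

    Brute force over every split; only workable for small n and two traits.
    """
    pct_neither = 100 - pct_x - pct_y + pct_both
    best, best_score = None, None
    for both, x_only, y_only in product(range(n + 1), repeat=3):
        neither = n - both - x_only - y_only
        if neither < 0:
            continue
        # Integer arithmetic: compare 100 * count against n * percentage.
        score = (
            abs(100 * (both + x_only) - n * pct_x)
            + abs(100 * (both + y_only) - n * pct_y)
            + abs(100 * both - n * pct_both)
            + abs(100 * neither - n * pct_neither)
        )
        if best_score is None or score < best_score:
            best, best_score = (both, x_only, y_only, neither), score
    return best


# Invented targets: 30% have X, 20% have Y, 10% have both.
print(best_microcosm(10, 30, 20, 10))  # (1, 2, 1, 6)
```

With more properties the number of combinations to enumerate grows very quickly, which is where a real solver would come in.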