I'm Brett Slatkin and this is my personal site. I write code. These are my projects:

23 April 2014

Please write in the inverted pyramid style (tl;dr first). Especially email.

21 April 2014

Evolution

Alembic worked well for my MySQL migration. How does it compare to Percona Toolkit? What's better?

20 April 2014

I loathe schemas

I thoroughly enjoyed Gary Bernhardt's talk from PyCon entitled The Birth & Death of JavaScript.

Maintaining open source projects is hard

I gave dpxdt, my perceptual diff tool, some much needed love today. Feels good to be gaining momentum again. After my final commit of the night I searched around for "perceptual diffs" as I do. I came across a similar tool called Diffux that was released by Causes back in February. Somehow I totally missed it! In their announcement post they wrote this:

Before deciding to build Diffux, we scanned the open source market for some alternatives. Dpxdt looked promising, so we gave it a spin. It got the job done, but the project looked abandoned (6 month old PRs hadn’t received any attention, last commit was in August 2013) and we couldn’t get the test suite to run locally. Plus, Dpxdt is written in Python, and we are no Python experts. So there was a bit of a hurdle in debugging and adding functionality.

This is the kind of thing that bums me out. I wish they had sent me an email or something.

For a year I wrote a lot of code. I added features and fixed bugs. I merged contributions from others. I made it easy to deploy to production. And it debuted at Velocity. But what Causes wrote is true. I didn't enhance the project for 6 months after last summer. What can I say? I've been busy. I'd like to blame GitHub for never sending me notification emails. But that's lame.

The truth is I am completely responsible for not making forward progress. I can't be mad. I just wish I had done a better job of maintaining the project.

This is one of the frustrating parts of open source. It's hard to team up with others across perceived boundaries. Yet another example I saw recently is Chef vs. SaltStack. What's the difference? They both do automation. Sure, they have different customers and different architectures. But the obvious difference is Chef is for Ruby people and SaltStack is for Python people. That's all there is to it sometimes.

Anyways, I'm happy to see more perceptual diff tools out there! I look forward to when we all take it for granted.

19 April 2014

When you get the Travis CI build to pass and they still don't merge your pull request.

18 April 2014

Don't reply to email

This week I helped a friend of a friend understand the reality of managing a team for the first time. I mentioned a few productivity habits offhand that they found useful. Reproduced here is how I handle the onslaught of incoming email:

1. All communication must be on mailing lists to create a body of searchable knowledge and overcome the bus factor*.

2. Never reply to an email if anyone else on the thread also knows the answer.

3. Always reply when you have information that nobody else does.

4. If something is important they will email you repeatedly, IM, call, show up in person, etc.

5. Worst-case: Wait a day (or week, or month) and finally reply to an email yourself.


If I didn't do these things I would never find time to design, review, and write code.


* Direct emails for sensitive things are fine, but that's the only exception.

Wonderful description of what to expect from good product managers.

Data fusion has no error bounds

Over the years I've seen attempts to solve what is called the "data fusion" problem. What is that? You have one useful dataset. You have another useful dataset. The goal is to somehow merge them together to create one larger, unified, and more powerful dataset. Sounds awesome! The problem is the two datasets are disjoint and thus have no overlapping sources. There is no simple key with which to join them together.

Companies have been built and busted trying to accomplish this, often in the advertising space. The same idea applies to likely-voter modeling and more.

So can it be done? Let me show you with a simple example.


Imagine you have 3 variables and you want to measure their correlation: X, Y, and Z. For this example let's say X is owning a car, Y is playing golf, and Z is traveling every week for work. Our hypothesis is these have a high correlation.

Say you send 3 separate surveys, A, B, and C, to different groups of random people to measure these variables.

Survey A is like this (X and Y):

Q1. Do you drive a car? [Yes | No]
Q2. Do you play golf? [Yes | No]

Survey B is like this (Y and Z):

Q1. Do you play golf? [Yes | No]
Q2. Do you travel weekly for work? [Yes | No]

Survey C is like this (X and Z):

Q1. Do you drive a car? [Yes | No]
Q2. Do you travel weekly for work? [Yes | No]

Survey A gives you correlation of Car and Golf, variables X and Y. Survey B gives you correlation of Golf and Travel, variables Y and Z. Survey C gives you correlation of Car and Travel, variables X and Z. That leads to this question:

With datasets for correlation of XY, correlation of YZ, and correlation of XZ, can you calculate the correlation of XYZ? This is exactly the data fusion problem. The answer is:

No, you can't. Here's why:



You haven't measured XYZ. How do you calculate it? How can you put boundaries on its size? There are actually 8 set memberships you're trying to determine:



You know none of these.
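
To make the 8 memberships concrete, here's a minimal sketch that enumerates every yes/no combination of the three variables. The names `VARIABLES` and `memberships` are mine, picked for illustration.

```python
from itertools import product

# X, Y, Z from the example: owning a car, playing golf, weekly work travel.
VARIABLES = ("car", "golf", "travel")


def memberships():
    """Return every yes/no combination of the three variables (2**3 = 8)."""
    return [
        dict(zip(VARIABLES, answers))
        for answers in product((True, False), repeat=len(VARIABLES))
    ]


if __name__ == "__main__":
    for cell in memberships():
        print(cell)
```

Each dictionary is one region of the three-circle diagram, from "yes to all three" down to "no to all three". None of the surveys asks all three questions at once, so none of them can count a single one of these regions directly.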

You could assume a uniform distribution of Z in the set XY. Assuming Z (Traveling) is split Yes/No as 40/60 in the general population (the red circle), then also assume it's split 40/60 in the Car & Golf population set (the green section, XY). That sounds reasonable, but there is no way to actually calculate an error boundary on that assumption. You have no idea what the interior of XYZ looks like. It could be a "rogue wave" of correlation, where the distribution of Z (Traveling) in the set XY (Car/Golf) is perfect and the correlation of XYZ is 100%. It could just as easily be the opposite, where the correlation of XYZ is 0%. You have no way of knowing. All of the data measurements you have collected cannot reveal any pieces of the XYZ interior.



Thus, you must assume the error boundary on XYZ is 100%. There's no way to calculate otherwise. If you want to calculate XYZ, you must measure XYZ. No modeling or bias correction can compensate for this. There are two outcomes in data fusion: you measure so you can calculate the error bars, or you make a wild guess.
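
Here's a minimal sketch of why the pairwise surveys can't pin down XYZ. The two populations below are made up for illustration (the names `even`, `odd`, and `survey_results` are mine). They give identical answers on surveys A, B, and C, yet they disagree completely about how many people say yes to all three.

```python
from itertools import product

# Every (car, golf, travel) answer combination: 1 = yes, 0 = no.
CELLS = list(product((0, 1), repeat=3))


def pairwise_table(population, i, j):
    """Share of the population giving each answer pair on questions i and j."""
    table = {}
    for cell, share in population.items():
        key = (cell[i], cell[j])
        table[key] = table.get(key, 0.0) + share
    return table


def survey_results(population):
    """What surveys A, B, and C would measure for this population."""
    return {
        "A (car, golf)": pairwise_table(population, 0, 1),
        "B (golf, travel)": pairwise_table(population, 1, 2),
        "C (car, travel)": pairwise_table(population, 0, 2),
    }


# Hypothetical population 1: only people with an even number of "yes" answers.
even = {cell: (0.25 if sum(cell) % 2 == 0 else 0.0) for cell in CELLS}
# Hypothetical population 2: only people with an odd number of "yes" answers.
odd = {cell: (0.25 if sum(cell) % 2 == 1 else 0.0) for cell in CELLS}

if __name__ == "__main__":
    # All three surveys report the same tables for both populations.
    print(survey_results(even) == survey_results(odd))
    # But the share answering yes to all three questions differs.
    print(even[(1, 1, 1)], odd[(1, 1, 1)])
```

In the first population nobody drives, golfs, and travels; in the second a quarter of everyone does. The surveys see the same thing either way, so no amount of modeling on top of them can tell you which world you're in.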

17 April 2014

We are seriously considering running JavaScript server-side (not node) to solve a problem. I'm simultaneously terrified and delighted.
Published Screen Filter version 15 with KitKat support, a better theme, and new icons. 2.4 million installs.
Excited about PourOver from the NYTimes:

PourOver is a library for simple, fast filtering and sorting of large collections -- think 100,000s of items -- in the browser. It allows you to build data-exploration apps and archives that run at 60fps, that don't have to wait for a database call to render query results.

14 April 2014

Canada still has working pay-phones and they're excellent.

13 April 2014

Fan-in and Fan-out: The crucial components of concurrency

Here's the video from my PyCon2014 talk!

11 April 2014

My talk from PyCon 2014

Code samples are here. Slides are embedded below (use slide forward/back buttons for best effect). Or download a PDF of the slides.



Video hopefully will be uploaded after the conference and I'll repost that too. Update: Here's the video!
Why do we need Tulip? I'll explain the motivation for Guido's asyncio module at my #pycon talk at 5:10pm today!

06 April 2014

Tip: If you aren't getting GitHub notifications properly, switch your primary email account to something else and then back to the original again. This kind of "solution" does not inspire confidence in a platform.

If you're looking for a starter project on Camlistore, build an importer that does this. Groundwork for indexable OCR.
Interesting paper that shows a connection between P ≠ NP and quantum behavior. I wonder if that analysis overcomes Gödel's incompleteness theorem, which says "a system cannot demonstrate its own consistency."
Two Python speed-up tools I learned of this week. Shame I can't use these in my production environment. They sound awesome.

Numba
Numba is a just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). Its goal is to seamlessly integrate with the Python scientific software stack and produce optimized native code, as well as integrate with native foreign languages.

Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features: tight integration with NumPy, transparent use of a GPU...
Epic post about computer vision and machine learning.

03 April 2014

Why doesn't this guy just start a blog? I don't get it.
Woah: Pyston: an upcoming, JIT-based Python implementation. More details.

02 April 2014

Nice demo of how btrees speed up memory access in modern architectures.
"Performant" is still not a word.

31 March 2014

Good explanation of how to make JSON serialization faster in Go.

30 March 2014

Best thing I've read about the Mozilla debacle so far.
Nice little guide to running Go on an Android phone.

29 March 2014

Interesting update about Auroracoin, an attempt to give every Icelandic citizen cryptocurrency.

28 March 2014

Danah Boyd on how hormones affect vision and thus perception of VR (the real title of this article is awful).

27 March 2014

Saw Kraftwerk at the Fox Oakland. Lifelong dream! I got to keep the 3D glasses.

How to reheat pizza

Recently I was at one of my favorite pizza places on Earth, an old family style establishment in Lower Manhattan.

There were a few slices left over, and upon returning with them wrapped in foil our gracious host dropped this knowledge on me:

When the pizza arrives, it's hot: On the top you have molten cheese; below that is thin sauce; then you have soft warm bread followed by that nice crunch of the crust. But as it sits there and time goes by, the cheese firms up, the sauce thickens down into the bread making it soggy, and the crust softens.

When you get home you need to reverse the process.

Get a nice skillet and heat it up. Put the pizza down. You can use butter or oil, whatever. Now think about it: As it heats up from below the crust will get crunchy again; the bread will heat up and soften; the sauce will rise out of the bread and thin; the cheese will heat up and melt. You'll be back where you started.

I followed this man's directions and there's no doubt he's right. This was the best reheated pizza I've ever had. How have I gone a lifetime without knowing this method?

(PS: Yes, I'm paraphrasing his words. I wish I could have recorded the original)
I just flew in from Null Island and boy are my pointers expired.



Actual picture from the display on my flight after landing. (Sorry for the fuzzy photo)
Oldie but a goodie: Overview of cardinality estimators and their tradeoffs.
Notch weighs in on Oculus. This is the kind of post that platform creators dread.

He also linked to this amazing description of the existential crisis you feel when you leave good VR. This is futurism.

21 March 2014

The best analogy I can come up with.

19 March 2014

I can spot the bytes of a pickled Python object from 10000ms away.

14 March 2014

This post about building pipelines in Go is awesome. I would love to see this for all languages.

13 March 2014

With Sony open sourcing their Authoring Tools Framework and Valve their OpenGL debugger, maybe there's a coming renaissance in game development and shared tooling? Or maybe the tools are crap anyways.

12 March 2014

Streamtools looks like an interesting visualization of data pipelines. But I don't think graphical programming for this makes sense. So much of data analysis of any kind is "cleaning" the data to be sure you're counting the right things. That's usually the ugliest type of code you can write.

And Rust won't be fun anymore.

11 March 2014

Roundup of features in the new JavaScript (aka ECMAScript 6).

10 March 2014

Finally some half-way decent networking gear from Intel at 800 Gbps per cable.

In spite of my code's outward appearance, I shall try to write a nice test.

09 March 2014

Catching up on 6 months of GitHub notifications I missed. Why doesn't it email you about pull requests and issues anymore? I feel dumb.

08 March 2014

I stopped showing icons on my desktop with this:

defaults write com.apple.finder CreateDesktop -bool false
killall Finder

And it's been great.
Autodesk Fusion 360 is the buggiest, crashiest piece of software I have used since Adobe Premiere.
© 2009-2014 Brett Slatkin