The Daredevil Camera

Once upon a time I was reading a Popular Mechanics article, the title of which eludes me. Something about playing different music for different parts of a dance floor. They were describing a way to focus sound towards different people.

What struck me about the idea was that there was a way to focus sound. It was a piece of mesh of some sort, which acted as a lens for ultrasonics. This sparked an idea for what ended up being the most complex and expensive of my hobby projects to date.

Imagine using such lenses to focus sound onto a plane of microphones. Just like light in a camera. One microphone is one pixel. An ability to see sound.


Duga-3 radar. Image from Wikimedia Commons.

I didn’t actually read Daredevil comics until much later, but those who have can see where this is going.

For a long time it was just something at the edge of my mind. I was envisioning things like watching ambulances drive past, their color shifting from blue to red, or a firework’s bang lighting up the buildings one by one.

The initial idea was something huge: a truck-sized box with mesh optics and a board of microphones. Completely and utterly impossible to make.

Eventually I had the idea to make the scanning optical camera I described in the previous article, and wondered whether I’d be able to do sound with it as well.

I planned the camera with that in mind – the size of the hole, and the size of the box itself. It should have been able to resolve a basic image in ultrasonic ranges, where waves are short enough, using a simple microphone scanning head.


I made some wave propagation simulations to test the idea.

It should have gotten a few pixels at 16 kHz, with the microphone scanning head filtering and detecting that exact frequency (to avoid motor noise and so on).


Later, I ran the experiments for real, and got nothing but noise.


In hindsight, the inner surfaces should have been made anechoic, and even with that, the walls are a bit too transparent for the sound.

That idea failed.

I kept contemplating it every now and then, whenever I saw microphones sold in bulk for cheap. Instead of a scanning rig, I wanted to do a full matrix. I wanted to see the world in sound, and not at a frame per 30 seconds, but at 30 frames per second.

The scale was always too big to pull off or afford.

Then, I read an article about the FFT telescope — how you can resolve an image from a grid of wave sensors using zero optics and a lot of mathematics.

That was the first breakthrough: I didn’t need to build the box or the mesh optics! The project suddenly collapsed into something portable and plausible.

Along the way, I understood what Duga-3 really was, the “Russian Woodpecker” radio array you see at the top of the article. Ironically this Soviet-era monster, the largest source of radio noise during the Cold War, is the closest thing to what I wanted to make (only of a portable size).

It is a big grid of radio transmitters that can shape a wave.


Steel structure of Duga-1 from the bottom. Image from Wikimedia commons.

The same principle works in reverse.

Imagine an 8×8 grid of microphones pointing at a tone generator, which is moved left and right. What would the grid hear?

Each microphone gets its sound, which is a waveform. The FFT (fast Fourier transform) splits this waveform into its constituent frequencies: an amplitude and a phase for each of a set of “buckets”, with each bucket representing one frequency. This gives us the intensities of the sound waves at different frequencies, rather than a trace of the microphone’s membrane going up and down.
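To make the “buckets” concrete, here is a minimal sketch in plain Python. It is not the camera’s firmware: the 48 kHz sample rate, the 64-sample frame and the test tone are assumptions picked for illustration, and a naive DFT stands in for a real FFT (same result, just slower).

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: one complex value per frequency bucket."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]

SAMPLE_RATE = 48_000  # Hz, assumed for illustration
N = 64                # samples per frame, assumed
TONE_BUCKET = 5       # put the test tone exactly on bucket 5 (3750 Hz)

tone_hz = TONE_BUCKET * SAMPLE_RATE / N
# One microphone's waveform: a tone with amplitude 0.5 and a phase offset of 0.3 rad.
wave = [0.5 * math.cos(2 * math.pi * tone_hz * t / SAMPLE_RATE + 0.3) for t in range(N)]

spectrum = dft(wave)
# For a real-valued signal only the first N/2 buckets are independent.
amplitudes = [2 * abs(c) / N for c in spectrum[: N // 2]]
loudest = max(range(N // 2), key=lambda k: amplitudes[k])
phase = cmath.phase(spectrum[loudest])

print(f"bucket {loudest} ({loudest * SAMPLE_RATE / N:.0f} Hz): "
      f"amplitude {amplitudes[loudest]:.2f}, phase {phase:.2f} rad")
```

The amplitude says how loud the tone is; the phase says where in its cycle the wave was when the frame started, and that is the part that matters once there are several microphones.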

Now contemplate a distant sound source. What would the sound hitting the camera be like, when it is pointed straight at it?


Pretty much all mics getting the same values at once: the wave reaches them all at the same time.
Now, what if the source were somewhere to one side of the grid?


The sound would hit the mics on one side a bit earlier than the mics in the centre and on the other side. The greater the deflection, the higher the frequency of this rolling wave across the grid. For a source moving left and right, we would get waves that are slower, then faster, then slower, and so on.


So, what would we get if we apply 2D FFT to THESE waves, and plot them based on the deflection angle?



And that’s all the magic there is.
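The same trick can be tried in a few lines of plain Python. This is only a sketch under stated assumptions: an ideal distant source, an 8×8 grid at 2 cm pitch, 343 m/s for the speed of sound, and a tone frequency chosen so that the peaks land on whole pixels. The function names are made up for this example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0                     # m/s, assumed
SPACING = 0.02                             # m between microphones
GRID = 8                                   # 8x8 microphones
TONE_HZ = SPEED_OF_SOUND / (4 * SPACING)   # 4287.5 Hz, chosen so peaks hit whole bins

def plane_wave(sin_x, sin_y):
    """Complex value of the tone's FFT bucket at every mic, for a distant source.

    sin_x and sin_y are the sines of the deflection angles along each grid axis.
    """
    cycles_per_mic = TONE_HZ * SPACING / SPEED_OF_SOUND
    return [
        [cmath.exp(2j * math.pi * cycles_per_mic * (col * sin_x + row * sin_y))
         for col in range(GRID)]
        for row in range(GRID)
    ]

def dft2(grid):
    """Naive 2D DFT across the grid: turns phase ramps into image pixels."""
    n = len(grid)
    return [
        [sum(grid[r][c] * cmath.exp(-2j * math.pi * (u * c + v * r) / n)
             for r in range(n) for c in range(n))
         for u in range(n)]
        for v in range(n)
    ]

def fold(k, n=GRID):
    """Map bins above n/2 to negative offsets so that 0 means straight ahead."""
    return k - n if k > n // 2 else k

def brightest_pixel(image):
    n = len(image)
    v, u = max(((v, u) for v in range(n) for u in range(n)),
               key=lambda p: abs(image[p[0]][p[1]]))
    return fold(u), fold(v)

for angle in (-30, 0, 30):
    image = dft2(plane_wave(math.sin(math.radians(angle)), 0.0))
    print(f"source at {angle:+3d} degrees -> brightest pixel {brightest_pixel(image)}")
```

A source straight ahead gives no phase ramp and lights up the centre pixel; deflecting it makes the ramp steeper and pushes the bright pixel sideways.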

Sadly, the idea was still too complex and too expensive to pull off.

Every microphone needs a pre-amp.

The outputs of the pre-amps need to be fed to a fast analog-to-digital converter (ADC).

The ADC outputs then feed into a field-programmable gate array (FPGA).

A routing nightmare, a soldering nightmare…

Then one day I was fixing my father’s iPhone, and noticed that it had an odd microphone in it. It was a chip with a hole, and looked quite unfamiliar.

Some googling later, I discovered the existence of MEMS microphones.

That was the second breakthrough – a MEMS microphone is etched directly in the silicon, and comes with the pre-amp and ADC already on chip. It’s a DIGITAL OUTPUT microphone!


Suddenly, the project collapsed in complexity, and for the first time it was on the edge of feasibility. With the digital microphones, all I really needed were the microphones themselves and an FPGA to process their output.
That made for a simple board, and for an affordable Bill of Materials (BOM).

I got a few of the mics and made a prototype with a spare FPGA board from my home automation system.



This only covered one line out of the grid, but it should prove the concept.


I tested the math, and it seemed to work. I could track a sound source and get an 8×1 image, of a sort.
Time to do it for real.

Even then, it was just barely cheap enough. I had to buy enough components to get the bulk prices (over 25 chips, over 1000 microphones), but not so many that the surplus would make the effective cost too high.

But in between, that U-shaped cost curve dipped just below the line of affordability.


I settled on a 32×32 array, made out of 8×8 cells. Each cell is a self-contained camera, optimized for syncing. The image was to be stored on a microSD card, with a few live output options.


This way I could use a cheap-ish FPGA and cheap-ish board manufacturing. Also, that makes the system flexible. Literally and figuratively.

The boards are to be mounted on a frame that maintains the spacing, with zero force being applied to the boards — the wind should go past them through the gaps.

2 cm between the microphones, 1 cm of gap between the cells.

With 64 channels of data to be sampled at 3 MHz, an FPGA was the only real option. A microcontroller can do one instruction at a time, executing one operation in a sequence: read data, process it, store it. Even the bigger ones would choke on only a dozen channels.

An FPGA, on the other hand, is a software-defined array of logic gates. You can define 64 pipelines of signal processing, all of which work simultaneously. FPGAs are great for signal processing tasks, at the cost of being much more complex to work with at both the circuit level and the software level.
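The article does not say which digital format the microphones use. Many digital MEMS microphones emit a 1-bit pulse-density (PDM) stream at a few MHz, so the sketch below assumes that: a 3.072 MHz bit clock and a decimation factor of 64, which would give the 48 kHz sample rate that comes up later. It shows what one of the 64 per-microphone pipelines would have to do, using a first-order sigma-delta modulator as a stand-in microphone and a crude boxcar average as the filter. All names and numbers here are illustrative.

```python
import math

PDM_RATE = 3_072_000             # Hz; assumed exact value behind the "3 MHz" figure
DECIMATION = 64                  # assumed: 3.072 MHz / 64 = 48 kHz
PCM_RATE = PDM_RATE // DECIMATION

def pdm_encode(samples):
    """Stand-in microphone: first-order sigma-delta, floats in [-1, 1] -> +1/-1 bits."""
    integrator = 0.0
    bits = []
    for x in samples:
        bit = 1 if integrator >= 0 else -1
        integrator += x - bit
        bits.append(bit)
    return bits

def decimate(bits, factor=DECIMATION):
    """Average each block of `factor` bits into one PCM sample (crude boxcar filter)."""
    return [sum(bits[i:i + factor]) / factor
            for i in range(0, len(bits) - factor + 1, factor)]

tone_hz = 1000
n = PDM_RATE // 1000             # one millisecond of signal = 3072 bits
analog = [0.5 * math.sin(2 * math.pi * tone_hz * t / PDM_RATE) for t in range(n)]

pcm = decimate(pdm_encode(analog))
print(f"{n} PDM bits -> {len(pcm)} PCM samples at {PCM_RATE} Hz")
print("first samples:", [round(s, 2) for s in pcm[:6]])
```

A real design would use a proper multi-stage filter rather than a plain average, but the shape of the job is the same: a simple, relentless operation repeated on every bit of every channel, which is exactly what parallel hardware is good at.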

A typical FPGA would need several different voltages, proper decoupling and ground planes, input protection, external ROM, and so on. They come with the most complex and convoluted datasheets I’ve ever used. A far cry from a microcontroller you can just drop in and run with.

I took my time while designing the board. There would be no second chances – even in China a run of 22 quad-layer PCBs cost about $600. So I checked and rechecked the design, contemplating everything that could possibly go wrong.


I put in a bunch of options for the unexpected.

MicroSD card for storage.

Spare pattern for a Flash chip, in SOIC and in DIP.

Spare input and output interfaces.

Spare patterns to allow for in-line, pull-up and pull-down resistors.

I went over the design again and again.

It paid off later, as I ended up using most of the “just in case” options. Eventually, I pulled the trigger, and two weeks later China delivered.

Despite my precautions, I screwed up. It turns out the exposed pad of the FPGA MUST be soldered to the ground, and I had no vias to reach it from below.

I wonder what the datasheet writer was thinking when he decided to mention this critical, need-to-know information only in an inconspicuous footnote hidden several hundred pages deep.

Luckily it wasn’t a show stopper: all I had to do was reflow the chip with the heat gun, with a touch of solder left in between.

But it was a hassle.

Once that was solved, it turned out that the sucker worked!


Well, the LED blinks.

And a little later, I got to the microphones’ data over the debug channel.



A few bad solder joint fixes later…


Much better.

Now, the time had come for the first real test.

There was something magical about it, the trepidation of finally approaching a point when an idea that had been bouncing around your head for a decade was about to become a reality.

After a few months of work and waiting and more work, this was it. I put the cell standing on a table and linked it up.


Then, I went ahead and started waving my phone in front of it, set to generate several tones. Crude software, no frame rate control, bad framing, a hack upon a hack.

But I got a video.


You can see two blobs – one is the phone, and the other is its reflection from the table.

A few blobs, but for me that was magical. Seeing sound, for real, for the first time.

I did some work on the software, got the thing untethered, recording to the microSD card as was originally planned, and started playing.

Now the sound source was the PC speakers, and I was standing some distance away, turning the camera left and right.

You might notice the gaps in the video. Turns out microSD cards are not as well behaved as I’d hoped. They have their own internal logic that can cause arbitrary delays, picking their own time to flush the buffers or erase more flash blocks. While the average write rate is more than fast enough, the latency is unpredictable.

And my hardware can only store one frame at a time, so there is no way to wait. I hoped to fix this later, one way or another, so I moved on to syncing several cells together.
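A toy simulation shows why a good average write rate is not enough when there is room for only one frame. Everything here is made up for illustration: time is measured in frame periods, a normal write takes half a period, and one write stalls for four periods. The second run, with room for four frames, is a hypothetical comparison rather than something the hardware described here can do.

```python
from collections import deque

def count_dropped(write_times, frame_period, buffer_slots):
    """Count frames lost when each frame needs a free buffer slot on arrival.

    write_times[i] is how long the card takes to write frame i once it starts.
    A slot stays occupied until its frame has been fully written.
    """
    finish_times = deque()   # completion times of frames still holding a slot
    writer_free_at = 0.0
    dropped = 0
    for i, write_time in enumerate(write_times):
        now = i * frame_period
        while finish_times and finish_times[0] <= now:
            finish_times.popleft()          # these writes are done, slots freed
        if len(finish_times) >= buffer_slots:
            dropped += 1                    # nowhere to put the new frame
            continue
        start = max(now, writer_free_at)    # writes happen one after another
        writer_free_at = start + write_time
        finish_times.append(writer_free_at)
    return dropped

# Ten frames; the card is fast except for a single stall on frame 3.
write_times = [0.5, 0.5, 0.5, 4.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
average = sum(write_times) / len(write_times)

print(f"average write time: {average:.2f} frame periods")
print("dropped with 1 slot: ", count_dropped(write_times, 1.0, 1))
print("dropped with 4 slots:", count_dropped(write_times, 1.0, 4))
```

The average write time stays under one frame period, yet the single stall still costs frames unless there is somewhere to park them in the meantime.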

It takes an evening of boring work to populate one cell’s PCB, so many podcasts later I got myself a 2×1 array.


The small board is the controller: it sends the trigger signals to the cells, letting them start recording a frame at exactly the same time.

The frame is stored on the microSD card.

I found that at around 10 FPS I can avoid most of the latency issues, so that became the go-to hack for the moment.

Let’s look around at the glorious 16×8 resolution.


The redder the blob, the lower the frequency, the bluer, the higher.

The lower the frequency, the lower the precision with which it can be located. So random noises show up as big red blobs.

It has to do with the wavelength — if the wave is much larger than the array, then you can’t really detect its direction any more.
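A common rule of thumb puts numbers on this: the smallest angle an array can resolve is roughly the wavelength divided by the width of the array. The sketch below applies it, assuming 343 m/s for the speed of sound and about 32 cm for the long side of the 16×8 array (16 microphones at 2 cm pitch, gaps ignored). It is an order-of-magnitude estimate, not a measurement.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air at about 20 °C, assumed
APERTURE_M = 0.32        # assumed width of the 16-microphone side at 2 cm pitch

def angular_resolution_deg(freq_hz, aperture_m=APERTURE_M):
    """Rule-of-thumb blob width: wavelength / aperture, converted to degrees."""
    wavelength = SPEED_OF_SOUND / freq_hz
    return math.degrees(wavelength / aperture_m)

for freq in (500, 2000, 8000):
    print(f"{freq:>5} Hz: wavelength {SPEED_OF_SOUND / freq * 100:5.1f} cm, "
          f"blob roughly {angular_resolution_deg(freq):5.1f} degrees wide")
```

At 500 Hz the wavelength is about twice the width of the array and the estimate exceeds 90 degrees, which is another way of saying the direction is essentially unknown.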

But, something else was wrong. It took me a while and some tweaking to find what it was exactly.

Here is a video of the array sitting still, looking straight at a tone generator.


See the double-single blinking?

For some reason, the cells were not triggered at exactly the same time. A timing error caused one of the cells to skip an entire step of the 48 kHz sampling clock. This might not sound like a lot, but it was huge in a system that is designed to measure sound wave directions by determining their phase shifts over a grid. During that delay of roughly 20 microseconds the sound travels about 7 mm, which is a third of the distance between microphones. That breaks the pattern.
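The size of that skew is easy to check. The sketch assumes 343 m/s for the speed of sound and the 2 cm microphone pitch; the numbers come out close to the ones quoted above. It also shows the phase error the skew introduces at a few tone frequencies, which is what actually corrupts the image.

```python
SPEED_OF_SOUND = 343.0   # m/s, assumed
SAMPLE_RATE = 48_000     # Hz
MIC_SPACING_M = 0.02     # 2 cm between microphones

skew_s = 1 / SAMPLE_RATE                 # one skipped sampling step
travel_m = SPEED_OF_SOUND * skew_s       # how far sound moves in that time

def phase_error_deg(freq_hz, delay_s=skew_s):
    """Phase shift that a fixed time delay adds to a tone of the given frequency."""
    return 360.0 * freq_hz * delay_s

print(f"skew: {skew_s * 1e6:.1f} microseconds")
print(f"sound travels {travel_m * 1000:.1f} mm, "
      f"{travel_m / MIC_SPACING_M:.0%} of the microphone spacing")
for freq in (1000, 4000, 12000):
    print(f"{freq:>5} Hz tone: phase error {phase_error_deg(freq):.1f} degrees")
```

To the 2D FFT, a late cell looks like a block of microphones pushed back by a third of a grid step, so the phase ramp across the cell boundary no longer matches any single direction.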

I tried to fix things that looked like they might be the source of the problem, then tried to scale up and clean up the array.

Perhaps the issue was in some flaky wiring…




Two more cells in, and here is the same “looking around” performance in even more glorious 16×16 resolution:


There are a lot of reflections visible now – from the walls, ceiling, furniture and so on. Also, the blinking, while more dissolved, is still there. Apparently the timing errors are not gone yet.

At the same time, the microSD card writing just does not work as well as I would have liked. Different cards have different latencies, and even at 10 FPS, I was losing frames after tens of seconds of runtime. Not to mention that removing all the cards to plug them into a bunch of card readers is not quite a frictionless way to get images.

I wanted to explore, to see things in real time, not minutes or hours later, back home, plugging the cards in only to find lost frames.

I needed a non-storage processing pipeline…

I had a bulk interface on the PCBs, in anticipation of a centralized sampling approach. One cell produces a 4 Mbps data stream, and I figured it would take another FPGA board to poll them all, process the data and drive an LCD.

However, the processing in question was hugely complex for an FPGA implementation. I would have to make another special FPGA board, and figure out a whole new system, and figure out a way to add a visible light camera to it so I could tell what exactly I was looking at…

For months, the project was dormant. Other projects came and went. One of them helped me figure out how cheap a powerful x86 computer is these days.

And then I realized that I don’t need to make the whole data processing pipeline in hardware. All I needed to do was to get the data out of the cells and into a little PC at the full speed.

These days a powerful enough computer to render for this thing would be about the size of one of the cells, and would come with a bonus of being able to run a plain vanilla webcam for keeping track of what is recorded in the mysterious sound blobs.

Unfortunately, we are still talking about a total of 64 Mbps of data. That needs a USB 2.0 sampling board that would pull the data out of the cells over the bulk interface: another FPGA, albeit a much simpler one this time.
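The 64 Mbps total follows directly from the array layout. A quick check, using only figures from this article plus the nominal 480 Mbit/s signalling rate of high-speed USB 2.0 (real-world throughput is lower):

```python
ARRAY_SIDE = 32             # microphones per side of the full array
CELL_SIDE = 8               # microphones per side of one cell
CELL_MBPS = 4               # data stream produced by one cell
USB2_SIGNALLING_MBPS = 480  # nominal high-speed USB 2.0 rate

cells = (ARRAY_SIDE // CELL_SIDE) ** 2
total_mbps = cells * CELL_MBPS

print(f"{cells} cells x {CELL_MBPS} Mbps = {total_mbps} Mbps "
      f"({total_mbps / 8:.0f} MB/s)")
print(f"share of nominal USB 2.0 rate: {total_mbps / USB2_SIGNALLING_MBPS:.0%}")
```

So the raw bandwidth fits with room to spare; the hard part is delivering it steadily, without the gaps that plagued the microSD approach.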

This would take time to make, to try and debug, and hopefully that second part of the project would deliver the true promise of the sonic vision…


About Artem Litvinovich

Artem is a hacker and garage tinkerer based in Moscow. His day job is making software for a telecom company. You can check out his projects at


  1. That seems incredible, your ambitious research and development efforts beyond the technology. I’ve read about an article a while ago, an academic research. I’m sure you’ve seen but that had made me excited that time so I’d like to share the link. ( )

    I have a question in mind. Since it uses sound as the source, would it be effective to add another receiver, like a stereo recording, to compare two results and remove the noise?

    • Not entirely sure how a recorder could help. This is intended to pick up sound like camera picks up light, a sound source is like a flash – useful, but not mandatory.

  2. 2D ultrasound uses a linear array of 5 MHz transducers and maps the reflections to an image. The delay in the reflection is mapped to depth. As the probe is scanned across the body a cross sectional image is painted. I wonder how that data is processed and stored. No room for latency confusion in that pipeline.

  3. Great work! With the PRUs on a beaglebone black you can clock ~15 bits of parallel data in at 10s of millions of times per second and DMA it into main memory. 64 Mbps (bits, not bytes, right?) should be no problem.

    • close! the beaglebone can’t handle the processing of the data, so then there’s added usb or ethernet latency.

      I’d bet there’s a PCI-e data acquisition card that can handle the task with very little latency

  4. Amazing! I use beamforming systems similar to this as an acoustics engineer – it’s incredible that you have constructed one by yourself.

  5. I have pondered this for many moons as well. In searching around, I found one company that makes a commercial product that does exactly what you are trying to hack together. The mics are in a spiral pattern, in a thingie that looked maybe a half-meter around. It costs something like 32 kilobucks (USD), which is far too much for my gosh-that’s-cool-gotta-have-it budget! Hence, the desire to build one, like you are doing. Just not had the time. Look around, you may find it, I forget the name.

  6. Heh, I made a ten-dollar version in 1981: eight LEDs glued to an old vinyl record album, each with an electret microphone and an op amp. Spin the disk with a motor at about 4 Hz. Wind noise, so add foam to the mics. It worked, and the best part was to hold some headphones near it while playing a 10 kHz sine wave. This produced interference stripes on the scanning disk! I hoped to build a huge version someday, but somebody beat me to it: an MIT student who patented the dataglove, then used the money to work at the Exploratorium Museum for a year (as Artist in Residence), building a disk about 6 ft wide. Since then we have companies selling DIDSON full-blown ultrasonic video for underwater, also 3D phased-array microphones positioned on a wire sphere.

  7. If you want easy prototyping for live input, just get a Saleae:

  8. Love your vision & execution – bring on the next instalment! :D

  9. Have you tried a SD card that sends data through a WiFi connection, instead of storing it?

  10. Jeromy Evans says

    Amazing persistence!

    This works much better underwater.

    You will enjoy reading articles on Acoustic Daylight. They take the concepts described through to trying to visualize a scene using passive acoustics.

  11. Sounds fun, please keep on experimenting and sharing your process.

  12. Really really interesting hobby you have! I also love that you do not stop after some hiccups. Two years ago an Eindhoven startup developed a similar device: and they have really cool 3D analyzing software (youtube). I hope their work will inspire you to continue your hobby and maybe more. I look forward to reading your next post!

  13. real time fft of 64 channels at 48kHz… sounds like a ~2GHz core for a single 8×8 cell. I would say that poll-based USB is a no-go for a project requiring strict timing. Have you considered an FPGA board with a built-in Ethernet controller? Then you can pump the data via Ethernet to any computer and store it in RAM or on an SSD.

  14. I made a 7000 pixel sound camera a couple of years ago:

    Its images are quite clear and recognisable. The math was tricky.
    I solved a lot of problems by making it slow. Yours is much faster.

  15. Stephen G says

    Very cool. One option we use for comms from FPGA to computers is to put an Ethernet PHY and spit data out to it via a MII interface, which is usually easily available as a logic core. You don’t need a full IP stack to generate UDP packets and you can get fairly low-latency communications with a direct connection or just a layer-2 switch.

  16. Norvan Gorgi says

    Wow. So incredibly badass!

  17. Great project!
    For processing data from those boards you can try to use this:

    SOC with dual ARM core, integrated FPGA (rather small), bunch of IO’s and 16 core accelerator. And it’s not very expensive. Also can run Linux and has HDMI output.

  18. Mandar Chitre says

    Cool hobby project!

    You might want to check out this acoustic camera that does exactly what you want, but at a slightly larger scale:

  19. Hi!
    One thing you might want to look at for your SD card is the size of the clusters.
    This can have a tremendous impact on SD access performances, depending on your implementation of the FAT access layer, and the size of the files you write.
    In a project I worked on, we had problems with SD cards formatted with Windows or GParted. These tools try to strike a balance between access speed and space lost when writing small files, but that might not be the best for your application.
    Bigger cluster size will get you access to more sectors without the need to go back to read the FAT, and in implementations like FATFS this makes a world of difference.
    Example : if you have clusters of 8 sectors, and sectors of size 512 bytes, every time you write 4kBytes you have to go back to read the FAT to know where to write stuff next. FATFS has a cache of a small part of the FAT to avoid accessing the SD card each time it needs to find the address of the next sector but if the next address is not in it, then it will have to re-read sequentially many chunks of the FAT in the SD to find the FAT entry for the address you want, giving you long delays in the middle of writing your file…
    With clusters of 512 sectors for example, you instead get to write 256kBytes before the next time the FAT has to be read, potentially making things that much faster. The number of entries in the FAT will also be 64 times smaller, resulting in a much smaller time spent searching the FAT too.
    The downside is that anytime you start using a cluster for a file, the whole cluster is reserved for it, even if it’s smaller. So a 10 Bytes file would take 256kBytes of the SD card, an image of 257kB would take 512kB… but in the age of SD cards with hundreds of GB it should not be too much of a problem.
    All values of cluster size are not supported by all FAT access layer implementations, so you would have to test with progressively larger cluster sizes.

    With tools like mkdosfs you can choose all the parameters of your FAT.

  20. Wow, that’s some impressive work, and inspiring determination!

    I wonder if you could use this to get a 3-D scan of the room. If you hooked up a signal generator to a speaker then the strength of the reflected soundwaves received would give an indication of distance. Is it feasible without sweeping like sonar…

  21. Joey Bloggs says

    Dangerous Prototypes has the FT2232H breakout which is a USB 2.0 device allowing up to 320MBit data transfers over USB.

  22. Sebastian Egner says

    Really cool project!

    Interpretation of the sound image is much easier if you overlay it with visible video. Have a look at these people from Berlin:
    (It is a spin-off of

  23. Xmos had put together some demonstrations and reference hardware for up to 32 MEMS mics.

  24. This is pretty legit! Have you considered using this as an aid for the deaf?

  25. Robert Spies says

    Many years ago I came up with an idea for a `real-time’ `Radio Camera’. I first learned of the idea of a radio camera in the writeups of the radio science experiments at Platteville, CO, in the 1970s. That system had 32 antennas in a 1 km diameter circle. Each antenna had a receiver and a digitizer. The data was resolved off-side by a main-frame computer. They successfully imaged the ionosphere with about a two-degree resolution in two dimensions.
    My idea was to convert such a system to a `real-time’ radio camera using a standard NTSC video monitor for the display. The outputs of all the receivers would be summed to one signal, to make the `video’ signal fed to the monitor CRT. I proposed using a digital delay system in the signal path from each receiver to the summing point. If you map the video display to an area of the sky, the delay needed for each receiver at each pixel point can be calculated rather easily. Note, however, there is a single scanning point, so the delays must be dynamic, slaved to the TV scan. The amplitude of the H and V scan signals for a particular antenna are adjusted to suit the location of the antenna in the array. Additional delay input can be added to provide `beamforming’, i.e., to look at different parts of the sky, magnification, range. While the Platteville antennas were arranged in a circle, I propose having as many antennas, receivers, delays as possible, randomly located over an area at least several wavelengths in extent in each direction. If the `radio camera’ system is used in conjunction with an ionosonde, `range gating’ becomes possible. The receivers should be tunable, all operating from common heterodyning oscillators. Great attention must be paid to the RF phase at each point in the system.

    The same idea could be implemented in an acoustic system. In the days of my original idea, implementation would have been difficult. Nowadays, an FPGA in each receiver channel would make the system rather easy to build. Use microphones in a random array to reduce artifacts of an array. I thought of producing analog ramp signals following the TV H and V sweep, with zero voltage representing the middle of the screen. The circuitry for each antenna/microphone would include two potentiometers, one for H, one for V. The setting of each pot would be a representation of where the antenna/microphone is in the array. The analog output of each pot would go to an AtoD to control the H or V delay in each channel FPGA. The overall amplitude of all H and V signals would control the `magnification’ of the system. Note that the `bandwidth’ of the resulting video signal will likely be much greater than the bandwidth of the received signals. Each antenna/microphone `channel’ would be identical, and equipped with one H and one V `mapping’ potentiometer. Should not be too hard to build, but probably beyond the means of an individual.

  26. Thank you for a very interesting article that I have linked to from The Hayfamzone Blog which is all about comic books.

  27. This looks really awesome!

    I don’t really know much about SD Cards, but have you tried using one of those class 10 cards? Some of them are meant for 1080p or even 4k video recording. Maybe you could get better results with one of those.

  28. That is very interesting; you are a very skilled blogger. I have shared your website in my social networks!

  29. I came, I read this article, I conquered.

  30. Robert Spies says

    Many years ago, I had an idea for a `Real-Time Radio Camera’. The same principle would apply to a `Real-Time Sonar Camera’. In both radio and acoustics phase is critical. Each receiver/microphone preamp must have identical phase characteristics, and also matched gain. To make an image from an array of antennas/microphones, the signal from each antenna/microphone must be delayed by some small amount, depending on the location of the particular antenna/microphone in the array, and also the desired `look angle’ and the `solid angle’ to be imaged. The `position’ in the array and the area to be scanned are constant, while other parameters are derived from the `scanning’ of the display; I intended to use a standard TV monitor. For the delays to work, a small amount of `shift-register’ memory is needed, and also a means to select a particular step in the memory as a function of the summed parameters. This part of the system would be implemented digitally, using FPGAs or other programmable logic, one for each antenna/microphone. Since so much of the circuitry is identical, it is amenable to mass production of the circuits for each antenna/microphone. The final output of all the FPGAs is simply summed, and used to make a standard video signal for the monitor. The summing is most easily done by using a `current output’ DtoA in each channel. With some additions, the shift registers could be stopped, for continuous examination of a particular image. In addition, with a pulsed source, and some synchronizing circuitry, `range gating’ could be implemented.

    The original inspiration for me was the `Radio Camera’ in Platteville, CO in the 1970s. This system was not real-time. The Platteville `Radio Camera’ was used in conjunction with an Ionosonde and an Ionospheric Heater to study the Ionosphere, the radio-reflecting layer of the high atmosphere. This was a pre-cursor to HAARP. Radio Science is no longer done at Platteville. A radio system will also need to be `tuned’ to a particular radio frequency, and if the antennas are `crossed-dipoles’, polarization effects can be observed.

    If the Platteville `Radio Camera’ was used alone and tuned to a distant short-wave transmitter, it was easy to see that during `magnetically quiet’ times, the Ionosphere looks like the inside of a Christmas tree ball at the radio frequency — the transmitter made a single spot in the sky. During a `magnetic disturbance’ the ball looked as if it was made of crinkled Aluminum foil — the distant transmitter was imaged in many parts of the sky — thus `selective fading’ with interference in a single antenna.

    A sort of super deluxe Diversity receiver could be made with an additional set of FPGAs — without the scanning inputs, but instead settable scan point values, (which could be derived from a touch-screen’), or better yet, from circuitry that watched for the strongest video spike and grabbed the instantaneous scan values.

    During construction, it is important to note that the `bandwidth’ of the resulting video signal will likely be much greater than the bandwidth of the receivers.
