
A reflection about automatic transcription of music

I am assuming here that we are talking about taking a recording of some music and producing written music in standard notation. The purpose of this would be either to enable a musician to play the piece by reading the music, or perhaps just to understand the music better, for educational purposes.

From time to time I get messages from people telling me that they have solved, or are about to solve, the problem of automatic transcription. When I ask them for more information it generally turns out that they have written a program to do what I call "note detection" on simple material. That means taking a musical recording and detecting what notes are being played at what time. This is not very difficult as long as it is something simple like solo piano, and Transcribe! already does this - see the Piano Roll view. What they generally don't understand is that getting from note detection to a written transcription in standard notation is absolutely not a trivial task, and in fact is much harder than basic note detection. Of course note detection also becomes very difficult when the musical material is more complex and has more instruments playing.

This note is addressed to those people, and to anyone else who thinks that computers should be capable of automatically transcribing music from recordings.

Many people, especially if they have not had much experience of transcribing music, imagine that this is a process of identifying the notes being played and translating them into notation. This is not entirely false but it is a dangerously simplistic view. Written music is a series of instructions to the musician, which the musician will interpret as they see fit. Therefore the true purpose of transcription is to try to decide what series of notational instructions would cause a musician to play the music that we are hearing, and write down that series of instructions. The process of transcription from sound to notation is not really a translation; it is more like reverse engineering.

To illustrate this with an analogy, imagine someone is sitting in a chair and they have a slip of paper with some instructions. The instructions say:

  1. Stand up, turn a full circle, and sit down again.
  2. Scratch your head with your right hand.
  3. Clap your hands three times.
  4. Smile.

The person carries out the instructions and the computer's job is to look at the video and try to figure out what instructions they were following. Without going into details, I trust you can see that this is not easy. If the computer generates instructions saying "move this leg in this direction, raise that arm..." then these instructions will be almost useless. We can imagine a person trying to carry them out, and then finally saying "Ah! You mean that you want me to turn around". For a successful transcription, the computer needs to recognise the person's intention (to turn around) and write that down.

Suppose a piano player plays a run, but some of the notes overlap because the second note starts before the first has been released. Should we notate these overlaps? If we do, then the result will be a mess, hard to read. Probably we should just notate a run with no overlaps. We should notate the intention, not the execution. A good musician does not play rhythms in the way that a strict definition of the note values would suggest - if they did, it would sound mechanical, unmusical. To notate a played rhythm we must figure out the intention, the meaning of the rhythm. This is what quantisation is about, of course, and it's not easy to get it right. In general a knowledge of the musical style and of the particular instrument is necessary in order to produce a useful transcription. For instance, if a guitar is strumming, then a complete transcription of every note played is probably not appropriate. Instead, chord symbols with an indication of the strumming rhythm would be more useful. If we did present a guitarist with a complete transcription of every note then it would be very complex (because the notes in a strummed chord are not played simultaneously), and we can imagine the musician struggling with it for a while before saying "Ah! You mean that you want me to strum a G major chord".
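
To make the quantisation point concrete, here is a minimal Python sketch of the naive approach: snapping each note onset to the nearest point on a fixed grid. The function name `quantise_onsets`, the sixteenth-note grid and the sample timings are all hypothetical, and the sketch assumes that the tempo is constant and that the first beat falls at time zero, which in real recordings would have to be worked out first.

```python
from fractions import Fraction

def quantise_onsets(onsets_sec, tempo_bpm, divisions_per_beat=4):
    """Snap onset times (in seconds) to the nearest grid point.

    Returns positions measured in beats, as Fractions.
    Assumes a constant tempo with beat zero at time zero.
    """
    beat_len = 60.0 / tempo_bpm  # duration of one beat in seconds
    positions = []
    for t in onsets_sec:
        beats = t / beat_len
        grid_index = round(beats * divisions_per_beat)
        positions.append(Fraction(grid_index, divisions_per_beat))
    return positions

# Made-up timings for swung eighth notes at 120 bpm:
# each off-beat falls roughly two thirds of the way through the beat.
swung = [0.0, 0.33, 0.5, 0.83, 1.0, 1.33]
print([str(p) for p in quantise_onsets(swung, tempo_bpm=120)])
# ['0', '3/4', '1', '7/4', '2', '11/4']
```

On a sixteenth-note grid each off-beat lands on the last sixteenth of the beat, so the naive result is a dotted rhythm. A human transcriber would more likely write plain eighth notes with an indication that they are to be swung, which records the intention rather than the execution.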

All rhythms are notated in relation to the beats of a measure (bar) so before you can even begin to think about how you will notate the rhythms you must decide where the beats and the measures are. For some material this may be easy, for instance if there is a bass drum hitting the first beat of every measure, but as a general matter it is not necessarily easy at all for the computer. Even musicians can find this difficult if the musical style is unfamiliar to them. And if you get it wrong then your transcription will not make much sense.
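
As a small illustration of why locating the beat is ambiguous, the following Python sketch estimates a pulse from the spacing between note onsets. It is only a toy under stated assumptions: the function name `candidate_tempos` and the sample onsets are hypothetical, the input is assumed to be a clean list of onset times, and the median inter-onset interval is just one simple heuristic, not an established beat-tracking method.

```python
from statistics import median

def candidate_tempos(onsets_sec):
    """Return plausible tempos (bpm) implied by evenly spaced onsets.

    The median gap between onsets gives one pulse, but the same gaps
    are equally consistent with a beat twice or half as long.
    """
    gaps = [b - a for a, b in zip(onsets_sec, onsets_sec[1:])]
    period = median(gaps)   # seconds between typical onsets
    base = 60.0 / period    # tempo if every onset were a beat
    return [base / 2, base, base * 2]

# Onsets every half second: quarter notes at 120 bpm, eighth notes
# at 60 bpm, or half notes at 240 bpm? The spacing alone cannot say.
print(candidate_tempos([0.0, 0.5, 1.0, 1.5, 2.0]))
# [60.0, 120.0, 240.0]
```

Choosing between these candidates, and then deciding how the beats group into measures, needs musical knowledge that is not present in the onset times themselves.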

The system of standard musical notation has existed for hundreds of years, and has evolved so that all the things which are normally played in western music have conventional representations on the page. For example, to write a chord consisting of B, Eb and F# is almost certainly wrong, even if the notes are correct. It should be B, D# and F#, or possibly Cb, Eb and Gb. Similar issues apply to rhythms, key signatures, time signatures, and indeed every aspect of written music. There are always a small number of ways of notating something readably and infinitely many ways of notating it badly, which will cause musicians to recoil in horror when you ask them to play it.
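
The chord-spelling example can be sketched in Python. This is a minimal illustration rather than a general solution: the names `spell` and `spell_triad` are hypothetical, pitch classes are numbered 0 to 11 with C as 0, and the sketch assumes we already know the chord is a triad and which letter its root should take. Making that last choice depends on the key and the musical context, and is the hard part.

```python
LETTERS = "CDEFGAB"
NATURAL_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTALS = {-2: "bb", -1: "b", 0: "", 1: "#", 2: "##"}

def spell(letter, pitch_class):
    """Spell a pitch class (0-11) using the given letter name."""
    # Signed distance from the natural letter, folded into -6..5.
    offset = (pitch_class - NATURAL_PC[letter] + 6) % 12 - 6
    if offset not in ACCIDENTALS:
        raise ValueError(f"cannot spell {pitch_class} as {letter}")
    return letter + ACCIDENTALS[offset]

def spell_triad(root_letter, pitch_classes):
    """Spell root, third and fifth on alternate letters (stacked thirds)."""
    start = LETTERS.index(root_letter)
    letters = [LETTERS[(start + 2 * i) % 7] for i in range(3)]
    return [spell(l, pc) for l, pc in zip(letters, pitch_classes)]

b_major = [11, 3, 6]  # the sounding notes, as pitch classes
print(spell_triad("B", b_major))  # ['B', 'D#', 'F#']
print(spell_triad("C", b_major))  # ['Cb', 'Eb', 'Gb']
```

Both readable spellings put the three notes on alternate letter names. The spelling B, Eb, F# breaks that pattern, which is why it looks wrong on the page even though it sounds the same.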

And now here's a sobering thought: all the issues I have so far discussed apply even if the computer starts with a perfectly accurate list of which notes were played at what time. But when the computer is starting from an audio recording, with anything other than the very simplest material, the list of notes it detects will have many inaccuracies. This will multiply the problems many times over. I'm not saying it's impossible - after all, a skilled musician can do it - but I am saying that it might well be AI-complete. Also see the footnote at the bottom of this page.

Then you must ask what the output of your program will be used for. The chances are that it will, on anything but the simplest material, contain large numbers of mistakes, or notations which might be technically correct but unreadable. So the first thing the user will need to do is correct them. This means that the user needs to be capable of transcribing music themselves, so your program does not replace the skills of a human transcriber; the best it can do is save them a bit of time. If your program manages to produce something that's close enough to be used as a starting point then it might be useful. There are already various programs which attempt to do this, see here, but I have yet to hear of anyone who finds them useful.

Of course if you are aiming to output MIDI then the problems are rather fewer. MIDI does not distinguish between D# and Eb, and the question of how to notate rhythms does not arise. Your main problems are identifying the notes in the audio, locating the beat, and deciding how many beats there are to the measure (bar). This is not easy either, though it is possible on some material. If you are producing MIDI output, though, you have to ask whether it will be useful, and if so what for. Note that if you don't correctly identify where the beat is and how many beats there are in each measure, then you will get a MIDI file which might play OK but which will be difficult to make sense of when you load it into a MIDI editor. The beat detection problem disappears if we are talking about a performance which was played to a MIDI click track, and in this case useful results can be achieved on some material. But then we are no longer talking about the general question of transcribing music from an existing audio recording.
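
The remark that MIDI does not distinguish between D# and Eb can be shown in a few lines of Python. The helper name `midi_number` is hypothetical, and the octave labelling assumes the common convention in which middle C is written C4 and has MIDI note number 60; some software labels octaves differently.

```python
NATURAL_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def midi_number(name, octave):
    """Convert a note name such as 'D#' or 'Eb' to a MIDI note number.

    Assumes middle C is C4 = 60 (a common convention, not the only one).
    """
    letter, accidentals = name[0], name[1:]
    shift = accidentals.count("#") - accidentals.count("b")
    return 12 * (octave + 1) + NATURAL_PC[letter] + shift

print(midi_number("D#", 4))  # 63
print(midi_number("Eb", 4))  # 63
```

Both spellings collapse to the same number, so a program that stops at MIDI never has to choose between them. Going the other way, from the number back to a readable note name, is where the notational decisions begin.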

I've mentioned the importance of knowing what the transcription is to be used for, and here I will say a little more about this. If the transcription is being produced out of academic interest, or for educational purposes, then the ideal would be to produce a full score with a stave for each instrument used in the original recording, which of course can be very difficult. For this approach you will need a thorough knowledge of what the various instruments are capable of and how they are played - for instance, suppose you are transcribing a solo guitar performance. In that case we know that what we are hearing is somehow playable on a solo guitar, and this is very useful information in guiding our transcription - if we find ourselves writing something that cannot be played then either our transcription is wrong, or the player has used some double tracking on the recording, or the player is using an ingenious playing technique that we are unaware of.

In my experience professional musicians generally transcribe music for entirely practical purposes, typically because they want to play the piece themselves. This means that the transcription needs to be tailored for the combination of instruments it will be played on, regardless of the instruments used on the original recording. There is no point in producing a detailed transcription of the drum part if you intend to play the piece on solo piano. In some ways this can make the job easier as you can ignore many details - for instance, suppose that there is a C major chord being played on a keyboard. It may not be easy to decide exactly how the chord is voiced, or whether the guitar is also present in the mix - but quite possibly you don't need to know these things for the purpose of your transcription. On the other hand, you will find yourself facing many judgement decisions about which details of the original are important and must be retained, and which can be discarded. Does the piece have a distinctive bass part that should be transcribed note for note, or is it sufficient to give the bass player chord symbols? Is there a distinctive guitar lick which needs to be present or people won't recognise the song? My point here is that you must choose an approach depending on the purpose of your transcription, and whatever approach you choose, you will face great difficulties.

Incidentally, writing chord symbols can be quite an art too, depending on the complexity of the material. For example, if additional notes appear due to the presence of a moving line, should we include those in our chord notations, or should we simplify? How should we indicate rhythmic accents?

Footnote 2024 — Transcription by Artificial Intelligence

The new generation of AIs is making amazing progress in creating natural language text, and they handle grammar better than many native speakers. The problem of notating music in a readable form, as opposed to notations which might be correct in some technical sense but unreadable for a human, could be seen as a grammatical problem. This might make it possible for an AI to overcome many of the difficulties described above.

I have no intention of tackling this, but for anyone who has, here is the sort of conversation we would want to be able to have with such an AI:

User: I want you to do a transcription of this song. [uploads an mp3]
AI (after a pause): I'm hearing a singer, backing vocals, guitar, piano, bass, drums and a horn section. Do you want a full score, individual parts, a lead sheet, or what?
User: I want an arrangement for singer and four-piece rhythm section. Forget the backing vocals and mark the horn figures on the keyboard part.
[AI produces some pages of music].
User: Ok now take it up a tone to G major and do it in 4/4 instead of 12/8, with a note at the top of each part to say that eighth notes are to be swung. Also put in some rehearsal letters.
[AI produces some pages of music].
User: Now we need some repeats. The section B-C-D is basically the same as the section D-E-F so do it as a repeat with first and second time bars. Then at letter H, instead of writing it out to the end, you can use a DS al Coda.
[AI produces some pages of music].
User: Now simplify the guitar part - the guitar should play the unison line at letter C and each time it occurs, but otherwise just use chord symbols.
[AI produces a new guitar part].
User: Ok now add the title and composer, and show me the pagination.
[AI shows the pages].
User: Now print it.

