Finding Characters
It’s hard to find one particular character among more than 155129 codepoints. This site aims to make it as easy as possible with the following search options:
- Free search: Just press the “Search” tab above or use the form on the front page and type a query. In many cases the codepoint in question is in the result.
- Extended search: On this page you can specify every Unicode property of the codepoint in question.
- The “Find My Codepoint” wizard: Answer a series of questions to get to your character.
If you happen to already have the character in question, just paste it into the search box. This takes you directly to its description page.
Do you know the character’s shape?
You don’t know the name or any properties of a codepoint, only its general look? Fear not, on Shapecatcher you can draw the character and get it recognized. This works remarkably well for many characters.
Advanced Options
If you know Unicode and also know the rough range where the codepoint might be, you can give the range directly in the URL. For example, to inspect characters in the range U+0200 to U+0300, enter “codepoints.net/U+0200..U+0300” in the address bar.
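As a rough sketch, such a range URL could be assembled as follows. The function name range_url is hypothetical and not part of this site; only the “U+XXXX..U+YYYY” pattern is taken from the example above.

```python
def range_url(first: int, last: int) -> str:
    """Build a range URL in the form codepoints.net/U+0200..U+0300."""
    if not (0 <= first <= last <= 0x10FFFF):
        raise ValueError("expected 0 <= first <= last <= 0x10FFFF")
    # Codepoints are written with at least four uppercase hexadecimal digits.
    return f"codepoints.net/U+{first:04X}..U+{last:04X}"


print(range_url(0x0200, 0x0300))  # codepoints.net/U+0200..U+0300
```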
About Unicode
Computers use 0’s and 1’s to store information. To get useful information out of that, in our case to display text, we need a so-called encoding that tells the computer how to transform those 0’s and 1’s into an alphabet. The first standardized encoding was ASCII, which assigns the basic Latin upper- and lowercase letters, the digits and some punctuation to 128 positions in total. The W3C has published a very good introduction to the topic of character encodings.
128 positions didn’t last very long. Many institutions and companies began to implement their own encodings. In 2010 there were a whopping 250 encodings in wide use, not counting some obscure or privately used ones. This situation proved disastrous when computers started to talk to one another over the Internet. If the sender didn’t specify the encoding of a message, there was a good chance the receiver would only get a stream of nonsense and rubbish.
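The effect of a wrongly guessed encoding can be reproduced in a few lines. The following sketch is only an illustration: the choice of UTF-8 for the sender and Latin-1 for the receiver is an assumption made for this example, not something the history above prescribes.

```python
message = "caf\u00E9"                 # "café", as typed by the sender

raw = message.encode("utf-8")         # the bytes that travel over the wire
wrong = raw.decode("latin-1")         # receiver guesses the wrong encoding
right = raw.decode("utf-8")           # receiver knows the actual encoding

print(wrong)  # cafÃ©
print(right)  # café
```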
Enter Unicode. Adobe and Xerox decided in 1984 that this situation could not continue and that there was a need for a universal encoding scheme. 1991 saw the publication of the first version of Unicode, with the international standardization as ISO 10646 following two years later. (Fun fact: ASCII is standardized in ISO 646; the number for the Unicode standard was deliberately chosen.) Meanwhile the Unicode Consortium began to form in order to guide the further development of the standard.
The most recent version of Unicode is 16.0.0, containing 155129 characters in over 100 different scripts. Its encoding form UTF-8, a superset of ASCII, is the most popular encoding worldwide, and the consortium counts Apple, Oracle, Microsoft, Google, IBM, Nokia and many others among its members.
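What “superset of ASCII” means in practice can be sketched briefly: ASCII text produces exactly the same bytes in UTF-8, while other characters take two or more bytes. The helper name utf8_hex is made up for this illustration.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))


# Pure ASCII text yields identical bytes in both encodings.
assert "Unicode".encode("ascii") == "Unicode".encode("utf-8")

# "A", "Ä" (U+00C4) and "㐭" (U+342D) need one, two and three bytes.
for ch in ("A", "\u00C4", "\u342D"):
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
```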
Unicode is a mechanism for universally identifying characters. Every character is assigned a “codepoint”, which universally refers to it. For example, the letter “A” has the codepoint 65 and the Chinese character “㐭” the codepoint 13357. Codepoints are usually represented in hexadecimal notation, where “A” to “F” represent the numbers 10 to 15.
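The two examples can be checked with a minimal sketch using Python’s built-in ord and chr. The function name describe is hypothetical and chosen only for this illustration.

```python
def describe(ch: str) -> str:
    """Show a character with its decimal codepoint and its U+ notation."""
    cp = ord(ch)  # codepoint as a plain integer
    return f"{ch} {cp} U+{cp:04X}"


print(describe("A"))   # A 65 U+0041
print(describe("㐭"))  # 㐭 13357 U+342D

# chr() goes the other way, from codepoint back to character.
assert chr(0x41) == "A"
```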
To bring the sheer mass of the 1,114,112 possible codepoints into a useful order, Unicode is divided into 17 planes, which are further divided into logically connected blocks. There are ten principles that guide the extension and care of the Unicode standard:
- Universal repertoire: Every writing system ever used shall be respected and represented in the standard.
- Efficiency: The documentation must be efficient and complete.
- Characters, not glyphs: Only characters, not glyphs, shall be encoded. In a nutshell, glyphs are the actual graphical representations, while characters are the more abstract concepts behind them. Glyphs change between typefaces, characters don’t.
- Semantics: Included characters must be well defined and distinguished from others.
- Plain Text: Characters in the standard are text and never mark-up or metacharacters.
- Logical order: In bidirectional text, characters are stored in logical order, not in the order that their visual representation suggests.
- Unification: Where different cultures or languages use the same character, it shall be included only once. This point is rather debatable, because in East Asia the boundaries where this rule should apply are not that clear.
- Dynamic composition: New characters can be composed of other, already standardized characters. For example, the character “Ä” can be composed of an “A” and a dieresis sign.
- Stability: Once defined, characters shall never be removed, nor shall their codepoints be reassigned. In the case of an error, a codepoint shall be deprecated.
- Convertibility: Every other used encoding shall be representable in terms of a Unicode encoding.
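The “dynamic composition” principle in the list above can be illustrated with Python’s standard unicodedata module. This is a sketch under one assumption: it uses the normalization forms NFC and NFD, which the list does not mention, to convert between the composed and the decomposed spelling of “Ä”.

```python
import unicodedata

composed = "\u00C4"      # "Ä" as one precomposed codepoint
decomposed = "A\u0308"   # "A" followed by COMBINING DIAERESIS

# The two strings look the same but consist of different codepoints.
assert composed != decomposed

# NFC composes where possible, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0041', 'U+0308']
```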
About Codepoints
This website is a private project coordinated by Manuel Strehl. It is not affiliated with or approved by the Unicode Consortium. You can contact me via:
Manuel Strehl
℅ Kinetiqa GmbH
Bischof-von-Henle-Str. 2a
93051 Regensburg, Germany
The Content on this Site
The content on this website reflects the information found in
The Unicode Consortium. The Unicode Standard, Version 16.0.0,
(Mountain View, CA: The Unicode Consortium, 2022. ISBN 978-1-936213-32-0)
www.unicode.org/versions/latest/,
which happens to be the most relevant version of the Unicode Standard
as of November, 2022.
If you find problems, inaccuracies, bugs or other issues with this site, please e-mail me or file a new bug at the bug tracker. The source code for this site is live on GitHub. If you like, fork the code, enhance it and send me a pull request. (If you don’t have a GitHub account, please send the git patch via e-mail.)
There is no warranty that the content on this site is accurate, complete or error-free! For normative references please refer to the Unicode website itself.
Re-use License
You may re-use all content on this site, given that you respect the following terms. The information regarding Unicode is licensed by the Unicode Consortium under the Unicode Terms of Use. The JavaScript part contains libraries under different licenses, mostly the GPL and/or the MIT license. See the page source for details. The graphical representations use glyphs from the following fonts:
- GNU Unifont, released mainly under the GNU General Public License, partly under a liberal re-use license
- Historic Fonts by George Douros, released free for re-use
- MPH 2B Damase, released under the GPL
- Deja Vu, released under the Bitstream Vera license
The images representing single Unicode blocks are taken from the font Unidings by George Douros, released under a permissive license. The quotes from Wikipedia are subject to the Creative Commons Attribution Share-alike license. Details can be obtained by following the respective link on each quote. The geographic localization of blocks (used in the “Find My Codepoint” wizard) is based on the categorization on decodeunicode.org, published under the CC BY NC license.
All code provided specifically for Codepoints.net is released under both the GPL and the MIT license, with the licensee free to choose. Content original to this site is released under the Creative Commons Attribution 3.0 Germany license. Attribution in this case is a simple backlink, optionally with the link text “Based on information from Codepoints.net”.
Privacy, Statistics
This site uses Matomo to gather statistics about page views. The sole purpose is to enhance this site. If you don’t want your visits to be tracked at all, please follow these instructions:
Attribution & Credits
First of all we’d like to thank the contributors of the Unicode Consortium, who work to standardize an essential part of computing, the display of characters. The same holds for the authors of Wikipedia, who gather knowledge about many parts of the lettering universe. Their share is an important part of this site.
The Polish translation is kindly provided by Janusz S. Bień, utilizing the terminology introduced in his paper “Standard Unicode 4.0. Wybrane pojęcia i terminy” and subsequent publications.
The developers supporting this site with their knowledge, bug reports and input take a fair share in keeping it awesome. We want to thank specifically the people contributing code:
Many thanks go to two sites with a similar goal but other emphasis in the presentation of the Unicode standard: Decode Unicode and Graphemica.
The WHATWG publishes an encoding standard that is used here for additional encoding information for codepoints. Its main editor is Anne van Kesteren.
The hosting is done on Uberspace, a fantastic provider with extremely helpful and flexible support.
The LaTeX names are derived from www.w3.org/Math/characters/unicode.xml, which is curated by David Carlisle and provided together with the MathML specification of the W3C.
Fonts
Many people base their work on Unicode. We want to thank the authors of these fonts for making it possible to re-use them in this project:
- Roman Czyborra, David Starner, Qianqian Fang, Changwoo Ryu and Paul Hardy for GNU Unifont
- George Douros
- Mark Williamson
- The Deja Vu Project
- Michael Everson for the Last Resort font
Image Attribution
The background image on the front page is released under the Creative Commons Attribution license by Flickr user Willi Heidelbach. The button backgrounds on the front page are in the public domain: map of Charlemagne’s empire, 18th century dowser, and NASA Mars Rover.
The “We’re Open Source” image is released under the Creative Commons Attribution Non-Commercial No-Derivations license by Flickr user tima.
The icons are part of the Font Awesome icon set.
Finally I’d like to thank Mathias Bynens for pushing me to publish this site at last.