Home Forums The XBRL API Lists of similar concepts

Viewing 46 reply threads
  • Author
    • #119805
      Tim Bui

      Hello, as companies have the freedom to create unique names of their tags (concepts), does XBRLUS keep a list of all similar concepts on the edgar_db database so that the users can save time in doing standardization?

      For example, to get a simple Total Revenue, so far I found 5 different concepts:

      1. Revenues
      2. SalesRevenueNet
      3. SalesRevenueGoodsNet
      4. TotalRevenuesAndOtherIncome
      5. RevenueFromContractWithCustomerIncludingAssessedTax

      To find Nonrecurring charges (such as impairment and restructuring), so far I found 13 concepts:

      1. Assetacquisitioncharge
      2. AssetImpairmentCharges
      3. ImpairmentOfLongLivedAssetsToBeDisposedOf
      4. RestructuringCostsAndAssetImpairmentCharges
      5. ImpairmentOfIntangibleAssetsIndefinitelivedExcludingGoodwill
      6. Impairmentandrestructuringexpenses
      7. RestructuringSettlementAndImpairmentProvisions
      8. GoodwillImpairmentLoss
      9. RestructuringCharges
      10. RestructuringCosts
      11. RestructuringChargesAndAcquisitionRelatedCosts
      12. GoodwillAndIntangibleAssetImpairment
      13. ImpairmentOfLongLivedAssetsHeldForUse

      The above 13 non-recurring items are exclusive of the additional 7 gain/loss concepts below:

      1. GainsLossesOnExtinguishmentOfDebt
      2. DerivativeGainLossOnDerivativeNet
      3. GainLossOnSaleOfBusiness
      4. GainLossOnSaleOfNonstrategicBusinessesAndAssets
      5. GainLossRelatedToLitigationSettlement
      6. GainLossOnDispositionOfAssets1
      7. GainLossOnDispositionOfIntangibleAssets

      Thank you!

    • #119925
      David Tauriello

      Tim – we don’t maintain lists like the ones posted here; you’ve done a great job to create a few groups for standardizing elements and tagged it so others can find it easily and post additional groups or additions to those listed above.

    • #119958
      Tim Bui

      For trend analysis, we need to look at historical numbers. However, some companies changed the tag names so it is very difficult to pull a consistent list of numbers using the API. If anyone has a solution, please share with me.

      For example, Microsoft (CIK: 0000789019) used 4 tags for its total revenue over the last few years:
      – For FY 6/18 (which has data for 6/18, 6/17,and 6/16) it used “revenuefromcontractwithcustomerexcludingassessedtax”
      – For FY 6/17, it used “SalesRevenueGoodsNet”
      – For FY 6/15, it used “SalesRevenueNet”
      – For FY 6/10, it used “Revenues”
      Using only one of the above tags will get only 3 years of data. If we use all 4 tags, then we have duplications such as the list below. Even with 4 tags, I am not able to get Total Rev for FY2018.

      entity.cik period.fiscal-year period.end concept.local-name fact.value
      0000789019 2009 2009-07-01 Revenues 58437000000
      0000789019 2010 2010-07-01 Revenues 62484000000
      0000789019 2009 2009-07-01 SalesRevenueNet 58437000000
      0000789019 2010 2010-07-01 SalesRevenueNet 62484000000
      0000789019 2011 2011-07-01 SalesRevenueNet 69943000000
      0000789019 2012 2012-07-01 SalesRevenueNet 73723000000
      0000789019 2013 2013-07-01 SalesRevenueNet 77849000000
      0000789019 2014 2014-07-01 SalesRevenueNet 86833000000
      0000789019 2014 2014-07-01 SalesRevenueGoodsNet 72948000000
      0000789019 2017 2017-07-01 SalesRevenueGoodsNet 57190000000
      0000789019 2017 2017-07-01 SalesRevenueNet 89950000000
      0000789019 2016 2016-07-01 SalesRevenueGoodsNet 61502000000
      0000789019 2016 2016-07-01 SalesRevenueNet 85320000000
      0000789019 2015 2015-07-01 SalesRevenueGoodsNet 75956000000
      0000789019 2015 2015-07-01 SalesRevenueNet 93580000000

      • #193159
        Peter W Reed

        Tim I have the same experience as you. I have found label mapping in the “XBRL TAXONOMY EXTENSION LABEL LINKBASE” file to be the best source for mapping common names with XBRL names. Below is an example from Microsoft’s latest 10-Q filing. It maps “us-gaap:NetIncomeLoss” to the common name “Net income”. The Python package “xmlschema” is the best library I’ve found to parse Edgar xml files. BeautifulSoup does a great job too. I prefer BeautifulSoup due to familiarity and the time it takes xmlSchema to execute (15 seconds) is too long.

        To get the history you mention parsing needs to be done on each individual filing, yuck. Even then additional data cleansing is required. I know a better way is out there.

        <link:label xml:lang=”en-US” xlink:label=”us-gaap_NetIncomeLoss_lbl” xlink:type=”resource” xlink:role=”http://www.xbrl.org/2003/role/totalLabel”>Net income</link:label>

      • #193173
        Tim Bui

        Thank you for sharing the info on label mapping, Peter! I am reading this topic now. I also need to learn BeautifulSoup so that I can parse the xml files.

      • #193175
        Peter W Reed

        I can upload snips from my program to github, if you think there is value. The code is sloppy and meant for personal use only. But it would be a good start with BeautifulSoup. The xmlschema library is very easy to use. I will include code snips for xml parsing too. My assumption is you are familiar with Python data structures like Pandas and default dictionary.

        Below is a sample output of a default dictionary using a list factory. This is from Walmart’s Q1 2016 10-Q filing. There are a total of 216 keys. Below is a small sample of us-gaap mapping to common names (list):

        ‘stockholdersequityincludingportionattributabletononcontrollinginterestabstract’: [‘Equity:’,
        “Stockholders’ Equity, Including Portion Attributable to Noncontrolling Interest [Abstract]”],
        ‘commonstockvalue’: [‘Common stock’,
        ‘Common Stock, Value, Issued’],
        ‘additionalpaidincapital’: [‘Capital in excess of par value’,
        ‘Additional Paid in Capital’],
        ‘retainedearningsaccumulateddeficit’: [‘Retained earnings’,
        ‘Retained Earnings (Accumulated Deficit)’],
        ‘stockholdersequity’: [“Total Walmart shareholders’ equity”,
        “Stockholders’ Equity Attributable to Parent”],
        ‘minorityinterest’: [‘Nonredeemable noncontrolling interest’,
        “Stockholders’ Equity Attributable to Noncontrolling Interest”],
        ‘liabilitiesandstockholdersequity’: [‘Total liabilities and equity’,
        ‘Liabilities and Equity’],
        ‘comprehensiveincomenotetextblock’: [‘Accumulated Other Comprehensive Income (Loss)’,
        ‘Comprehensive Income (Loss) Note [Text Block]’],
        ‘incomestatementabstract’: [‘Income Statement [Abstract]’],
        ‘revenuesabstract’: [‘Revenues:’, ‘Revenues [Abstract]’],
        ‘salesrevenuenet’: [‘Net sales’, ‘Revenue, Net’],

      • #193176
        Tim Bui

        Please upload your program to Github, Peter. I would learn a lot from it. I started to learn Python and Pandas and can get around Python comfortably. Would you also please send out your Github link? Thank you!

      • #193181
        Peter W Reed

        I placed the code snippets in a readme for a new project. Creating a .py file didn’t make sense since it wasn’t a full script or algorithm. Please find a link to the readme file below.

        XBRL_label matching.

      • #193182
        Tim Bui

        Thank you, Peter! I am studying the codes now

      • #193201
        Peter W Reed

        The xmlschema code snip will fail as written. It worked well for me because I had already defined the header in the BeautifulSoup program. I’ll edit the github to include the header definition. The SEC has an acceptable use policy that forbids (error code 403) the use of automated web scraping programs. The default header for request used in BeautifulSoup and urllib used by xmlschema identifies itself as a Python header. I believe they did not intent to prohibit programs using Python under user control like Arelle, or the programs I posted on github.

        Use the header definition provided below. Note there are Python modules that perform more sophisticated header rotation. Edgar doesn’t seem to need that level of spoofing to work.

        headers = {
        “Host”: “www.sec.gov”,
        “User-Agent”: “Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0”,
        “Accept”: “text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8”,
        “Accept-Language”: “en-US,en;q=0.5”,
        “Accept-Encoding”: “gzip, deflate, br”,
        “DNT”: “1”,
        “Connection”: “keep-alive”,
        “Upgrade-Insecure-Requests”: “1”,
        “Sec-GPC”: “1”,
        “Cache-Control”: “max-age=0”

        • This reply was modified 10 months, 2 weeks ago by Peter W Reed. Reason: typographical errors
      • #193203
        Peter W Reed

        Be careful with the xmlschema output. The xbrl parsing is correct, but printing the output is precarious. Multiple styles are used in creating the 10-Q/K. The output “xmlTmp_xbrl” is a dictionary but the keys and other data structure elements change year to year and company to company. The number of variations is small but a test for which variation is in use needs to be determined.

      • #201042
        Timur Mirzaev

        Hi Tim,

        Have you been able to resolve your issue with duplicate values once there is a tag change? I am faced with exactly the same issue.



      • #201043
        Tim Bui

        Hi Timur,
        I am still struggling to make the vast amount of data provided by the SEC be useful for my analysis. So far I have no solutions to standardize the financial items to make them comparable inter-industry (financial items have to be compared to others in the same industry) or overtime (when the tag names changed). Your question brings up an idea for me to ask the SEC staff if they provide a code to alert users if there are changes in the tag names or reporting format. So far, the SEC only provides a code if a company changes its name (Submission file, code ‘former’ and ‘changed’ in the Financial Statement Data Sets and Financial Statement and Notes Data Sets)



      • #201055
        Peter W Reed

        I hope this helps – In the “US_GAAP_Taxonomy_YYYY.xlsx” spreadsheet the tab ‘Elements’ has deprecation description. Columns ‘N’ and ‘O’ provide the information on a tags status.

        I gave up scraping the 10-K/Q for just the reason mentioned. The XBRL tag names change over time. So I went back to Yahoo! and Alpha Vantage.

        One reverse engineering activity I started doing was to map the Yahoo! element name to the XBRL tag name. The correlation id is the value. For example Alpha Vantage – Total Asset 10,000 and XBRL Asset 10,000. Good hunting!

      • #201057
        Tim Bui

        Thank you for pointing this out, Peter!
        As you pointed out col N and col O alerts the tags that were deprecated (referenced to col A which has the actual tag names that were deprecated).
        I think it will take some programming skills (way beyond my current ability) to match these deprecated tags and their years to the point-in-time data to line up the numbers properly.
        It is so frustrating to see there are so much data there but we individuals (at least those with low technical skills like me) can’t take advantage of them yet.

        Thanks again, Peter!

      • #201082
        Timur Mirzaev

        Hi, Tim and Peter!

        Thank you both for your replies. Could you please guide me to where I can get the Taxonomy xlsx file mentioned above? I went to XBRL.us Taxonomy page but the xlsx links (for 2020-2022 periods) were broken (“BlobServer.txt” was all I was getting).

        Thank you in advance!


      • #201086
        Peter W Reed

        Something on the SEC site changed, or I’m not remembering correctly. Below is the link I remember using to get the Excel files for taxonomies. All I’m seeing now are xml schema (.xsd) files.


      • #201089
        Tim Bui

        Timur, I tried to reply 3 times with the URLs but this text app probably does not allow URL.

        You can google “US GAAP Taxonomies 2022” and it will lead you to the FASB site to download the Excel sheet.

        You can also get the Excel sheet from this XBRL US website if you search for US Taxonomies

      • #201091
        Peter W Reed


        I didn’t have any success with the FASB website earlier today. That is where I remember getting it before, too.

        So I tried a different track and went to the IFRS site. The link below is for 2019. The information is presented in a different manner – it may even be better for your application. I tried using the “Link” function for this link. I hope it works,

        IFRS_landing page for 2019 taxonomy

    • #119959
      Tim Bui

      The trend in Gross Margin (GrossProfit / Total Revenue) says a lot about the business condition of a company. However not every company reports Gross Profit. For those who do not, we have to subtract Cost of Goods Sold from Total Revenue. I do not have a solution to get consistent numbers for Total Revenue yet, however, for Cost of Goods Sold, I have found a few different tags below:
      – CostOfGoodsAndServicesSold
      – CostOfRevenue
      – CostOfGoodsSoldExcludingDepreciationDepletionAndAmortization
      – CostsAndExpenses

    • #123764
      David Tauriello

      Tim – to get lists of concepts in the base (standard) US GAAP and IFRS Taxonomies, query: /dts/search?dts.taxonomy=US%20GAAP,IFRS&fields=dts.id,dts.taxonomy-name

      With each dts.id from above that you want concept information for, query like this:


      Every company references a base taxonomy in its filing as a starting point. The company filing is essentially a taxonomy that is inheriting from the base and adding concepts as necessary according to the company’s policies and practices with respect to its financial statements.

      NOTE: you could include concept.local-name with a comma-delimited list before the fields that are returned to get only specific concepts.

      If you pull an entire taxonomy, understand that there’s a great deal of information and a significant number of records involved, so this may take a while. Fortunately, these taxonomies don’t change – new releases annually – so the details only need to be pulled one time.

    • #124615
      Tim Bui

      Hi David,
      I finally got around to apply the codes that you described above. It’s just amazing how much information returned from the call.

      A couple questions please:
      1. Using =CONCATENATE(A1&”/concept/search?dts.id=257590&fields=concept.local-name.sort(ASC),concept.*,label.*”) only return about 2000 concept local names. How can I change the code so that I can get a complete list so that I can pick and choose what I need?

      2. I am still trying to find ways to shorten my code to get more factual data for more companies per call (to be under the length of the code that Google restricts). Can I use the concept.id in place of the concept.local-name in the call?

      As always, thank you for help!


    • #124655
      David Tauriello

      Use ENDPOINT.offset(integer) in the fields= portion of your query to get additional concepts. Something like concept.offset(2001) will show concepts after 2,000 (for Power User and Sole Practitioner Individuals, as well as all Organizational XBRL US Members). This offset works for all endpoints (fact, dts, etc.)

      See the documentation and this thread for more information: https://xbrl.us/forums/topic/how-to-get-a-sample-of-records-via-xbrl-api-for-evaluation/#post-116618.

      Again – because you’re querying for the base taxonomy concepts, the details you return will not change (these are published once), so your best/most effective approach will be to copy the details to a tab (or file) and use that as your reference.

      Like other “id” parameters, concept.id returns a unique integer that corresponds with concept.local-name so it can be used as a reliable substitute.

    • #124690
      Tim Bui

      Thank you, David. I was able to download 17,035 concepts. The next part is to understand these data and select which one to use.

      Thank you again for your help

    • #133617
      Tommy Carstensen

      Tim, did you ever manager to create a list of identical concepts? It’s a bit of a mess, when it’s not standardized. I’m surprised the SEC chose to allow companies to use random names as they see fit. It completely defeats the purpose of XBRL.

      • #133655
        Tim Bui

        Hi Tommy,

        I started on the standardization but have not finished it yet. At the beginning, I planned to use data from XBRL US but David Tauriello pointed me to the SEC website where I can get all of the data from all filers more efficiently (https://www.sec.gov/dera/data/financial-statement-data-sets.html). I am importing these data into a SQL Server to do standardization. I am learning the SEC data structure and think I have a way to do better standardization, but I need to test this method further. I would be happy to share my methodology with you if it works. In the meantime, if you want, I can give you what I have done so far, but it is incomplete. I use financial information for investing so I only standardize the items that I think relevant to my work.
        Yes, I agree that whichever entity (SEC, AICPA, CFA,…) that allows companies to name their tags (concepts) at-will really weaken the case for XBRL and disadvantage “smaller” financial data users like me. It’s illegal to fudge the numbers but accountants can name the numbers whatever they want and can change these names when it’s convenient–making peer comparison or historical comparison extremely difficult. Data providers such as Bloomberg, Factset, CapitalIQ, Thomson Reuters will be in business for a very long time.

      • #133680
        Tommy Carstensen

        Aye Tim, I think Morningstar and the rest of them will continue to thrive as long as the data is not standardised. I wanted to plot the data over time and across companies within an industry, but that turned out not to be trivial, because there is no requirement for the data to be homogenous over time and across industries. I hope the law and XBRL specifications are changed and data is homogenised going forward to enable small fintech companies to compete.

        Here my first attempt to plot data for General Mills over time:

        I shall be watching this thread to learn more about your attempts to standardise the data. I appreciate your efforts on behalf of the community. Thanks!

      • #139650
        D Q

        I too am interested in standardizing concepts for investing purposes.
        I don’t know SQL but DM me if I can help.

    • #139657
      Tim Bui

      Hello DQ, I am still working on standardizing the tags. I think I am on the right track, however this is not a trivial task. I was able do download 87MM (yes million) lines of data from the SEC. And now I am parsing them out using SQL Server. It is slow moving because I have to check and recheck to make sure the data match with the 10Ks and 10Qs. I am happy to give the results to whoever wants them because they are not proprietary data and I am too a beneficiary of communities like XBRL US (David Tauriello at XBRL US has spent a lot of time bringing me up to speed.)

      To use this massive amount of data, one will need to use some database for sorting and screening. Microsoft Excel or Access cannot handle this much of data. Maybe we all can put in a request to XBRL US to allow members to contribute to this standardization efforts by creating a depository area on XBRL US pgAdmin.

      Commercial data providers do provide their own standardization but depending one’s need, the standardization has to be customized somewhat. Sorry for this oxymoron word of customizing the standardization. But for example, companies reports several types of Account Receivables. There are 948 distinct tags on just AccRec. Most data providers have just 1 line for AccRec. I try to break them down to 4 subcategories: AccRec_Trade_Short_Term, Acc_Rec_Finance_Short_Term, Acc_Rec_Trade_Long_Term and Acc_Rec_Trade_Long_Term. Acc_Rec_Trades are the receivables from regular customers. Acc_Rec_Financing are the receivables from financing activities such as a promissory note coming due or GM financing the dealers’ floorplans. ST or LT determines whether they are in current assets or long term assets. The change of each of these subcategories provides different type of information to the financial readers.

      In the meantime, I would highly recommend you to check out the XBRL XL (https://xbrlxl.com/) website created by Jim Truscott. Jim also tries to do standardization. Jim had a demonstration of his Excel API hosted by XBRL US a few months back.

      Let’s hope we here from XBRL US on this matter.

      • #140245
        David Tauriello

        To use this massive amount of data, one will need to use some database for sorting and screening. Microsoft Excel or Access cannot handle this much of data. Maybe we all can put in a request to XBRL US to allow members to contribute to this standardization efforts by creating a depository area on XBRL US pgAdmin.

        I’ve raised this idea internally; another possibility might be creating some sort of common Google Sheet that holds these standardized terms (taking off from Peter Guldberg’s template for balance sheet)

    • #139658
      Tim Bui

      By the way, DQ, There is a company named Intrinio (https://intrinio.com/) that provides standardization on Excel. The prices seem to be very reasonable. I I tested their system out and found that their intereface is very to use. I think their data is suitable for most purposes. I do I own parsing raw data because I wanted to do further subsegments for my own use.

    • #162239
      Nathan Suderman

      I have been working in python to load all company XBRL filings into a SQL server. Now that I have the data available, I too am running into standardization issues with the tags(concepts). I don’t want to reinvent the wheel if there is already a mapping table that has been created.

      Has anyone already standardized/mapped the key tags?

    • #162295
      Tim Bui

      Hi Nathan, I too was able to upload the Financial Data Set into SQL Server, but even after spending a tremendous amount of time trying to standardize the tags, I have failed to come up with a workable solution. There are just way too many inconsistencies and variations in the way companies naming their tags. They name the tags to best describe their particular financial items at that moment in time and change the tag if their situations change. It is difficult to find consistency for the same company over time. It becomes more difficult when comparing tags of companies within the same economic sector or across sectors.

      Being an investor, I try to standardize the tags to get the granularity for financial analysis. Not sure what is your goal for standardization, but if you wanted, we can collaborate and share ideas on this issue. Tim

      • #179790
        Chinmay Laddha

        Hello Tim and Nathan, I am working on a project where anyone can populate the balance sheet, Income Statement, Cash Flow statement and some financial ratios for 6-7 years data into a google sheet in format/template and parameters of my choice. A python program where I could create a GUI, a user can select the company, parameters he/she want to see, then after selecting these parameters will be visible on the google sheet

      • #179795
        D Q

        Hi Chinmay, how do you solve the problem of standardizing the different tags?

    • #179792
      Tim Bui

      Hi Chinmay, I would love to see your project. While I have just begun to study Python so I am not sure if I can help in coding, but I do understand finance and financial analysis, so if I can assist in anyway, I would want to do that. Thank you!

    • #179799
      Tim Bui

      Hi DQ, if you have your method to standardize, maybe you can share with us. Here is how I do it, but I am not there yet.
      I download the 2020 US GAAP Taxonomy (https://xbrl.us/xbrl-taxonomy/2020-us-gaap/) using the xls format. The tab “Presentation” inside this spreadsheet lists out all of the US GAAP tags (“name” column) and their sources (“definition” column) in hierarchical order as listed by the numbers on columns “depth” and “order”. I exact similar tags and group them the way I want to see them on a the financial statements. However, since there are so many extensions in addition to the standard tags, I am still missing a lot of tags. I do all of the grouping in SQL Server.

      I get additional extensions by using the data downloaded from XBRL.US xsheet. David Tauriello said there is no source to get all of the extensions in one place as they are company specific.

      For grouping, I use names like ca_101_cash, ca_102_mrkt_sec, ca_103_restricted_cash,….

      So far I get about 26 tags for the cash group.

      • #193166
        Mikko Olkkonen

        I am developing my own special purpose standardization for few tags that are relevant for my specific use case. I uploaded my mapping into my github repo. I am interested in similar mappings that I could add into my version.

      • #193851
        Peter W Reed

        I searched github “olkkonen in:name” for your repository. There were no public repositories for me to view.

        I decided to have another go at the Excel downloads for 10-K and 10-Q. The approach to obtaining commonality/comparability between filings over time and between companies is context. Context being the sheet and sub-sections within sheets. I’ll let you know if I have success.

        In the meantime can you invite me into your private github? Or did I just not find the correct repo? Thank you.

      • #193861
        Mikko Olkkonen

        Peter, My repo should be public at

    • #179800
      Tim Bui

      Sorry for the typo on my note above. It’s not XBRL.US xSheet. It’s XBRLXL xSheet (https://xbrlxl.com/)

    • #186667
      Husein Kirefu

      I read through the xbrl.org site, and was left with the impression they had created, in junction with FASB & SEC, a standardized way of taging lines of financial data with what they call “concept”. However I could not find a xblr based General Ledger giving these “concept”.

      After reading this thread, am I correct in concluding there is no standard IFRS/GAAP/ect standard for pulling data across periods in a company and across companies?

      If so, what value is there in xblr for an investor if a company can change/create “concept” to reclassify a ledger item requiring a manual audit of the data for integrity?

      • #186684
        David Tauriello

        Hi Husein – thanks for taking a look at the information on our site and at XBRL International. It’s not clear from your post what specific data you’re looking for – our Public Filings Database is designed to return all data filed under the SEC’s requirement for public companies since 2009, as well as the base taxonomies published by the FASB and the IFRS. You can use the XBRL API to get started with the data (https://xbrl.us/xbrl-api-community) – we’ve posted documentation and several templates and tools that can help familiarize you with the data set. We also link to taxonomy viewers for the base taxonomies – see the links on the right side of https://xbrl.us/2020-us-gaap.

        The extensibility of the business reporting standard for reporting is one of its great strengths. The US SEC’s implementation for public company financials is an ‘open reporting’ environment (see the glossary of our Taxonomy Development Handbook – https://xbrl.us/tdh). In adopting the use of XBRL for public companies, the US SEC acknowledged its responsibility to ‘limit the use of extensions to circumstances where the appropriate financial statement element does not exist in the standard list of tags’ (https://www.federalregister.gov/d/E9-2334/p-520).

        We continue to work with both the FASB and the SEC to recommend taxonomy modelling for the US GAAP accounting standard and identify issues related to filing patterns. Limiting the use of extensions has been an on-going part of our discussions.

    • #186668
      Tim Bui

      Hi Husein, XBRL has done a tremendous job in bringing financial reporting data into the 21st century. However, from my perspective, this data reporting mechanism is helpful only to the filers and to the data vendors at this present time. Unless one knows a lot about programming, individual investors like me still do not have an easy way to use this information. Companies have so much leeway in using tags and extensions to describe their particular situations. Without standardization, it is impossible to properly do trend or comparative analysis.

    • #191253
      Tim Bui

      Hi Mikko, I am not a real programmer but here is how I get the Fin Data Set into my Postgresql (I do not use MySQL).
      – Within each quarterly release, there is a readme.html file that lists the fields of each of the 4 txt files. I create 4 tables in my Postgresql with these fields using the required datatypes. I use the fields of each table as listed on Section 3 (Organization) as primary keys.
      – I then use Python to clean up the data of each of the txt file and save as csv files with ‘~’ as delimiter (not comma because some fields have commas).
      – Finally I use the Copy command in Postgresql to import the csv files into Postgres.

    • #192501
      Matthew Beveridge

      Tim, Mikko, and others — I have a primitive (so far) python package I am working on to simplify manipulating this data by hosting the dera financials database remotely and standardizing query results into pandas dataframes. If you are interested in collaborating on this, don’t hesitate to reach out. I plan for an alpha release in the next few weeks.

      Functionality will be along the lines of:

      # import the package
      from mypackage import fundamentals
      # get the data
      form = fundamentals.ten_k('aapl', [2018, 2019])
      # manipulate the data
      debt_cap = form.debt_capitalization()
      margin = form.gross_margin()
      roe = form.return_on_equity()

      and so on. Results will look like:

      fy 2018 2019
      tag uom
      AccountsPayableCurrent USD 5.588800e+10 4.623600e+10
      AccountsReceivableNetCurrent USD 2.318600e+10 2.292600e+10
      AccruedIncomeTaxesNoncurrent USD 3.358900e+10 2.954500e+10
      AccumulatedDepreciationDepletionAndAmortization… USD 4.909900e+10 5.857900e+10
      AccumulatedOtherComprehensiveIncomeLossNetOfTax USD -3.454000e+09 -5.840000e+08
      … … …
      UnrecordedUnconditionalPurchaseObligationBalanc… USD 9.328000e+09 8.211000e+09
      UnrecordedUnconditionalPurchaseObligationDueAft… USD 6.600000e+07 1.100000e+08
      WeightedAverageNumberDilutedSharesOutstandingAd… shares 4.473200e+07 3.107900e+07
      WeightedAverageNumberOfDilutedSharesOutstanding shares 5.000109e+09 4.648913e+09
      WeightedAverageNumberOfSharesOutstandingBasic shares 4.955377e+09 4.617834e+09


      [251 rows x 2 columns]
      fy 2018 2019
      tag uom
      DebtCapitalization ratio 0.707029 0.732692

      • This reply was modified 11 months, 3 weeks ago by Matthew Beveridge. Reason: code formatting
      • This reply was modified 11 months, 3 weeks ago by Matthew Beveridge. Reason: code formatting pt 2
      • This reply was modified 11 months, 3 weeks ago by Matthew Beveridge. Reason: include example results
      • #192521
        Mikko Olkkonen

        Hi Tim, Matthew and others,
        I am now reading dera data (and European style ESEF XBRL) to mongoDB databases (i.e. I abandoned SQL). I have developed server functionality for querying the data with somebody. Our server code is mainly typescript, javascript/node.js. I guess that we have produced samewhat similar solution as Matthew has. Loose standardisation/use of the taxonomy/tags is our key problem even if we only need limited number of key items including ProfitLoss, Revenue, Cash, …
        Tim, Matthew: where do you collaborate? In Github?

      • #192528
        Matthew Beveridge

        Tim and Mikko, I generally use github. I can share the repo with you all if you PM me (it’s currently private). Otherwise, I’ll get things in order to make it public and post the link within the next week or so.

      • #193495
        Peter Miller

        Hi Maththew,

        may I ask what is your current state of the Python package? Did you upload it on GitHub?

        Kind Regards

    • #192505
      Tim Bui

      Hi Matthew, I would love to pitch in and help in anyway I can. After spending months on it, I almost gave up on the project because I couldn’t think through the proper way to standardize the financial items. I am a new student of Python, so hopefully I can learn from your Python program. Please let me know what steps I can do to help.

    • #192522
      Tim Bui

      Hi Mikko, thank you for reaching out. I would love to see your program. ​I created this github account.
      Is this what you are asking?

      Thank you!

      • #192541
        Mikko Olkkonen

        Tim, I am now following you on github. I have uploaded to my github account a bash script (deramongo.sh) I use for creating/maintaining my mongodb dera database. I have not yet uploaded the node.js server code (the node.js server offers http API for accessing the financials data stored in the mongoDB database).
        However, some version of server may be up and running at:
        vps-09403655.vps.ovh.net:8198/list returns company names and cik codes
        vps-09403655.vps.ovh.net:8198/firm?cik=1503518 returns financials data corresponding to a cik
        and so forth

    • #192544
      Tim Bui

      Thank you, Mikko! I just uploaded the files into my Postgres to check it out. I am also reading your DERA Financials data on Github right now. Thanks again

    • #193473
      Peter Miller

      Hello guys,

      I am a Python enthusiast and was trying in the past to clean XBRL data from the official SEC database. What is the current state? I can’t follow the SQL code on Tim’s GitHub. I would appreciate a short explanation.

      Kind Regards

    • #193475
      Tim Bui

      Could you elaborate what you are looking for or trying do to Peter?

      • #193476
        Peter Miller

        Thank you, Tim, for your replay.

        I am trying to clean/standardize the financial statements available on the SEC website. To my knowledge, these financial statements are in XBRL format and as the original question pointed out (to my understanding), the financial statements are not standardized and for several data points there are different key words (Revenues;SalesRevenueNet and so on).

        Have you guys found a way to standardize the data (with Python) to make it more useful?

      • #193501
        Peter W Reed

        I’m using the “XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT” file that is part of the Edgar filing. It has the mapping from gaap tag to common name. The following comes from an Apple 10-Q filing. I use the Python module xmlschema to create a dictionary. The real work comes later cleaning the results for usability and extending this back five years.

        us-gaap_SalesRevenueNet_lbl = Revenue, Net
        us-gaap_SalesRevenueNet_lbl = Net sales
        us-gaap_Liabilities_lbl = Liabilities
        us-gaap_Liabilities_lbl = Total liabilities
        us-gaap_LiabilitiesAndStockholdersEquity_lbl = Liabilities and Equity
        us-gaap_LiabilitiesAndStockholdersEquity_lbl = Total liabilities and shareholders’ equity
        us-gaap_LiabilitiesAndStockholdersEquityAbstract_lbl = LIABILITIES AND SHAREHOLDERS’ EQUITY:
        us-gaap_LiabilitiesCurrent_lbl = Liabilities, Current
        us-gaap_LiabilitiesCurrent_lbl = Total current liabilities

      • #193502
        Peter Miller


        thanks a lot.

        Is it from your experience possible, that the same “us gaap tag” is assigned to different “common names” when it comes to different companies in different time periods?

      • #193504
        Peter W Reed

        Yes, even with the same company in different fiscal reporting periods for a given year. I was hoping doing the 10-K mappings would be sufficient for that year. It didn’t work that way for me.

        I’m still hoping the XBRL taxonomy files for a given year can provide a universal solution. I don’t have a background in accounting/finance, which may be required to solve this.

    • #193482
      Tim Bui

      Peter, I use 2 steps to do standardization, but I am no where close to the end.
      1. I use the US GAAP Classified Financial Statements created by Mr. Charles Hoffman (The Father of XBRL) to create standardized buckets.
      2. I get all of the tags from the SEC Financial Statement and Notes Data Sets and manually group the extensions (non GAAP standards) into the above buckets.
      XBRL US reply feature does not seems to allow me to paste the website URL, but you can send me a message on Linkedin and I can send you more info.

      • #193503
        Peter W Reed

        Thanks for the tip. I read everything I could find by Mr. Charles Hoffman. The “problem” I have with his terminology tables is they stopped being updated around 2015 (or earlier). Also, using the “XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT” gives me the 10-Q/K mapping for the specific 10-Q/K for the company of interest.

        I got the xmlschema library working well. The software is in a pseudo code state – i.e. a hack. When I get it cleaned up and universal I’ll put it on github.

        By “Universal” I mean there are at least two variations on the Python dictionary keys used in the Edgar LABEL LINKBASE files.

    • #193483
      Peter W Reed

      There are a couple of contributors like Peter Guldberg who have accounting concept to GAAP name mapping. What I’m creating is accounting concept to GAAP name on a per company basis for their last five years of filings. I’ve found that concepts like revenue can have a different GAAP tag name over time for any given company. This entails creating a concept mapping to multiple GAAP tags. The link label file in the Edgar filing provides all the concept-to-gaap mappings for that company’s 10-Q/K.

      • #193484
        Peter Miller

        Hello Peter,

        < The link label file in the Edgar filing provides all the concept-to-gaap mappings for that company’s 10-Q/K.>

        where can i find the “link label file” on the Edgar website?

        Kind Regards

      • #193506
        Peter W Reed

        I use the Edgar search page to get to the list of a company’s 10-Q/K. Hope you recognize this task. On that page is a link to the company’s filings by period. If you click that link you get to the detailed Edgar filing. Below is an example for Apple. That page has the complete AAPL filing for that period. You’ll see the Linkbase file on that website page.


      • #193507
        Tim Bui

        Peter, The Linkbase file sounds so promising. I use the new Edgar filing page all the time, but I can’t find the Linkbase file any where. Would you please give more detail steps on how to get this file? Thanks!

      • #193513
        Peter W Reed

        It appears the Edgar website has methods in place to hinder automated programs. Also, I just tried to paste an image into my response. It failed.

        The best I can do is provide a url to Apple’s Edgar first page. Open up the 10-K hyperlink. You’ll see a “View all…” button, click there. Finally, click on an individual “Filing” button and you are there. I created a text cheat sheet that makes the clicks easier/faster. The files in the bottom half of the final page contains the linkbase, schema (WDSL), and the xml file version of the 10-K/Q. The XBRL! file is there too (click on it if you’ve not seen this before).


      • #193512
        Peter W Reed

        I just discovered the link I pasted in my response gets you to the Edgar page for 10-K/Q. Sorry, it is most likely done on purpose to foil bots. Use the link and then drill down to the 10-K/Q list. From that page you’ll click on individual filings. “Filing” is the name of the button to click.

        https://www.sec.gov/edgar/browse/?CIK=320193&owner=exclude Edgar landing page for AAPL

      • #193515
        Tim Bui


      • #193516
        Peter W Reed

        import pandas as pd
        import xmlschema
        import requests

        r = requests.get(xml_url,headers=headers)
        xs = xmlschema.XMLSchema11(c)
        xmlTmp_xbrl = xs.to_dict(d)

        The Python module I’ve had the best success with is “xmlschema”. The Schema (.xsd) file “c” is the r.text from browsing to that file. “d” is the r.text for the Linkbase, or XML Extract file. The result is a hierarchical dictionary.

        I had BeautifulSoup working a few months back, but it was messy due to all of the XBRL variations I found. The dictionary is easier/cleaner to use.

      • #193517
        Tim Bui

        Thanks for sending the Python instruction, Peter. It’s wonderful. This will help me a lot.

      • #193531
        Peter W Reed

        Glad it was helpful. Below is a cut and paste from a DataFrame I created using BeautifulSoup. It didn’t transfer well and not all columns are shown. The index is the gaap tag “us-gaap:treasurystockacquiredaveragecostpershare”. For this tag there are two common names used in the COST 10-Q/K from 2016 to 2021.

        There is a total of 140 rows in the DataFrame for COST. This is the intersection of all common names used throughout the study period, 2016-2021. My intention is to make each record complete and self-describing (eleven columns).

        Index Value Duration EndDate Ticker Type Period CmmNam0 CmnNam1
        acquiredaveragecostpershare 151 D 2016-11-20 COST 10Q Q1 Average price per share Treasury Stock Acquired, Average Cost Per Share

      • #193537
        Peter Miller

        Hello Peter,

        From my understanding, the created dictionary shows the US-GAAP tag but not the common name (Verbose Label). To link these, it would be necessary to have both.

        Kind Regards

      • #193539
        Peter W Reed

        The code below extracts the common name and the associated gaap tag name. The dictionary is “xmlTmp_xbrl” from the linkbase file. I drop two record types I’m not interested in. By creating two lists I’m able to easily create two dictionaries. One with gaap tag as the index (df2) and the second with the common name as the index (df1).

        Note, this code is for a COST 10-Q filing. The keys could be different for different stocks.

        gaapTag = []
        commonName = []
        for one in xmlTmp_xbrl[‘labelLink’][0][‘label’]:
        if re.search(“us-gaap_”, one[‘@xlink:label’]):
        if one[‘$’].endswith(“]”):
        elif one[‘@xlink:label’].endswith(“Abstract_lbl”) or one[‘@xlink:label’].endswith(“TextBlock_lbl”):
        df1 = pd.DataFrame(gaapTag, index=commonName, columns=[‘GAAP_Tag’])
        df2 = pd.DataFrame(commonName, index=gaapTag, columns=[‘Common_Name’])

      • #193542
        Peter W Reed

        I hope my code snip is clear (the dictionary is for AAPL, not COST). Let me know if you have any questions about the code?

        I went back to the standards document – “http://www.xbrl.org/specification/xbrl-2.1/rec-2003-12-31/xbrl-2.1-rec-2003-12-31+corrected-errata-2013-02-20.html&#8221;. Section addresses the labels. The first paragraph for that section is cut and pasted below:

        “Although each taxonomy defines a single set of elements representing a set of business reporting Concepts, the human-readable XBRL documentation for those concepts, including labels (strings used as human-readable names for each concept) and other explanatory documentation, is contained in a resource element in the label Linkbase. The resource uses the @xml:lang attribute to specify the language used (via the XML standard lang attribute) and an optional classification of the purpose of the documentation (via a role attribute).”

      • #193556
        Tim Bui

        I am grateful for your sending the codes, Peter! I am out of town, but will study it when I get back. Thanks again!

      • #193567
        Peter W Reed

        Be forewarned the snip I provided was only tested for “AAPL” 2014. I’m now testing for “AAPL” 2015 I found the labels have changed. I provide below the salient change that needs to be made to the program. I ran into this when I was using BeautifulSoup. I found the number of variations is small but it caused my program to look like spaghetti. This is why I moved over to xmlschema and a well formed dictionary.

        2014: xmlTmp_xbrl[‘labelLink’][0][‘label’]
        2015: xmlTmp_xbrl[‘link:labelLink’][0][‘link:label’]

    • #193485
      Tim Bui

      Peter, Every book has an ISBN number, but financial tags are in text form. Not having numbers and the names of tags changing over time and not comparable across companies making standardization very difficult–at least for people who lack programming skills like me.

      Peter Gulberg is great. Another person who might be able to help you is Jim Truscott (xbrlxl.com). Both Peter Gulberg and Jim Truscott helped me a lot on standardization. I recommend you to check out Jim’s XBRLXL website on this topic.

      On standardization, the issue for me is how detail one wants to get. For example, one might look at Total Account Receivables, but I want to break AR further down to AccRec-Trade (from customers), Acc_Rec_NonTrade (from other sources such as tax refund), Acc_Rec_Finance_Oper.

    • #193505
      Peter W Reed

      Thanks for the tip on Jim Truscott. I started writing some scripts to populate Peter Gulberg spread sheets. But quickly saw I’d run into the same common name, XBRL tag name problem. I think I have resolved my LinkBase scripts to create. I’m anxious to begin constructing Machine Learning programs which is my primary goal.

      Have you looked at the excel sheet in the XBRL taxonomy Excel workbook calling out depth for each gaap tag? I hope to use that sheet to ensure features (gaap tags) are independent in a financial sense of the word.

      • #193597
        Mikko Olkkonen

        Peter, Is the goal of your machine learning program to find tags corresponding to certain higher level concept such as Revenues or Free Cash Flow. That is basically what I am trying to do as follows:
        1) I am turning financial statement data (https://www.sec.gov/dera/data/financial-statement-data-sets.html) into a local mongo database . For this I am using my deramongo.sh script that can be found in my github.
        2) I output all tag-commonname pairs from my local mongo database by using my script mapping.js (sample output map3.txt of about one million tag-common name pairs is on my github)
        3) from that map3.txt I can query any tag. My algorithm searches all common names corresponding to that given tag. Next it searches for new tags corresponding to each of the common names found in the previous stage. Next new common names are searched for all found tags etc. For example, if I query Revenues tag, this “fuzzy logic” algorithm converges to synonym tags like Revenue, SalesRevenueNet, SalesRevenueGoodsNet, RevenueFromSaleOfGoods and many others some of which are unfortunately clearly of poor quality i.e. not really synonyms for Revenues.
        Anyway, I think that this way I can produce adequate buckets for some tens of concepts that are of interest to me. I am very interested in hearing if your machine learning program can do something like this. Maybe my (almost complete) inventory of tag-common name pairs (map3.txt) is relevant input to your machine learning program?

    • #193606
      Peter W Reed

      Your approach is more sophisticated than mine. I’ll look at your github this weekend. On the Edgar page for a company’s 10-Q/K filing there is a Schema file (.xsd) and five xml files. I use the Python module ‘xmlschema’ to parse the linkbase file for common name to gaap tag mapping. The result is a dictionary with about 500 records, most of little interest that need to be cleaned. Below is one record of 500 in the dictionary. I use tuples for file storage.
      (‘total net sales’, ‘lab_us_gaap_revenuefromcontractwithcustomerexcludingassessedtax’)

      My ML algorithm interest isn’t in the creation of the mapping. InsteadI’ve been using Clustering to analyze large groups of stocks, like the EFT QQQ. I need a method to gather a large number of stocks’ common attributes (“Net Sales”) over a long period. The investment strategy I use and want to improve upon is selecting the “best” twelve stocks that make up an ETF like QQQ.

    • #193908
      Peter Miller

      Hi guys,

      since our goal is to create standardized financial statements, I wonder if it wouldn’t be advantageous to use the “Bulk data” zip-file which can be downloaded on the SEC website. It might contain in json format all the financial data which is available for all companies.

      Isn’t it easier to map these common names to standardized names instead of linking the xbrl tags to standardized names? What are your thoughts?

    • #193918
      Peter W Reed

      I don’t understand the distinction between Common names and standardized names. I use the “XBRL TAXONOMY EXTENSION LABEL LINKBASE DOCUMENT” that comes with each 10-Q/K submittals. A reason I use this xml file is due to custom labels. For Example, “lab_us-gaap_ComprehensiveIncomeNetOfTax” is mapped to “COMPREHENSIVE INCOME ATTRIBUTABLE TO COSTCO” in their latest 10-K filing.

      The work Mikko Olkkonen is doing on github looks promising for standardization. Note: Mikko there is a broken url in your readme. I’ll submit a comment on github later.

      What I’ve found with the bulk data and API material is timeliness. Its seems about a quarter out of date.

      Back to your question on standardization of common name to gaap tag. Below is a snip from and API call I submitted. The mapping is here, but it is the same mapping found in the LABEL LINKBASE DOCUMENT file.

      {“cik”:909832,”taxonomy”:”us-gaap”,”tag”:”GrossProfit”,”label”:”Gross Profit”,”description”:”Aggregate revenue less cost of goods and services sold or operating expenses directly attributable to the revenue generation activity.”

      • #194068
        Tim Bui

        Hello Peter Reed and Peter Miller, I am not a programmer (I am learning Python and XML so I can understand Peter Reed’s codes) so I don’t really use the JSON files that the SEC provided in its CompanyFacts zip folder. The way I understand it, we have to define or create our own standardization, depending on the level of detail we want. The SEC only provides the “common name” as shown on the financial statements. Due to my limited programming skills, rather than using the JSON files, currently I download the monthly Financial Statement and Notes Data Set (https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html) and import those files into Postgresql. However, eventually it seems that I need to learn JSON so I can use those daily JSON files to update data.

    • #194070
      Peter W Reed

      My experience is that any json downloads like the one in your posting can be treated like a dictionary. I looked at the page you provided and that’s what I saw was a hierarchical Python dictionary.

      I apologize for this next statement – “please don’t take any of the code I’ve provided as anything more than pseudo-code. It is not a final product, and I may choose to drop that line of inquiry and go in a different direction.”

      I hit a real dilemma yesterday. I found that knowledge of context (“contextRef”) is required. A 10-K I was parsing I associated common name with a us-gaap tag name. The problem was I found that common name to tag pairing occurred twice for two entries. This can be seen in the Edgar supplied excel file. A human reader can understand the double entries but not a simple software program. The “contextRef” in the XBRL file shows the second pair is part of a segment of the 10-K. The first pair is in the main body (not a segment) of the filing.

      • #194071
        Tim Bui

        Hello Peter Reed (sorry I have to use your last name as well because there are 2 Peters on this discussion),

        Could you please elaborate the point ” I found that knowledge of context (“contextRef”) is required”? I am not familiar with “contextRef”. Are you thinking about the debit/credit notation of each of the tag?

      • #194308
        Peter W Reed

        I believe I found the solution. There are tables for the 10-K/Q reports on Edgar that are not immediately seen using a browser. Below is my description for retrieving the reports. I assume knowledge of web scraping via BeautifulSoup, request.urlopen(), and regex (re). Paste the URL into your browser to see the table. You can use the “inspect” function of the browser to see the parameters to scrape: Here is a summary –

        URL for reports:
        cik: 0000909832 # With leading zeros removed, e.g. 909832
        SEC Accession No.: 0000909832-18-000022
        accessionNoHyphen: 000090983218000022
        Report Number- rptNum: “R{1-9+}”

        Scraping this report will yield: (1) Common Name, (2) XBRL tag name, (3) Numerical value for the common name entry, (4) Credit type (debt/credit, (5) Period type – duration or instant, (6) Definition for the XBRL tag.

        When I great solid code to share I’ll put it on github. This may take a few weeks.

        • This reply was modified 9 months ago by Peter W Reed. Reason: URL too long for a single line
      • #194325
        Peter Miller

        Hello Peter Reed,

        First, I would like to thank you, that you haven’t given up yet. I was investing my last two weeks of my holidays into the bulk data, which contain the .json files for all the companies.

        Before I comment on your last posting, I would like to share some possible problems on the bulk data/.json files:
        The .json file seem to be incomplete sometimes, when multiple sub-terms are summed up to one term. For instance, in the last PG 10-K Filing the sub-terms “Buildings”, “Machinery and equipment”, “Land” and “Construction in progress” are having to by current knowledge all the US-GAAP tag “us-gaap:PropertyPlantAndEquipmentGross”. This US-GAAP tag is also used for the summary for these sub-terms named “TOTAL PROPERTY, PLANT AND EQUIPMENT”. In the final .json file I could only find the summed up value, consequently under the US-GAAP tag “PropertyPlantAndEquipmentGross”. The values of the sub-terms are not provided, probably because they share the US-GAAP tag.
        It might be furthermore a problem, that the filling classification (10-K, 10-Q) is sometimes not correct. I have seen data points, where between the start and end date was only 3 Months, but the filling was classified as a 10-K filing. That was a problem when no start date was reported and only an end date. I solved this issue, partly, by iterating through every US-GAAP tag and searching for filings where a start and end date was provided. When the time delta between the start and end was than >335 days and smaller than 395 days, then the month of the end date was considered as the fiscal year end of that specific year. But as you can imagine, some companies didn’t report a start and end date at any of the us-gaap tags in the .json file, and consequently no fiscal year-end month could be calculated.

        All in all, I am extremely disappointed that some sub-terms, which are used to calculate a data point, are not reported in the .json files. If someone has a solution for that, please share.

        Regarding your last posting:
        The tables at the links you have provided are not providing any information in which entity the numbers are displayed. So, for instance, in some tables the numbers are provided in millions and in others in thousands.
        These links aren’t available for all the fillings in the past, which are in the .json files.

        In pandas 1.30 the read_xml function was included. I have no experience with .xml files, but maybe this function can be used to extract us-gaap-tag, common name and entity of all numbers included in the official XBRL file.

      • #194331
        Peter W Reed

        Thanks Peter for your observations. Your frustration with consistency is why I dropped getting data from Yahoo and other such sites. I thought if I went to the source I’d be better off. I also tried the two Python modules for Yahoo financials, but found they sometimes displayed bad information too.

        Couple of items:
        – Yes, the SEC reports I pointed to have a problem. The problem is they display “Segment” information as well as the desired “not segmented” information. The way I’ve dropped that information is naïve keeping only the first instance. The XBRL “htm.xml” file identifies segmented and not segment data. I hope we don’t have to use that file too, thanks for the warning.
        -I’ve had the best success with xml parsing by using the xmlschema module. Using the .xsd file as the schema definition file.
        – I’ve always assumed that shares are in 000’s and money is in 000000’s. In the reports I mentioned in my last post the “Data Type” for each tag is defined with “xbrli:SharesItemType” or “xbrli:monetaryItemType”. I was thinking that would identify the units too. I’ll look out for the exceptions.
        – I’ve only been using a small subset of companies to test with so I haven’t seen the odd behaviors you mention. Mostly I’m only interested in % change. I’m hoping those oddities smooth out by normalizing with %.

      • #194878
        Peter W Reed

        Apologize my last reply to you was in error. I said that I hadn’t seen a change in units for money or shares. The excel based html I’m using from the edgar site indeed has such unit changes. The good news is the units are documented in by the index column. Here are the two variations I have run across – (1) “Condensed Consolidated Balance Sheets – USD ($) $ in Millions”, and (2) “Condensed Consolidated Balance Sheets – USD ($)”. My logic needs to use the absence of a “millions” unit to mean there will be six zeros that need to be stripped. The units for shares follows the same logic.

        It has taken me a long time to get the excel-based Edgar data to work well. A lot of data cleansing was required. Here is a sample page –

        url =”https://sec.gov/Archives/edgar/data/0000909832/000090983221000008/R{x}.htm”
        where {x} takes on values from 1 to 8.

      • #194679
        Peter W Reed

        I decided to drop the parsing of the XBRL file and instead concentrate upon the scraping of the reports I mentioned in another post. To answer your question now I want to show you the way the context is shown in those reports. Below is a table (DataFrame) that comes from the report. One table for each XBRL element within the report. The 4th row is the same information as in the “contextRef”. This attribute isn’t needed for parsing the report. The value of the PeriodType is in processing. The 10-K report’s ‘duration’ parameters are a summation of all four quarters. If you want to know the 4th quarter value for that parameter, subtract the first three quarters. The 10-K ‘instant’ parameters are a snapshot of the company as of the end date, no processing should be made. Hope this helps!

        0 1
        0 Name: us-gaap_CommonStockDividendsPerShareDeclared
        1 Namespace Prefix: us-gaap_
        2 Data Type: num:perShareItemType
        3 Balance Type: na
        4 Period Type: duration

    • #194074
      Peter W Reed

      I thought about your json task. I haven’t tried this command before but I like Pandas as a data structure – pandas.io.json.read_json. This command should load your file into a DataFrame. It is a large file. I don’t know if that will be an issue for you.

      On contextRef – In 10-k/Q filings on Edgar there is included a file titled “Extracted XBRL instance document”. Below is a link to a Costco 10-K XBRL file. Below that link is an extract from the XBRL file showing the Document Type (“10-K”). You can see the attributes including the contextRef element. There is a second contextRef in the file, see the second example. The difference between the two is identified by the timeframe – ‘D20190902-20200830’ and ‘I20200929’. The duration is for gaap tags like net sales which accumulates over time 09/02/2019 to 08/30/2020. The instant gaap tag is for attributes like “Assets”. It indicates a snapshot in time of the company’s asset, etc..


      <dei:DocumentType contextRef=”i66090f23f2724bb78bc8f6b7c7081361_D20190902-20200830″ id=”id3VybDovL2R…-0507782e341b”>10-K</dei:DocumentType>

      <dei:EntityCommonStockSharesOutstanding contextRef=”i270a3479d6ec49b29cb9fa2ac0b615bb_I20200929″ decimals=”INF” id=”id3VybDovL2R…-97e8-f700a530d6e0″ unitRef=”shares”>441228027</dei:EntityCommonStockSharesOutstanding>

    • #194075
      Peter W Reed

      I had forgotten that I looked at the data set file before. Here is a code snippet that could help you. dicAcc is a python dictionary.
      Note the parameter “headers”. Python url libraries put Python as the user-agent. You need to overwrite that, else Edgar will reject your request. I also remember having problems due to the file sizes. I broke the Edgar data file into five segmented files. At the bottom is the code I used to parse the five files (A,B,C,D). compress[“AAPL”] returns Apple’s cik.

      Get file-
      # Get the accession number to Ticker mapping file from Edgar
      targetUrl = “https://www.sec.gov/files/company_tickers.json&#8221;
      r = requests.get(targetUrl, headers=headers)
      dicAcc = r.json()

      headers = {
      “User-Agent”: “Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0”

      Parse file A:
      cik = str(compress[‘AAPL’])
      aaplLst = []
      with open(filePathA, “r”) as f:
      for line in f:
      if re.search(cik, line):

    • #194763
      David Tauriello

      Hi Tim and all on this thread – thanks for a very healthy discussion on this topic. I hope you’ll join our free webinar later this month, Working with SEC Data for Analysts for tips & strategies in working to create normalized views of US GAAP data.

      • #194771
        Tim Bui

        Thank you for inviting, David. The topic of the webinar seems to match with what I am trying to study. I will sign up. Regards,

    • #197110
      Sammeer S Raawat

      Hi Tim Bui (initiator of the thread) and all,

      Following is the list of all TAGS, LAbels and their occurrences in 20210930 for revenues

      In my humble opinion, it is best to download the data available in flat files and then work on it. All the relationships and description of all elements is very well defined and documented.

      RevenueFromContractWithCustomerExcludingAssessedTax Revenue 774
      Revenues Total revenues 742
      Revenues Revenues 480
      RevenueFromContractWithCustomerExcludingAssessedTax Revenues 464
      RevenueFromContractWithCustomerExcludingAssessedTax Total revenue 431
      Revenues Revenue 406
      Revenues Total revenue 394
      RevenueFromContractWithCustomerExcludingAssessedTax Total revenues 355
      RevenueFromContractWithCustomerIncludingAssessedTax Revenue 132
      RevenueFromContractWithCustomerExcludingAssessedTax Net revenues 125
      Revenues Total Revenues 115
      RevenueFromContractWithCustomerExcludingAssessedTax Net revenue 108
      RevenueFromContractWithCustomerExcludingAssessedTax Sales 102
      Revenues Sales 85
      RevenueFromContractWithCustomerIncludingAssessedTax Revenues 83
      Revenues REVENUES 66
      Revenues Net revenues 64
      Revenues REVENUE 62
      Revenues Net revenue 57
      RevenueFromContractWithCustomerIncludingAssessedTax Total revenue 57
      RevenueFromContractWithCustomerIncludingAssessedTax Total revenues 57
      Revenues Total Revenue 46
      Revenues TOTAL REVENUES 43
      RevenueFromContractWithCustomerExcludingAssessedTax REVENUE 41
      ProceedsFromSaleOfAvailableForSaleSecuritiesDebt Sales 38
      RevenueFromContractWithCustomerExcludingAssessedTax REVENUES 36
      RevenueFromContractWithCustomerExcludingAssessedTax Total Revenues 32
      RevenueFromContractWithCustomerExcludingAssessedTax Total Revenue 29
      Revenues TOTAL REVENUE 23
      RevenueFromContractWithCustomerIncludingAssessedTax Net revenues 14
      Revenues NET REVENUES 14
      RevenuesNetOfInterestExpense Net revenue 12
      RevenuesNetOfInterestExpense Net revenues 12
      RevenueFromContractWithCustomerIncludingAssessedTax Net revenue 11
      Sales Sales 11
      RevenueFromContractWithCustomerIncludingAssessedTax REVENUE 10
      RevenueFromContractWithCustomerIncludingAssessedTax Sales 10
      Revenues SALES 10
      Revenues Net Revenue 9
      RevenueFromContractWithCustomerExcludingAssessedTax NET REVENUE 8
      RevenueFromContractWithCustomerExcludingAssessedTax Net Revenues 8
      Revenues Revenues: 8
      RegulatedAndUnregulatedOperatingRevenue Total revenues 8
      RevenueFromContractWithCustomerExcludingAssessedTax Net Revenue 6
      Revenues NET REVENUE 6
      RevenueFromContractWithCustomerExcludingAssessedTax TOTAL REVENUE 6
      RevenueFromContractWithCustomerIncludingAssessedTax Net Revenue 5
      RevenueFromContractWithCustomerIncludingAssessedTax Total Revenue 5
      RevenueFromContractWithCustomerExcludingAssessedTax NET REVENUES 4
      RevenueFromContractWithCustomerIncludingAssessedTax Net Revenues 4
      Revenues Revenue: 4
      RevenueFromContractWithCustomerIncludingAssessedTax REVENUES 4
      SalesRevenue Sales 4
      SellingExpense Sales 4
      RevenueFromContractWithCustomerIncludingAssessedTax TOTAL REVENUES 4
      Revenues REVENUE: 3
      GrossProfit NET REVENUE 2
      GrossProfit Net Revenue 2
      GrossProfit Net revenue 2
      NetRevenues Net revenue 2
      RevenueFromContractWithCustomerIncludingAssessedTax NET REVENUE 2
      GrossProfit Net Revenues 2
      NetFinancingRevenuesExcludingOtherRevenues Net revenues 2
      Revenues Net Revenues 2
      RevenuesNetOfInterestExpense Net Revenues 2
      SalesTypeLeaseRevenue Net revenues 2
      InterestAndDividendIncomeOperating Revenue 2
      PremiumsEarnedNet Revenue 2
      RevenueFromCollaborativeArrangementExcludingRevenueFromContractWithCustomer Revenue 2
      RevenueRevenueAdjustmentsFromContractWithCustomerExcludingAssessedTax Revenue 2
      RevenuesNetOfInterestExpense Revenue 2
      RevenueFromContractWithCustomerExcludingAssessedTax Revenue: 2
      InvestmentBankingRevenue Revenues 2
      OperatingLeaseLeaseIncome Revenues 2
      OperatingLeaseLeaseIncomeLeasePayments Revenues 2
      RevenuesfromOtherParties Revenues 2
      RevenuesIncludingFederalExciseTax Revenues 2
      RevenuesNetOfRealizedAndUnrealizedGainsLossesOnInvestments Revenues 2
      RevenueFromContractWithCustomerExcludingAssessedTax Revenues: 2
      ProceedsFromCustomers Sales 2
      ProceedsFromSaleOfMortgageBackedSecuritiesMbsCategorizedAsTrading Sales 2
      RevenueNotFromContractWithCustomerOther Sales 2
      SaleRevenueNet Sales 2
      SellingAndMarketingExpense Sales 2
      RevenueFromContractWithCustomerExcludingAssessedTax Sales revenue 2
      Revenues Sales revenue 2
      GrossProfit TOTAL REVENUE 2
      RevenueFromCollaborativeArrangementExcludingRevenueFromContractWithCustomer Total revenue 2
      RevenuesNetOfInterestExpense Total revenue 2
      TotalRevenue Total Revenue 2
      TotalRevenue Total revenue 2
      TotalRevenuesIncludingRevenueGeneratedByVariableInterestEntities Total revenue 2
      CostOfRevenue Total revenues 2
      InterestAndFeeIncomeLoansAndLeasesHeldInPortfolio Total revenues 2
      RevenueFromContractWithCustomerExcludingAssessedTax TOTAL REVENUES 2
      RevenueFromContractWithCustomerIncludingAssessedTax Total Revenues 2
      RevenueNotFromContractWithCustomerOther TOTAL REVENUES 2
      RevenuesExcludingInterestAndDividends Total revenues 2
      RevenueFromContractWithCustomerExcludingAssessedTax REVENUES: 1
      ProceedsFromBankLoanParticipations Sales 1
      ProceedsFromLimitedPartnershipInvestments Sales 1
      ProceedsFromSaleAndMaturityOfAvailableForSaleSecurities Sales 1
      ProceedsFromSaleOfEquitySecuritiesFvNi Sales 1
      ProceedsfromSaleofEquitySecuritiesFVNI Sales 1
      ProceedsFromSaleOfEquitySecuritiesWithoutReadilyDeterminableFairValues Sales 1
      ProceedsFromSaleOfEquitySecuritiesWithReadilyDeterminableFairValues Sales 1
      ProceedsFromSaleOfShortTermInvestments Sales 1
      ProceedsFromSaleOfTradingSecuritiesHeldforinvestment Sales 1
      RevenueFromContractWithCustomerIncludingAssessedTax SALES 1




    • #197128
      Tim Bui

      Hi Sammeer, Thank you for sharing the list of tags for the revenue category. I have been trying off and on to create a list of standardized financial items without success because there are soo many variations in the names of the tags, as your list shows 110 ways of describing revenues. I think the accounting authorities allow companies too much flexibility in naming their tags to describe their particular financial situations. The liberal descriptions are useful in describing the uniqueness of items of individual companies , it is extremely difficult to do aggregation for peer comparisons. Individual investors like me who lack sophisticated programming skills still are not able to take advantage of the digitization of financial data.

    • #197135
      Peter W Reed

      I’m in alignment with Tim. I think it has more to do with experience in accounting than programming skills. The material Charles Hoffman has written on use of XBRL is a couple of skill levels above me. After two years of trying I gave up and decided to use the freely available normalized financials available from Yahoo, Morningstar, Macrotrends, Alpha Advantage, etc.. Their information has some flaws for example the reporting dates can be off by days – they tend to use end of month instead of the true date. I believe these outlets use the same source since I see the same flaw show up in multiple providers.

      Good luck!

    • #197148
      Mikko Olkkonen

      Hi, I guess that I am more in alignment with Sammeer’s approach. I am interested in only twenty or so of the main items (Revenue, cashflow from operations, cash, equity, …). Therefore, standardisation in my case is doable with some combination of helper scripts and manual work. My approach leads to information that has smaller scope and lower quality than the information you can find at Yahoo, Morningstar etc. However, licences by Yahoo etc tend to prevent liberal usage of information extracted from their sites. Therefore, crawling the information from their sites is not applicable in my case even if automated extraction from their sites could be technically doable.

      • #197153
        Peter W Reed

        Mikko thanks for your thoughts. The biggest difference is our use of the information. I can appreciated your need for unfettered open-source property. Frankly I’m a little disappointed the SEC hasn’t already done this. My use case is private individual use.

        I don’t know if I shared this. I know I can “easily” get the mapping from gaap terminology to ‘normal’ terminology. Using this mapping to my advantage is another thing. I realized what I created was no better than subscribing to Yahoo. Except Yahoo is less complicated to maintain and cleanse.

        Edgar produces excel-like reports for every 10-K/10-Q. The url below is the base (part 1 + part2). The parameters in the link are easily gained. Where {report} is {R1, R2, R3, …,} R1 contains all of the reports details like report type {10-K, 10-Q}. Give the link a try. Here is a sample for AAPL part2 – 0000320193/000032019321000056/R1.htm

        part 1: sec.gov/Archives/edgar/data/
        part 2: {cik}/{accNoHyph}/{report}.htm

    • #197160
      Tim Bui

      Hi Peter, could you please share the steps that you do to map the gaap tags to your terminology? I need to learn to do that in Postgresql.

      The level or depth of standardized or classified financial items is very much depending on what one tries to seek. For example. I would like to see further breakdown of total debt into corporate debt, mortgages, or capitalizes debt tied to leased assets. While all debts are obligations, the nature of these debt says a lot about the company’s financial risks and financing strategy.

      Thank you. Peter

    • #197257
      Peter W Reed

      I believe you will find this reference of value. Alphavantage is who I decided to use for my personal purposes. I found of particular interest the GAAP Taxonomy Mappings.
      Similar concepts reference material

    • #197258
      Tim Bui

      Hi Peter, when I click on that “Similar concepts reference material” nothing happens. Is it a website?
      I have to come up with my own standardized list, however, I need to learn how to groups the tags so that I can assign to the standardized items in Postgresql. I am just looking for a methodology on how to do mapping of tags to standardized item. Thank you, Peter

    • #197259
      Tim Bui
    • #197260
      Peter W Reed

      Thanks Tim. An excellent reference too.

      To respond to your reply earlier this week. Here is a short explanation for my approach to getting terminology mapping and 10-Q/K results. I hope to get the code clean enough to post on a public site, github.

      The first part of all the URLs below is “www.sec.gov/”. My approach requires 4 steps. Once programmed it isn’t as ugly as the description reads I hope.

      1) “cgi-bin/browse-edgar?action=getcompany&CIK={ticker}&type={Report-Type}”

      2) Scrape the web page from step #1 for the CIK and Accession Number(s). Replace the hyphens in the Accession Number with null, “”. The result I call “accNoHyph”

      3) The general syntax of the url for the company’s reports is: “Archives/edgar/data/{cik}/{accNoHyph}/{report}.htm”.
      a) For example, Victory Secret 10-Q Statement of Equity is r = “Archives/edgar/data/1856437/000185643721000018/R6.htm”

      4) Everything else is accomplished using BeautifulSoup. I assume you have familiarity with that module.
      a) first action after downloading page in (3) is: td_find = r.find_all(“td”, {“class”: “pl”})

      b) Next search and match with (i) re.search(re.compile(r’defref_us-gaap_(\w*)’),str(x)):. Then (ii) match = re.search(re.compile(r’defref_us-gaap_(\w*)’),str(x)). The group() function for re is used to get the “us-gaap” tag.

      exception to (b) The attributes on the cover page (R1) in XBRL terminology is “dei”. All other reports {R2-R7} contain “us-gaap” attributes.

      c) A sample result for (4b) symbol VSCO is below, note both concept terminology and XBRL terminology for this element is present:

      Total Comprehensive Income (Loss)

      A child tag has the result for “Total Comprehensive Income (Loss)” which is (358).

      Notes for each numbered item above. Remember “www.sec.gov/” is the prefix:

      1) Victoria Secret URL is “/cgi-bin/browse-edgar?action=getcompany&CIK=VSCO&type=10-Q”

      2) Scrape the web page from step #1 for the CIK and Accession Number(s). Replace the hyphens in the Accession Number with null, “”. The result I call “accNoHyph”. The Accession numbers reference a specific quarter or annual report.

      3) Victory Secret report on Statement of Equity is “Archives/edgar/data/1856437/000185643721000018/R6.htm”
      The {report} attribute is R1 to R7. R1 is the 10Q/K cover page. The remaining reports {R2-R7} are the cash flow, balance, and income statements. When you process R1 you need to modify the search in step 4 from “us-gaap” to “us-dei”.

      4) When you process R1 you need to modify the search in step 4 from “us-gaap” to “dei”. “dei” is the XBRL standard tag prefix for company info, e.g. “dei_DocumentPeriodEndDate”.

      5) The header row returned in step 4 contains the name of the report, for example “… Income Statement…”. And very importantly the phrase ‘in millions’. Two styles of numbers are used those with and those without the millions trailing six zeros.

    • #197261
      Peter W Reed

      Apologize for a broken link. I hope this works better. This is reference material posted on the Alphavantage website.

      You might also find their Academy of particular interest on AI, Django, and other topics of interest.

    • #197262
      Tim Bui

      Thank you for the detailed explanation, Peter! I am going through the steps now so learn now. I am grateful!

Viewing 46 reply threads
  • You must be logged in to reply to this topic.


Upcoming XBRL US Events

Center for Data Quality Committee Meeting
Wednesday, June 29, 2022

Domain Steering Committee Meeting
Tuesday, July 19, 2022

Communications & Services Steering Committee Meeting
Tuesday, July 19, 2022

Public Review: Annual Comprehensive Financial Report Taxonomy
Monday, August 15, 2022